Computational Learning Approaches to Data Analytics in Biomedical Applications [1 ed.] 0128144823, 9780128144824

Computational Learning Approaches to Data Analytics in Biomedical Applications provides a unified framework for biomedical …


English Pages 220 [299] Year 2019



Table of contents :
Cover
Computational Learning Approaches to Data Analytics in Biomedical Applications
Copyright
Preface and Acknowledgements
1 - Introduction
References
2 - Data preprocessing
2.1 Introduction
2.2 Data preparation
2.2.1 Initial cleansing
2.2.2 Data imputation and missing values algorithms
2.2.2.1 Removal Methods
2.2.2.2 Utilization methods
2.2.2.3 Maximum likelihood
2.2.3 Imputation methods
2.2.3.1 Single imputation methods
2.2.3.1.1 Mean imputation
2.2.3.1.2 Substitution of related observations
2.2.3.1.3 Random selection
2.2.3.1.4 Weighted K-nearest neighbors (KNN) imputation
2.2.3.2 Multiple imputation
2.2.4 Feature enumeration
2.2.4.1 Special cases of categorical data representation using COBRIT traumatic brain injury data as an example
2.2.5 Detecting and removing redundant features
2.2.5.1 Pearson correlation
2.2.5.2 Spearman correlation
2.2.6 Recoding categorical features
2.2.7 Outlier detection
2.2.8 Normalization
2.2.9 Domain experts
2.2.10 Feature selection and extraction
2.3 Example
2.4 Summary
References
3 - Clustering algorithms
3.1 Introduction
3.2 Proximity measures
3.3 Clustering algorithms
3.3.1 Hierarchical clustering
3.3.2 Density-based clustering
3.3.3 Subspace clustering
3.3.3.1 Basic subspace clustering
3.3.3.1.1 Grid-based subspace clustering
3.3.3.1.2 Window-based subspace clustering
3.3.3.1.3 Density-based subspace clustering
3.3.3.2 Advanced subspace clustering
3.3.3.2.1 3D subspace clustering
3.3.4 Squared error-based clustering
3.3.5 Fuzzy clustering
3.3.6 Evolutionary computational technology-based clustering
3.3.7 Neural network–based clustering
3.3.8 Kernel learning-based clustering
3.3.9 Large-scale data clustering
3.3.10 High-dimensional data clustering
3.3.11 Sequential data clustering
3.3.11.1 Proximity-based sequence clustering
3.3.11.2 Feature-based sequence clustering
3.3.11.3 Model-based sequence clustering
3.4 Adaptive resonance theory
3.4.1 Fuzzy ART
3.4.2 Fuzzy ARTMAP
3.4.3 BARTMAP
3.5 Summary
References
4 - Selected approaches to supervised learning
4.1 Backpropagation and related approaches
4.1.1 Backpropagation
4.1.2 Backpropagation through time
4.2 Recurrent neural networks
4.3 Long short-term memory
4.4 Convolutional neural networks and deep learning
4.4.1 Structure of convolutional neural network
4.4.2 Deep belief networks
4.4.3 Variational autoencoders
4.5 Random forest, classification and Regression Tree, and related approaches
4.6 Summary
References
5 - Statistical analysis tools
5.1 Introduction
5.2 Tools for determining an appropriate analysis
5.3 Statistical applications in cluster analysis
5.3.1 Cluster evaluation tools: analyzing individual features
5.3.1.1 Hypothesis testing and the 2-sample t-test
5.3.1.2 Summary of hypothesis testing steps and application to clustering
5.3.1.3 One-way ANOVA
5.3.1.4 χ2 test for independence
5.3.2 Cluster evaluation tools: multivariate analysis of features
5.4 Software tools and examples
5.4.1 Statistical software tools
5.4.1.1 Example: clustering autism spectrum disorder phenotypes
5.4.1.2 Correlation analysis
5.4.1.3 Cluster evaluation of individual features
5.4.1.4 Summary of results
5.5 Summary
References
6 - Genomic data analysis
6.1 Introduction
6.2 DNA methylation
6.2.1 Introduction
6.2.2 DNA methylation technology
6.2.3 DNA methylation analysis
6.2.4 Clustering applications for DNA methylation data
6.3 SNP analysis
6.3.1 Association studies
6.3.2 Clustering with family-based association test (FBAT) analysis
6.3.2.1 Quality control filtering
6.3.2.2 Family-based association testing
6.3.2.3 Multiple testing
6.3.2.4 Adjustments for small sample size
6.3.2.5 Implementation and analysis of results
6.4 Biclustering for gene expression data analysis
6.4.1 Introduction to biclustering
6.4.2 Commonly used biclustering methods
6.4.3 Evolutionary-based biclustering methods
6.4.4 BARTMAP: a neural network-based biclustering algorithm
6.4.5 External and internal validation metrics related to biclustering
6.5 Summary
References
7 - Evaluation of cluster validation metrics
7.1 Introduction
7.2 Related works
7.3 Background
7.3.1 Commonly used internal validation indices
7.3.2 External validation indices
7.3.3 Statistical methods
7.4 Evaluation framework
7.5 Experimental results and analysis
7.6 Ensemble validation paradigm
7.7 Summary
References
8 - Data visualization
8.1 Introduction
8.2 Dimensionality reduction methods
8.2.1 Linear projection algorithms
8.2.1.1 Principal component analysis
8.2.1.2 Independent component analysis
8.2.2 Nonlinear projection algorithms
8.2.2.1 Isomap
8.2.2.2 T-Distributed Stochastic Neighbor Embedding (t-SNE)
8.2.2.3 LargeVis
8.2.2.4 Self-organizing maps
8.2.2.5 Visualization of commonly used biomedical data sets from the UCI machine learning repository ()
8.3 Topological data analysis
8.4 Visualization for neural network architectures
8.5 Summary
References
9 - Data analysis and machine learning tools in MATLAB and Python
9.1 Introduction
9.2 Importing data
9.2.1 Reading data in MATLAB
9.2.1.1 Interactive import function
9.2.1.2 Reading data as formatted tables
9.2.1.3 Reading data as cellular arrays
9.2.1.4 Reading data as numerical arrays and matrices
9.2.1.4.1 xlsread function
9.2.1.4.2 Functions for reading in data from text files as numerical arrays/matrices
9.2.1.4.3 Reading images in MATLAB
9.2.2 Reading data in Python
9.2.2.1 Overview of external libraries and modules for Python
9.2.2.2 Opening files in Python
9.2.2.2.1 Reading text files in Python
9.2.2.2.2 read_csv() function
9.2.2.2.3 Other read functions
9.2.3 Handling big data in MATLAB
9.2.3.1 How to create data stores in MATLAB
9.2.3.1.1 read function
9.2.3.1.2 readall function
9.2.3.1.3 hasdata function
9.2.3.1.4 partition function
9.2.3.1.5 numpartitions function
9.2.3.2 Tall arrays
9.3 Data preprocessing
9.3.1 Missing values handling
9.3.1.1 Handling missing values during reading
9.3.1.2 Finding and replacing missing values
9.3.2 Normalization
9.3.2.1 z-score
9.3.3 Outliers detection
9.4 Tools and functions for implementing machine learning algorithms
9.4.1 Clustering
9.4.1.1 k-means
9.4.1.2 Gaussian mixture model
9.4.1.3 Hierarchical clustering
9.4.1.4 Self-organizing map
9.4.2 Prediction and classification
9.4.2.1 Machine learning workflow
9.4.2.1.1 Data Preparation
9.4.2.1.2 Fitting and predicting tools
9.4.2.2 Multiclass support vector machines
9.4.2.3 Neural network classifier
9.4.2.4 Performance evaluation and cross-validation tools
9.4.3 Features reduction and features selection tools in MATLAB
9.4.3.1 Built-in feature selection method
9.4.3.2 Sequential features selection
9.4.4 Features reduction and features selection tools in Python
9.4.4.1 Removing features with low variance
9.4.4.2 Recursive feature elimination
9.5 Visualization
9.5.1 Multidimensional scaling
9.5.1.1 Pairwise distance calculation function pdist
9.5.1.2 Perform multidimensional scaling
9.5.2 Principal component analysis
9.5.3 Visualization functions
9.6 Clusters and classification evaluation functions
9.6.1 Cluster evaluation
9.6.2 Classification models evaluation
9.6.2.1 Confusion matrix confusionmat
9.7 Summary
References
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Back Cover


Computational Learning Approaches to Data Analytics in Biomedical Applications Khalid K. Al-jabery Deputy Chief Engineer, Department of Information Systems & Technology, Basrah Oil Company, Basrah, Iraq

Tayo Obafemi-Ajayi Assistant Professor of Electrical Engineering, Engineering Program, Missouri State University, Springfield, MO, United States

Gayla R. Olbricht Associate Professor of Statistics, Department of Mathematics and Statistics, Missouri University of Science and Technology, Rolla, MO, United States

Donald C. Wunsch II Mary K. Finley Missouri Distinguished Professor of Computer Engineering, Director, Applied Computational Intelligence Laboratory, Missouri University of Science and Technology, Rolla, MO, United States

Academic Press is an imprint of Elsevier 125 London Wall, London EC2Y 5AS, United Kingdom 525 B Street, Suite 1650, San Diego, CA 92101, United States 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom Copyright © 2020 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-12-814482-4 For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner Acquisition Editor: Chris Katsaropoulos Editorial Project Manager: Mariana L. Kuhl Production Project Manager: Nirmala Arumugam Cover Designer: Christian J. Bilbow Typeset by TNQ Technologies

Preface and Acknowledgements
The idea for Computational Intelligence Approaches to Data Analytics in Biomedical Applications came up when we presented a tutorial by the same name at the IEEE Engineering in Medicine and Biology Conference. We recommend this excellent conference. The tutorial was well-attended, and even before the conference, multiple publishers had contacted us requesting a book on the subject. After some discussions, we chose the thorough, attentive team from Elsevier. We have been very pleased with their continued attention and encouragement throughout this process. It is no surprise that interest is high in this topic. The aging population, opportunities for personalized medicine, hopes for curing stubborn health problems and many other factors fuel the interest in data-driven solutions. What sets fire to this fuel are the new advances in computational intelligence capabilities, together with improvements in the computing hardware they depend on. We hope this text will provide researchers with a launching pad to explore the use of these and related techniques for their own projects. No effort of this scale can be accomplished without significant support. In addition to Elsevier's staff members, we are grateful to Missouri University of Science and Technology (Missouri S & T) and Missouri State University (MSU) for providing an environment conducive to our research. We appreciate the related contributions of the many students of the Applied Computational Intelligence Laboratory (Missouri S & T), the Department of Mathematics and Statistics (Missouri S & T), and the Computational Learning Systems Lab (MSU) over the years. In addition, book-writing depended on the patience of our families; for the nights and weekends spent writing, we thank them. We appreciate obtaining access to the Simons Simplex Collection data analyzed in Chapter 5 on SFARI Base: www.sfari.org. Furthermore, this has only been possible with financial support. Partial support for this research was received from the Missouri University of Science and Technology Intelligent Systems Center, the Mary K. Finley Missouri Endowment, the National Science Foundation, the Lifelong Learning Machines program from DARPA/Microsystems Technology Office, and the United States Army Research Laboratory (ARL); and it was accomplished under Cooperative Agreement Number W911NF-18-2-0260. Research was also sponsored by the Leonard Wood Institute in cooperation with the ARL and was accomplished under Cooperative Agreement Number W911NF-14-2-0034. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Leonard Wood Institute, the ARL, or the United States Government. The United States Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.


1 Introduction Data Analytics has become a transformational technology, fueled by advances in computer and sensor hardware, storage, cloud computing, pervasive data collection, and advances in algorithms, particularly machine learning. No single one of these would have fueled the advance nearly so much as their convergence has. This fundamentally changed the way many industries do business. Data analytics is even revolutionizing industries that were not formerly dominated by data, such as transportation. In biomedical applications, although data does not play the central role, learning from it is playing an increasingly important one. The ability to visualize data has arguably been one of the most curative improvements in the history of medicine. But this visualization had been dominated by a human. Similar comments may be made about other tasks created by biomedical data. “Manual,” that is, human-dominated, data analysis is no longer sustainable. These tasks will increasingly be delegated to computers. The techniques in this book are a large part of the reason for this shift. It is increasingly becoming possible, even necessary, for machines to do much of the analytics for us. The sheer growth in the volume of data is one reason for this. For example, healthcare data were predicted to increase from approximately 500 petabytes in 2012 to 25,000 petabytes by 2020 (Roski, Bo-Linn, & Andrews, 2014). This represents a faster increase than Moore’s Law, even as that helpful phenomenon ends (Chien & Karamcheti, 2013; Theis & Wong, 2017). The situation would be hopeless were it not for increased automation in analysis. Fortunately, automated analysis has improved dramatically. Some approaches were previously difficult because of the computational complexity of solving for parameters of large systems. The increased amount of data, although a challenge, can actually be helpful. Some of the most sophisticated techniques demand huge datasets because without them, systems with many parameters are subject to overfitting. Having enough data would have been cold comfort when most of these techniques were invented due to the high computational cost of solving for the parameters. The dramatic advances in high-performance computing have mitigated this impediment. This began long ago when the availability of parallel computing tools accelerated with expanded use of Graphics Processing Units. If you see a person addicted to electronic games, take a moment to thank him or her. That industry has grown larger than the movie and music industry combined (Nath, 2016). At the time, this was much larger than the market for high-performance computing and was a major driver for cost reductions. The visibility of Computational Intelligence methods has increased to the point that our industry has legs of its own, but the transformational cost reductions originated in the gaming industry, which continues to be an important driver of innovation.




However, not all of the techniques we cover are computationally intensive. Some of them are linear or log-linear in computational complexity, that is, very fast. These are suitable for embedded applications such as next-generation smart sensors, wearable and monitoring technology, and much more. These include selected neural network and statistical clustering techniques. Some of these techniques are classics, which makes sense, since earlier computing techniques were limited in performance. The speedier of the recent innovations described herein also offer surprising computational efficiency. A related issue is the emergence of non-volatile memory technology (Bourzac, 2017; Merrikh-Bayat, Shouraki, & Rohani, 2011). This will lower the cost point and, more importantly, the energy consumption of embedded computing solutions. Much more of the burden of computation can now be shifted to cheaper, energy-efficient memory. Memory-intensive solutions to biomedical problems will therefore become much more pervasive. This alone would be a game-changer, but there's more. This technology also offers an opportunity to directly implement learning algorithms on almost any device or object (Versace, Kozma, & Wunsch, 2012). For the smallest, cheapest devices, these may tend to be the simpler algorithms, for example, a neural network with a very small number of layers and/or neurons. However, this simplicity will be balanced by the sheer number of devices or objects which can benefit from an embedded neural network. In the future, it will be much easier to determine whether medications have been taken, whether objects have been used, patterns of usage of just about anything, and much more. As valuable as these innovations will be, the use of such information in aggregate offers even greater potential. Unleashing the potential of the resulting massive datasets will be enhanced by the ability to process from "the edge," that is, close to the point where data is generated. Without this capability, even the most powerful systems will choke on the increasing volume of data. Nevertheless, methods of extraordinary computational complexity are no longer prohibitively difficult. The advances in low-cost, high-performance computing power are only part of the solution. As the business potential of improved computational intelligence capabilities has become obvious, more investment has followed the opportunity. The willingness of major corporations (such as Google (D'Orbino, 2017)) to invest billions in this technology has enabled brute-force solutions to even the most intractable and difficult problems (Silver et al., 2016). It has become increasingly apparent that the companies that win the computational intelligence competition will dominate their industries. Almost no investment is too large to achieve that objective. This includes high-performance computers and even application-specific integrated circuits optimized for machine learning applications (Jouppi, 2016). The resources that can be targeted at a problem are limited only by the importance of the problem. This is an especially significant opportunity for medical applications. Another crucial factor is that the human learning curve has improved. Many of the techniques we describe formerly required experts to implement them. Now, many tools have been devised that allow rapid prototyping or even complete solutions by relatively new practitioners. This trend is accelerating. For example, TensorFlow, Python machine learning libraries, Weka, MATLAB toolboxes, open-source repositories at many research labs, and several other products enable more practitioners to implement algorithms and techniques.


Moreover, these enhanced software and hardware capabilities are constantly becoming more widely available. Some of the tools can be tried out by anyone for free. Others are available for free or at modest cost depending on the novelty and importance of the application. Many companies have announced ambitious plans to expand widespread availability of their tools. Expensive as this may seem, it is actually an astute business decision, for it provides the opportunity for companies to establish their software and hardware as de facto standards for large classes of applications. This will remove what was once a massive barrier to entry for new innovations in the field, while simultaneously creating barriers to entry for corporate competitors. The result of these factors is that almost no area of medicine will be untouched by the extraordinary changes described here. This will create heretofore unimagined opportunities to prevent, cure or mitigate diseases. Familiarity with the technology described herein will help tremendously to bring about this desirable outcome. A brief comment on terminology is appropriate here. The topics discussed in this book are highly multidisciplinary, and each discipline uses its own terminology. Different disciplines sometimes use the same word for different meanings, subtle or significant. For most of this book we will discuss terminology as it occurs. However, we will differentiate here between variables and features. In this book, we consider a variable to be anything that can vary. Therefore, the value of an observation, a vector, or a component of a vector can all be considered variables. A feature is considered to be something that has been processed or decided on in some way. It can be a single variable, a set of variables, or even a function of some variables. The main thing is that a human or algorithm should have selected it, in order for it to be called a feature. Certain fields use these terms differently, but this is how we will use them in this book. For most of this book's chapters, a large and growing body of literature exists. Therefore, none of these chapters is intended to be comprehensive. Rather, they are illustrative, and together they give the reader a perspective on the range of available tools. Much of the best work in this field is collaborative. An appreciation of these methods will allow the reader to become a more effective collaborator with domain experts or computational intelligence practitioners. The remaining chapters in this book are organized as follows. Chapter two presents a general framework for data curation. It covers the different phases of data preprocessing and preparation. The framework fits a broad variety of datasets. This chapter provides a detailed overview of the most popular algorithms and techniques for data curation, imputation, feature extraction, and correlation analysis, and the practical application of these algorithms. We also provide techniques that have been developed from our experience in data processing. At the end of Chapter two, we present a practical example showing the effect of using different imputation methods on the performance and efficiency of Support Vector Machines. The chapter describes a methodology for converting raw and messy data into a well-organized data set that is ready for applying high-level machine learning algorithms or any advanced methods of data analysis.


Chapter three is an overview of clustering algorithms. See (Xu & Wunsch, 2005) for a survey, (Xu & Wunsch, 2010) for a biomedical engineering survey, or (Xu & Wunsch, 2009) and (Kumar, Bezdek, Rajasegarar, Leckie, & Palaniswami, 2017) for a more thorough treatment of the topic. The literature is constantly growing in this important field, and the chapter will provide a roadmap. This chapter reviews proximity measures and explores popular clustering algorithms. Through the sections of Chapter three, the reader will be introduced to many clustering algorithms, exploring their origins, math, and applications. The last section covers adaptive resonance theory and its different algorithms. For a further synopsis of adaptive resonance theory-inspired algorithms, see (Brito da Silva, Elnabarawy, & Wunsch, 2019). In order to have them in the same place, this chapter's summary of adaptive resonance theory algorithms includes a supervised version, Fuzzy ARTMAP. The main coverage of supervised learning is in the following chapter. Chapter four provides a selected review of supervised learning topics useful in bioinformatics. It begins with fundamentals, then illustrates a few popular methods. We hope that these will stir the reader's interest, and we provide the referenced works for further exploration. Chapter four starts with an overview of the most widely used approach, backpropagation, including backpropagation through time and stochastic gradient descent. The following sections cover recurrent neural networks and approaches for their training (especially the Extended Kalman Filter), Long Short-Term Memory, Convolutional Neural Networks and Deep Learning, Random Forest, Classification and Regression Trees, and related approaches. Chapter five presents several statistics concepts that are useful when analyzing biomedical data. These focus on evaluating feature importance after clustering, to improve interpretation. Foundational terminology and concepts are introduced to connect statistical data analysis tools with their application in cluster evaluation. The chapter provides a road map for choosing the appropriate statistical method and illustrates how statistical analysis can be useful in clustering applications. Chapter six reviews several state-of-the-art methods for the analysis of three types of genomic data: DNA methylation, genotype (using single nucleotide polymorphisms (SNPs)) and gene expression. It also highlights the fundamental questions that these techniques attempt to address, given the rapid evolution of the technologies and the increasingly diverse types of genetic data available for computational analysis. Chapter seven discusses various evaluation metrics applied in unsupervised learning applications, commonly known as cluster validation indices (CVIs). The work presented in this chapter is primarily targeted at utilizing the correlation between internal and external CVIs to assess the performance of internal CVIs specifically for biomedical data analysis. It evaluates six clustering algorithms on fourteen real biological datasets. Additionally, three different external CVIs are considered. The evaluation framework incorporates rigorous statistical techniques supported by experimental results and analysis.


Chapter eight reviews fundamental visualization techniques and current state-of-the-art methods. It also provides some examples of these methods on data sets drawn from the University of California - Irvine (UCI) machine learning repository (Blake & Merz, 1998) to help users choose which technique is most appropriate for their application. Chapter nine discusses the popular machine learning (ML) and data processing tools and functions in MATLAB and Python. It reviews available functions and methods (particularly in MATLAB) to implement and demonstrate biomedical data analysis techniques. This is achieved by traversing the entire path of data analysis, from the initial loading and reading in of the raw data to the advanced implementation of ML algorithms and the final phase of knowledge inference. The various examples discussed in this chapter are intended to provide the reader with the basic knowledge and understanding of the implementation of vital functions and modules for data analysis. Chapter nine is designed for those who are at beginner or intermediate levels of programming with MATLAB and/or Python. This chapter uses coded examples to discuss the necessary visualization functions and libraries in MATLAB and Python. The final section of this chapter is dedicated to the evaluation process for clustering and classification models. This chapter includes an online code repository that contains all the code written in the chapter plus additional examples and functions written in MATLAB and Python.

References
Blake, C. L., & Merz, C. J. (1998). UCI Repository of machine learning databases. University of California.
Bourzac, K. (2017). Has Intel created a universal memory technology? [News]. IEEE Spectrum, 54(5), 9-10. https://doi.org/10.1109/MSPEC.2017.7906883.
Brito da Silva, L. E., Elnabarawy, I., & Wunsch, D. C. (2019). A survey of adaptive resonance theory neural network models for engineering applications. Neural Networks, 1-43.
Chien, A. A., & Karamcheti, V. (2013). Moore's Law: The first ending and a new beginning. Computer, 46(12), 48-53. https://doi.org/10.1109/MC.2013.431.
D'Orbino, L. (2017). Battle of the brains: Google leads in the race to dominate artificial intelligence. Retrieved from The Economist website https://www.economist.com/news/business/21732125-techgiants-are-investing-billions-transformative-technology-google-leads-race.
Jouppi, N. (2016). Google supercharges machine learning tasks with TPU custom chip. Retrieved from Google website https://cloud.google.com/blog/products/gcp/google-supercharges-machinelearning-tasks-with-custom-chip.
Kumar, D., Bezdek, J. C., Rajasegarar, S., Leckie, C., & Palaniswami, M. (2017). A visual-numeric approach to clustering and anomaly detection for trajectory data. Visual Computer, 33(3), 265-281. https://doi.org/10.1007/s00371-015-1192-x.
Merrikh-Bayat, F., Shouraki, S. B., & Rohani, A. (2011). Memristor crossbar-based hardware implementation of the IDS method. IEEE Transactions on Fuzzy Systems, 19(6), 1083-1096. https://doi.org/10.1109/TFUZZ.2011.2160024.
Nath, T. (2016). Investing in video games: This industry pulls in more revenue than movies, music. Retrieved from Nasdaq website http://www.nasdaq.com/article/investing-in-video-games-thisindustry-pulls-in-more-revenue-than-movies-music-cm634585.


Roski, J., Bo-Linn, G. W., & Andrews, T. A. (2014). Creating value in health care through big data: Opportunities and policy implications. Health Affairs, 33(7), 1115-1122. https://doi.org/10.1377/hlthaff.2014.0147.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489. https://doi.org/10.1038/nature16961.
Theis, T. N., & Wong, H.-S. P. (2017). The end of Moore's Law: A new beginning for information technology. Computing in Science and Engineering, 19(2), 41-50. https://doi.org/10.1109/MCSE.2017.29.
Versace, M., Kozma, R. T., & Wunsch, D. C. (2012). Adaptive resonance theory design in mixed memristive-fuzzy hardware. In Advances in neuromorphic memristor science and applications (pp. 133-153). https://doi.org/10.1007/978-94-007-4491-2_9.
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, 645-678. https://doi.org/10.1109/TNN.2005.845141.
Xu, R., & Wunsch, D. C. (2009). Clustering. In Computational intelligence. https://doi.org/10.1002/9780470382776.
Xu, R., & Wunsch, D. C. (2010). Clustering algorithms in biomedical research: A review. IEEE Reviews in Biomedical Engineering, 3, 120-154. https://doi.org/10.1109/RBME.2010.2083647.

2 Data preprocessing

2.1 Introduction
Necessity is the mother of invention. [Plato]

Computational intelligence needs clean, numeric, homogeneous, well organized, and normalized data. This motivates invention of the tools and techniques in the following sections. Various types of data require different preprocessing approaches. The methods presented provide information on the data curation process to prepare data for machine learning. Data quality is a major concern in big data processing and knowledge management systems. The data preprocessing phase (Pyle, 1999) includes, but is not limited to, converting text, symbols, and characters to numeric values, data imputation, and data cleansing. In this chapter, a general automated approach for data preparation is provided.

2.2 Data preparation
The design and criteria for data cleansing depend on the data. It is very important to select the right curation process. Fig. 2.1 illustrates a general approach to data curation. The presented algorithm consists of the following phases:

2.2.1 Initial cleansing

In this phase, we detect and remove any sample (i.e., subject) or variable that has a high ratio of corrupted values. Typically, we set a threshold for missing values, and any subject or variable containing more erroneous values than the prespecified threshold is removed. This method has been applied effectively in (Voth-Gaeddert, Al-Jabery, Olbricht, Wunsch II, & Oerther, 2018).
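As a minimal illustration of this phase (our own sketch, assuming a pandas DataFrame and an arbitrary 20% threshold chosen only for the example), samples and variables whose fraction of missing entries exceeds the threshold can be dropped as follows:

import pandas as pd

def initial_cleansing(df: pd.DataFrame, max_missing_ratio: float = 0.2) -> pd.DataFrame:
    """Drop samples (rows) and variables (columns) whose share of
    missing values exceeds the chosen threshold."""
    keep_rows = df.isna().mean(axis=1) <= max_missing_ratio   # per-sample missing ratio
    keep_cols = df.isna().mean(axis=0) <= max_missing_ratio   # per-variable missing ratio
    return df.loc[keep_rows, keep_cols]

In practice, the chosen threshold and the list of removed samples and variables should be reviewed with domain experts, as discussed later in this chapter.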

2.2.2 Data imputation and missing values algorithms

Data imputation algorithms have been a popular research topic (Allison, 2012, pp. 1-21; Cahsai, Anagnostopoulos, & Triantafillou, 2015; Efron, 1994; Kantardzic, 2011; King, Honaker, Joseph, & Scheve, 2001; Li, Deogun, Spaulding, & Shuart, 2004, pp. 573-579; Rubin, 1996; Schafer & Schenker, 2000; Stephens & Scheet, 2005; Wasito & Mirkin, 2005; Zhou, Zhou, Lui, & Ding, 2014).



FIG. 2.1 General data preparation paradigm.

One relevant problem in data quality is the presence of missing values (MVs). The MV problem should be carefully addressed; otherwise, bias might be introduced into the induced knowledge. Common solutions to the MV problem are to either fill in the MVs (imputation) or ignore/exclude them. Imputation entails a MV substitution algorithm that replaces MVs in a dataset with some plausible values. Imputed data can be used alongside the observed data, but they are only as good as the assumptions used to create them.


Since the imputed data are mostly distilled from the fully observed data, any discrepancy in the observed data, or the use of inefficient assumptions (i.e., algorithms), will lead to erroneous data. On the other hand, many computational intelligence and machine learning (ML) techniques (such as neural networks and support vector machines) fail if their inputs contain MVs. Almost all real datasets contain missing values. These arise for many reasons: refusal to answer questions in the case of humanitarian statistics or surveys, sensor malfunctions, power surges, or bugs in the digitizing process. Combinations of these issues can cause data to be mixed-type, multivalued, or missing (Lam, Wei, & Wunsch, 2015). Handling missing values is vital to data curation and consequently to the experimental results, since the way missing values are treated determines the final shape of the dataset. Regardless of the missing values, the final goal is to reach useful and efficient inference about the population of interest. The general methods for handling MVs are discussed in this chapter, along with some advanced methods that are mentioned briefly. In general, there are three types of missing values algorithms (MVAs): removal methods, in which observations (or in some cases variables) with MVs are removed; utilization methods, which use all the available data to cure the dataset of MVs; and imputation methods, which attempt to replace or fill in missing data.

2.2.2.1 Removal Methods
These are the simplest methods, since there is no need for complicated or advanced processing. The algorithms simply disregard any observation that has MVs. These approaches may lead to biased results, but there are some attempts to avoid that problem. Removing the samples or features that contain many missing values is also an option; however, this could significantly degrade and waste the dataset if not managed properly. MVAs that remove samples containing MVs are known as complete-case or list-wise deletion methods. These algorithms are the easiest to implement, and full data analysis can be applied without modification. Most data processing software packages use these approaches by default. Unfortunately, these methods suffer from bias and loss of precision, which results from the fact that the sample of fully observed cases may not represent the full dataset. A technique to address this problem is the weighted complete-case approach (Zhou et al., 2014), in which the sample is reweighted to make it more representative. For example, if the response rate is twice as high among men as among women, data from each man in the sample could receive a weight of 2 in order to make the data more representative. This strategy is mostly used in surveys. The method first creates a model to predict nonresponse in one variable as a function of the rest of the variables in the data. The inverse of the predicted probabilities of response from this model can then be used as survey weights to make the complete-case sample more representative. This method is most efficient in the case of one missing value; it becomes more complicated for more than one MV, and there will be problems with standard errors if the predicted response probabilities are close to 0 (Cahsai et al., 2015; Zhou et al., 2014).


One can also simply remove the samples or features (i.e., variables) that have many missing values in a large dataset. "Many" here means more than a specific threshold. After specifying a threshold of missing values for samples and for features, the dataset is scanned horizontally and vertically, respectively, so that any sample with more missing values than the threshold is removed; the same approach can be followed with features. Keep in mind that it is imperative not to sacrifice important features, so domain experts must be consulted before removing any feature. The code must also keep track of all samples and features that have been removed or modified during the cleaning process. It is better to use replacement methods with small datasets and avoid removing any samples from them. Dropping variables with a large proportion of missing values from the analysis is not recommended, since it may exclude important variables in the regression model required for causal interpretation and may lead to bias and unnecessarily large standard errors. For important variables with more than trivial amounts of missing data, the explicit model-based methods in the next sections are suggested.
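The weighted complete-case idea described above can be sketched as follows, assuming scikit-learn is available, that a single variable has missing entries, and that the remaining covariates are fully observed and numeric; the function and variable names are illustrative, not from the book.

import pandas as pd
from sklearn.linear_model import LogisticRegression

def complete_case_weights(df: pd.DataFrame, target: str, covariates: list) -> pd.Series:
    """Inverse-probability-of-response weights for a weighted complete-case analysis."""
    responded = df[target].notna().astype(int)            # 1 = observed, 0 = missing
    model = LogisticRegression(max_iter=1000)
    model.fit(df[covariates], responded)                   # model response as a function of the covariates
    p_respond = model.predict_proba(df[covariates])[:, 1]  # predicted probability of responding
    weights = pd.Series(1.0 / p_respond, index=df.index)
    return weights[responded == 1]                         # weights for the complete cases only

The complete cases, weighted by the returned values, can then be analyzed with any weighted estimator; as noted above, the weights become unstable when the predicted response probabilities approach zero.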

2.2.2.2 Utilization methods
The following methods/algorithms utilize information from both completely and partially observed cases. For example, in repeated measures data, cases with incomplete data at one time point can still provide valid information about the relationship between variables at other time points. Mixed effects models are therefore a popular choice for dealing with longitudinal data with missing outcomes. Because the available-case methods make use of more information than the complete-case methods, they are generally better at correcting for bias when data are missing at random. The terms "missing at random" and "missing completely at random" describe assumptions about missing data that are needed for standard implementations of multiple imputation, but the meanings of these terms are often confused. When observations of a variable are missing completely at random, the missing observations are a random subset of all observations; the missing and observed values will have similar distributions. Missing at random means there might be systematic differences between the missing and observed values, but these can be entirely explained by other observed variables. For example, if blood pressure data are missing at random, conditional on age and sex, then the distributions of missing and observed blood pressures will be similar among people of the same age and sex (e.g., within age/sex strata) (Bhaskaran & Smeeth, 2014). This approach suffers from the fact that different analyses would lead to different subsets of the data, based on the variables used in the analysis process and the pattern of their MVs, which affects the final inference. These methods also suffer from a loss of efficiency resulting from disregarding partially observed variables, and they may be biased, especially if the cases with MVs differ systematically from the fully observed cases.


The utilization methods include, but are not limited to, the following algorithms:

2.2.2.3 Maximum likelihood
The first step in maximum likelihood estimation (Allison, 2012, pp. 1-21) is to construct the likelihood function. Suppose that we have n independent case "observations" (i = 1, ..., n) on k variables (y_{i1}, y_{i2}, ..., y_{ik}). The fundamental idea behind maximum likelihood methods is conveyed by their name: find the values of the parameters that are most probable, or most likely, for the data that have actually been observed:

L = \prod_{i=1}^{n} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta)

where f_i(\cdot) is the joint probability (or probability density) function for observation i, and \theta is a set of parameters to be estimated. To get the maximum likelihood estimates, we find the values of \theta that make L as large as possible. Many methods can accomplish this, any one of which should produce the right result. Now suppose that for a particular observation i, the first two variables, y_1 and y_2, have missing data that satisfy the missing at random assumption. (More precisely, the missing data mechanism is assumed to be ignorable.) The joint probability for that observation is just the probability of observing the remaining variables, y_{i3} through y_{ik}. If y_1 and y_2 are discrete, this is the joint probability above summed over all possible values of the two variables with missing data:

f_i(y_{i3}, \ldots, y_{ik}; \theta) = \sum_{y_1} \sum_{y_2} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta).

For continuous missing variables, integrals are used instead of summations. Essentially, then, for each observation's contribution to the likelihood function, we sum or integrate over the variables that have missing data, obtaining the marginal probability of observing those variables that have actually been observed. As usual, the overall likelihood is just the product of the likelihoods for all the observations. For example, if there are m observations with complete data and n - m observations with data missing on y_1 and y_2, the likelihood function for the full dataset becomes

L = \prod_{i=1}^{m} f_i(y_{i1}, y_{i2}, \ldots, y_{ik}; \theta) \prod_{i=m+1}^{n} f_i(y_{i3}, \ldots, y_{ik}; \theta)

where the observations are ordered such that the first m have no missing data and the last n - m have missing data. This likelihood can then be maximized to get maximum likelihood estimates of \theta; there are several different ways to do this. The general algorithm used for finding the maximum likelihood estimate of the parameters of the underlying distribution from a given dataset is the Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), summarized in (Zhou et al., 2014). The EM algorithm has two main applications: first, to find the maximum likelihood estimates when data truly have missing values (e.g., patients lost to follow-up); second, to find the maximum likelihood estimates when the likelihood is analytically intractable and difficult to optimize but can be simplified by assuming that additional, missing parameters exist.


The algorithm can be applied whether the missing data are ignorable or not by including a missing data model in the likelihood. The EM algorithm is a procedure that iterates between two steps: an expectation step (E step) and a maximization step (M step). The E step calculates the expected value of the full-data log-likelihood, given the observed data and the current estimated parameters. The M step maximizes the expectation of the full-data log-likelihood computed in the E step. The steps are usually easy to program, and implementation of the EM algorithm is straightforward using standard statistical software packages (Allison, 2012, pp. 1-21).
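For concreteness, the E and M steps for a multivariate normal model with ignorable missing data can be sketched as follows. This is a standard textbook formulation written by us, not code from this book; it assumes each row has at least one observed value.

import numpy as np

def em_mvn(X, n_iter=50):
    """EM estimates of the mean and covariance of a multivariate normal
    model from a data matrix containing NaNs (ignorable missingness)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    Sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        X_hat = X.copy()
        C = np.zeros((p, p))                      # accumulated conditional covariances
        for i in range(n):
            m = np.isnan(X[i])                    # missing entries in this row
            if not m.any():
                continue
            o = ~m
            S_oo = Sigma[np.ix_(o, o)]
            S_mo = Sigma[np.ix_(m, o)]
            W = S_mo @ np.linalg.inv(S_oo)
            # E step: conditional mean of the missing entries given the observed ones
            X_hat[i, m] = mu[m] + W @ (X[i, o] - mu[o])
            # conditional covariance enters the expected sufficient statistics
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - W @ S_mo.T
        # M step: re-estimate the parameters from the completed data
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        Sigma = (diff.T @ diff + C) / n
    return mu, Sigma, X_hat

The returned X_hat also provides EM-based single imputations of the missing entries, although, as discussed in the following sections, treating such filled-in values as observed data understates the associated uncertainty.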

2.2.3 Imputation methods

Unlike the previous methods, which remove incomplete cases or those that have MVs, imputation algorithms "fill in" MVs. The key realization behind imputation is that, although the MVs are not observed, information about these values may still be extracted from the other observed variables. Replacement or imputation (Rubin, 1996) is the most common method and is recommended when working with small datasets. In this method, missing values in continuous features are replaced with the mean of the feature, as discussed in Section 2.2.3.1.1. For categorical/discrete features, missing values are replaced with one of the values that appear in the feature. This value is usually chosen at random, but it could also be chosen based on the probability distribution of the categories in that feature. For example, suppose a dataset has 100 samples (N = 100) and a feature has the categories {"a", "b", "c", "d", "e"} with modes {20, 30, 5, 3, 42}. Does it sound reasonable to give all the categories the same probability of replacing a missing value in this feature? The reasonable answer is "No". In such a case, a module should be designed to replace the missing values based on a probability function derived from the mode of each category. This approach maintains the original form of the dataset, minimizing the possible damage that may occur during data compensation. Algorithm 2.1 describes a generalized approach for cleaning any type of dataset (a code sketch of its fill-in steps is given below). The algorithm also detects and removes any mono feature (a constant feature that takes on only one value), since such features are useless in machine learning and keeping them only wastes computational resources and causes additional processing delay.

Algorithm 2.1: General data cleansing approach.
Input: Dataset(N, M); N: number of samples, M: number of features.
Output: Dataset(N - x_r, M - y_r); x_r and y_r are the numbers of removed samples and features, respectively (x_r = y_r = 0 initially).
01: Set H_th = number of allowed missing values per sample;
02: Set V_th = number of allowed missing values per feature;
03: Set cat_th = threshold that helps classify discrete features as categorical or not, typically 10% of N;
Phase 1: Horizontal checking (optional).
04: For each sample Xi, i <= N:
05:   If missing values in Xi > H_th, then remove Xi and increment x_r;
06: Evaluate Phase 1: if x_r > 0.1 N, then increase H_th and repeat the horizontal checking (b);
Phase 2: Vertical checking.
07: For each feature Fj, j <= M:
08:   If missing values in Fj > V_th, then: if Fj is vital (b), keep it; else remove Fj and repeat 08;
09:   If Fj is a mono feature, remove Fj and increment y_r; // removing single-value features
10:   Find all missing values Mv = {m_iv} in Fj;
11:   If Fj is categorical/discrete: (a)
12:     Find all categories CT = {c_k, k = 1, ..., C};
13:     For each c_k in CT, find P(c_k) = (# of times c_k appears) / (N - x_r); (c)
14:     For each m_iv in Mv, replace m_iv with a c_k in CT chosen according to P(c_k);
15:   Else {Fj continuous}:
16:     For each m_iv in Mv, replace m_iv with mean(Fj); // the mean is recalculated after each replacement

Notes on Algorithm 2.1: (a) Categories can be detected in many ways in MATLAB; readers interested in coding and implementation should see Chapter 9, where coding tools are discussed explicitly. (b) Domain experts should also be consulted at this point. (c) An equal distribution can also be used, i.e., P(c_k) = 1/C, 1 <= k <= C.
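The fill-in strategy of steps 11-16 can be sketched in Python as follows. This is our own minimal illustration, assuming a pandas DataFrame; the function name and the single-pass mean (rather than recalculating the mean after every replacement) are simplifications, not code from the book.

import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Sketch of Algorithm 2.1, steps 11-16: frequency-weighted category
    sampling for non-numeric columns, mean imputation for numeric columns."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        if not missing.any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # step 16: replace with the mean of the observed values (single pass here)
            out.loc[missing, col] = out[col].mean()
        else:
            # steps 12-14: P(c_k) = (# times c_k appears) / (# observed values)
            probs = out[col].value_counts(normalize=True)
            out.loc[missing, col] = rng.choice(
                probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
            )
    return out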

In general, imputation algorithms can be classified into two types: single and multiple imputation.

2.2.3.1 Single imputation methods
In the following methods, a single value is imputed for each MV. In general, any time a single imputation strategy is used, the standard errors of parameter estimates will be underestimated, since the resulting data are treated as a complete sample, ignoring the consequences of imputation.

2.2.3.1.1 Mean imputation
A popular method replaces each MV with the mean of the correctly recorded values of the same variable. There are two variants of this method: unconditional mean imputation and conditional mean imputation. The first replaces the MV with the unconditional mean of the specific variable. The second, used for datasets with few missing variables (and also known as regression imputation), is an improvement over unconditional mean imputation: it replaces each MV with the conditional mean of the variable based on the fully observed variables in the dataset.


For example, if the variable Y is the only variable that has a MV, then a regression model for predicting Ŷ from the other variables can be used to impute Y. The first step fits the model to the cases where Y is observed. The second step plugs the X values for the nonrespondents into the regression equation, obtaining predicted values Ŷ for the missing values of Y (Y is the original variable with MVs; Ŷ denotes the predicted values) (Allison, 2012, pp. 1-21; Raghunathan, Lepkowski, Van Hoewyk, & Solenberger, 2001). Both methods are single imputation approaches; therefore, they underestimate standard errors and distort the strength of the relationships between variables in the dataset. Conditional mean imputation is acceptable for some problems if the standard errors are corrected (Schafer & Schenker, 2000), but it is problematic if the details of the distribution are of interest (Little & Rubin, 1989). For example, an approach that imputes missing incomes using conditional means tends to underestimate the percentage of cases in poverty.

2.2.3.1.2 Substitution of related observations
In this method, the MV for a specific case is imputed from cases or observations that are related to it. However, this method requires wide knowledge of the relationships between cases in the dataset, and applying it may introduce significant errors.

2.2.3.1.3 Random selection
This method, also known as "hot deck" imputation, was discussed briefly earlier for the case of MVs in categorical variables. In our experience, it is practical for imputing MVs in categorical variables rather than in numerical ones. An advantage of this approach is that it does not require careful modeling to develop the selection criteria for imputing the value, although bias can still be introduced. Note that this approach can serve as a multiple imputation method by selecting multiple values at random from the pool (Cahsai et al., 2015).

2.2.3.1.4 Weighted K-nearest neighbors (KNN) imputation
KNN has many attractive characteristics, which have made it widely used (Aittokallio, 2009). It does not require the creation of a predictive model for each dimension with MVs, and it takes into account the correlation structure of the data. KNN is based on the assumption that points close in distance are potentially similar. For a given input (x_i, w_i) in dataset X, with x_i = (z_{i1}, ..., z_{im}), KNN calculates a weighted Euclidean distance D_{ij} between x_i and x_j \in X such that

D_{ij} = \left( \frac{\sum_{k=1}^{d} w_{ik} w_{jk} (x_{ik} - x_{jk})^2}{\sum_{k=1}^{d} w_{ik} w_{jk}} \right)^{1/2}

The MV in the k-th variable of an observation x_i is estimated by the weighted average of the non-missing values of the K most similar observations x_j to x_i, i.e.,

\hat{z}_{ik} = \sum_{j=1}^{K} \frac{D_{ij}^{-1}}{\sum_{v=1}^{K} D_{iv}^{-1}} x_{jk}
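A NumPy sketch of the weighted KNN imputation just described is given below. It is our own illustration; here the weights w_ik are taken to be indicators of whether x_ik is observed, so that only variables observed in both rows contribute to the distance.

import numpy as np

def knn_impute(X, K=5):
    """Weighted KNN imputation: each missing entry is replaced by a
    distance-weighted average over the K most similar observations."""
    X = np.asarray(X, dtype=float)
    W = (~np.isnan(X)).astype(float)         # w_ik = 1 if x_ik is observed, else 0
    Xz = np.nan_to_num(X)                     # zeros where missing (masked out by W)
    out = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        # weighted Euclidean distance D_ij between row i and every row j
        w = W[i] * W
        den = w.sum(axis=1)
        num = (w * (Xz[i] - Xz) ** 2).sum(axis=1)
        D = np.sqrt(num / np.where(den > 0, den, np.nan))
        D[i] = np.nan                         # exclude the row itself
        for k in np.where(miss)[0]:
            cand = np.where(W[:, k] == 1)[0]             # rows where variable k is observed
            cand = cand[np.isfinite(D[cand])]
            nearest = cand[np.argsort(D[cand])[:K]]      # the K most similar observations
            inv = 1.0 / D[nearest]
            out[i, k] = np.sum(inv / inv.sum() * X[nearest, k])
    return out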


2.2.3.2 Multiple imputation
The goal of multiple imputation is to provide statistically valid inference in practical situations where users and database constructors are distinct in their analyses, models, and capabilities, and where the nature of the task prevents accepting any MVs. In multiple imputation, as the name implies, multiple values are proposed for replacing each MV. These suggested values are drawn from the posterior predictive distribution of the MVs. The process leads to multiple complete datasets. To estimate the parameters of interest, standard full-data methods are applied to each of the complete datasets. The resulting estimates are then combined to produce estimates and confidence intervals, yielding the estimated parameters and their corresponding standard errors (Rubin, 1996; Zhou et al., 2014). The multiple imputation method was developed to reduce the standard and estimation errors that may result from single imputation methods. It was originally developed in the setting of large public-use datasets from sample surveys and censuses. Advances in computational methods and software for creating multiple imputations have attracted researchers to the method and helped publicize it (Zhou et al., 2014).
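In practice, multiple imputation is typically performed with existing libraries. The following hedged sketch uses scikit-learn's IterativeImputer with posterior sampling to create several completed datasets and then pools a simple statistic across them, in the spirit of the combining step described above; it is an illustration of the workflow, not the procedure used in this book.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiply_impute_means(X, n_imputations=5):
    """Create several stochastically imputed datasets and pool an
    example statistic (the column means) across them."""
    estimates = []
    for seed in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        X_complete = imputer.fit_transform(X)
        estimates.append(X_complete.mean(axis=0))   # analysis step on each completed dataset
    estimates = np.asarray(estimates)
    pooled = estimates.mean(axis=0)                  # combined point estimate
    between_var = estimates.var(axis=0, ddof=1)      # between-imputation variability
    return pooled, between_var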

2.2.4 Feature enumeration

After curing all the samples in the dataset, the next step is to recode all nonnumeric entries. Datasets often contain many character-based features; see, for example, the KDD 99 dataset in (Tavallaee, Bagheri, Lu, & Ghorbani, 2009). In such cases, the proper method of processing such features needs to be determined. In general, a dataset X that has n samples and m features that mix categorical and numerical features can be represented as

X = [f_{c1}, f_{c2}, \ldots, f_{cr}, f_{n1}, f_{n2}, \ldots, f_{n(m-r)}]

where r is the number of categorical features in X and (m - r) is the number of numerical features. In this section, the effect of different enumeration techniques on the performance of a support vector machine (SVM) based anomaly detector will be demonstrated. There are multiple ways of enumerating categorical features. Let us assume a categorical feature f_{cu} (u = 1, ..., r) that has a domain of k values {c_1, c_2, ..., c_k} (see Algorithm 2.1 above). The following methods show the possible enumeration techniques for such a feature; a short code sketch follows each description.

- ASCII conversion: This method works well on symbolic features, as discussed in Table 2.1, where all symbols in a feature are replaced with their ASCII codes. For each cell i in f_{cu},

f_{cu}^{i} = \mathrm{ASCII}(f_{cu}^{i}) for symbolic features, or f_{cu}^{i} = \sum_{j=1}^{s} \mathrm{ASCII}(f_{cu}^{i}(j)) for string features,

where j is the index of the characters in the string and s is the string length.

Another approach is to map string values to symbols and then replace them with their ASCII values.
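A short sketch of the ASCII conversion for a pandas column (our illustration): single-character symbols map to their character code, and longer strings map to the sum of their character codes.

import pandas as pd

def ascii_encode(series: pd.Series) -> pd.Series:
    """Replace each symbolic/string value with its ASCII code
    (the sum of character codes for multi-character strings)."""
    return series.map(lambda v: sum(ord(ch) for ch in str(v)) if pd.notna(v) else v)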


Table 2.1 Results and comparison among different SVM models. The table compares SVM configurations (normalized versus non-normalized data, ASCII versus numerical mapping of the symbolic features, sigmoidal versus RBF kernel functions, and box constraints of Inf, 3500, 1500, and 500) and reports accuracy, precision, recall, F1-score, iterations to convergence, training time, and number of support vectors, together with a performance evaluation on 20% of the training data.

- Numerical encoding: This method applies to all categorical and multi-valued discrete features. Symbolic and string values are mapped to numeric indices. The index values can simply be assigned in sequence as 0, 1, ..., based on the order in which each symbol first appears in the feature (i.e., the first symbol found is mapped to 0, the second to 1, and so on). For each cell i in f_{cu}:

if f_{cu}^{i} == c_{ind} then f_{cu}^{i} = ind, with 1 ≤ ind ≤ k.

The above method is simple and easy to implement, but it does not provide efficient encoding for all categorical features. Features whose categories lie at different distances from each other cannot be encoded using this method. Therefore, the method was modified for such features by carefully selecting numerical values based on the meaning of the distance between categories. For example, race and ethnicity, which are very common in biomedical datasets, are symbolic categorical features.

Table 2.1 (cont'd). The corresponding results for non-normalized data, including the performance evaluation on the first 20% of the training data.

Table 2.1 shows the effect of using different data processing techniques on the classification performance of the SVM. The best classifier was obtained using numerical mapping for the symbolic features and an RBF kernel function for the SVM classifier.

In this case, a vector of weights W = {w_1, w_2, ..., w_k} is constructed such that the values of the weights are selected based on the distances between the categories in the feature. These distances are specified either statistically (i.e., based on the statistical characteristics of the variables) or scientifically, using a paradigm derived from the meaning of each variable and how far the categories are from each other. For each cell i in f_{cu}:

if f_{cu}^{i} == c_{ind} then f_{cu}^{i} = w_{ind}, with 1 ≤ ind ≤ k.
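A small sketch of the numerical and weighted encodings described above; the category names and weights below are illustrative placeholders, and in practice the weights would come from domain knowledge or the statistical characteristics of the variable.

```python
def numerical_encode(column):
    """Numerical encoding: map each distinct category to an integer index
    (0, 1, ...) in order of first appearance."""
    mapping = {}
    encoded = []
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)
        encoded.append(mapping[value])
    return encoded, mapping

def weighted_encode(column, weights):
    """Weighted encoding: map each category to a numeric weight chosen to
    reflect the meaningful distance between categories."""
    return [weights[value] for value in column]

ethnicity = ["groupA", "groupB", "groupA", "groupC"]       # hypothetical feature values
codes, mapping = numerical_encode(ethnicity)               # [0, 1, 0, 2]
weighted = weighted_encode(ethnicity, {"groupA": 0.0, "groupB": 0.4, "groupC": 1.0})
```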


This can be considered a generalized case of the binary encoding presented in (Lam et al., 2015), which is described further below.

- Binary feature transformation handles multivalue categorical features by setting the corresponding entries in a binary vector. Consider a categorical feature f_{cu} that has a domain of l values {d_1, d_2, ..., d_l}. In the binary vector [b_1, b_2, ..., b_l], each b_v corresponds to a domain value d_v. This approach is implemented by providing a frequency vector f^i = [f_1^i, f_2^i, ..., f_l^i] in each cluster C_i for each categorical feature f_{cu}, where

f_v^i = \frac{\sum_{X_k \in C_i} b_{kv}}{\sum_{v=1}^{l} \sum_{X_k \in C_i} b_{kv}}, \quad v = 1, ..., l,

where b_{kv} is the binary representation of d_v for sample X_k belonging to cluster C_i. Clearly, 0 ≤ f_v^i ≤ 1 and \sum_{v=1}^{l} f_v^i = 1.

Furthermore, missing values can be resolved by setting all of the binary entries to 1. One form of uncertainty in this feature occurs when data are specified by a range of values instead of one scalar. To handle this, if a numerical feature f_{nu} has interval data [a, b], it is represented by two numeric features, f_{n1u} = a and f_{n2u} = b. Missing values of numeric features are replaced by the average of the observed values or by the k-nearest neighbors (Lam et al., 2015; Obafemi-Ajayi, Lam, Takahashi, Kanne, & Wunsch, 2015). This method can be considered a special case of the well-known one-of-n remapping method (Pyle, 1999).
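A sketch of the binary feature transformation and per-cluster frequency vector described above; the domain values and cluster membership below are made up for illustration.

```python
import numpy as np

def binary_transform(column, domain):
    """Encode each categorical value as a binary vector over the feature's
    domain; a missing value (None) is encoded with all entries set to 1."""
    vectors = []
    for value in column:
        if value is None:                               # missing value
            vectors.append(np.ones(len(domain), dtype=int))
        else:
            vec = np.zeros(len(domain), dtype=int)
            vec[domain.index(value)] = 1
            vectors.append(vec)
    return np.array(vectors)

def cluster_frequency_vector(binary_vectors, member_mask):
    """Frequency vector f^i for one cluster: column sums of the members'
    binary vectors, normalized so the entries sum to 1."""
    counts = binary_vectors[member_mask].sum(axis=0)
    return counts / counts.sum()

domain = ["A", "B", "AB", "O"]                          # hypothetical categorical domain
B = binary_transform(["A", "O", None, "B"], domain)
f_i = cluster_frequency_vector(B, np.array([True, True, False, True]))
```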

2.2.4.1 Special cases of categorical data representation using COBRIT traumatic brain injury data as an example

In medical data, certain lab measurements or drug intakes are represented (or encoded) using histogram bins. This implies that a small change between values within a given range is not considered a significant difference. For example, consider the data available from the citicoline brain injury treatment trial (COBRIT) on the effects of citicoline on traumatic brain injury (TBI) (Zafonte et al., 2012). This dataset is publicly available by permission from the Federal Interagency Traumatic Brain Injury Research (FITBIR) Informatics System website (NIH, 2019). The features representing drug intake are quantified in histogram bins (such as: no dose, < 25 cc, ≥ 22 cc). This type of data is ordinal in nature, not categorical, as distances between the bins do matter; it would be more appropriate to represent such features using integer values. Another scenario is categorical data that has multiple combinations of possible values. For example, in the COBRIT data, consider the location information of an injury on the brain from the CT scan. Some of the phenotypes denoting this had over 70 different possible states or categories. Encoding that many variations can adversely impact


a machine learning model's performance. The proposed approach encodes the information as a bit-type representation. To illustrate this concept, we utilize the computed tomography (CT) intraparenchymal lesion anatomic site feature from the COBRIT data. An injury could be located in one or more of the following locations: frontal (left/right), temporal (left/right), occipital (left/right), parietal (left/right), brainstem, and cerebellar. To determine the total number of bits most efficient for encoding the feature, we consider each possible site individually. For brainstem and cerebellar, given that the injury was either at that position or not, a single-bit representation is applied: if the lesion is present, the bit is on (1); otherwise it is off (0). For the locations that also have a right or left component (frontal, temporal, occipital, and parietal), a 2-bit representation is needed, since a lesion could exist on the left side, on the right side, or on both simultaneously. For example, if a patient had a lesion in the right frontal region, this would be encoded as 01; a lesion in the left frontal region is encoded as 10; a bit representation of 11 implies both left and right. Thus, given 6 possible locations, with four having both left and right components, a total of 10 bits is needed to efficiently and robustly represent this categorical feature with over 80 distinct values. This approach provides a meaningful but concise representation for complex categorical data. For example, an entry of ['Left frontal;Right frontal;Left temporal;Right temporal;Right parietal;Brainstem/diencephalon/CC'] is encoded as 0011110101. The first 2 bits imply nothing in the occipital region, the next 2 bits denote the left and right frontal regions, the next 2 bits denote the left and right temporal regions, and the next 2 bits imply presence in only the right parietal region. The last 2 bits indicate presence or absence in the cerebellar and brainstem regions, respectively.
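A sketch of the 10-bit encoding just described. The bit ordering follows the worked example in the text (occipital, frontal, temporal, parietal as left/right pairs, then cerebellar and brainstem), but the string-parsing details are assumptions about how the raw entries are formatted.

```python
def encode_lesion_sites(entry):
    """Encode a CT intraparenchymal lesion anatomic site entry into 10 bits:
    two bits (left, right) for occipital, frontal, temporal, and parietal,
    then one bit each for cerebellar and brainstem."""
    sites = [s.strip().lower() for s in entry.split(";")]
    bits = []
    for region in ("occipital", "frontal", "temporal", "parietal"):
        left = any(s.startswith("left") and region in s for s in sites)
        right = any(s.startswith("right") and region in s for s in sites)
        bits += [int(left), int(right)]
    bits.append(int(any("cerebellar" in s for s in sites)))
    bits.append(int(any("brainstem" in s for s in sites)))
    return "".join(str(b) for b in bits)

encode_lesion_sites("Left frontal;Right frontal;Left temporal;"
                    "Right temporal;Right parietal;Brainstem/diencephalon/CC")
# -> "0011110101", matching the worked example above
```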

2.2.5 Detecting and removing redundant features

This process includes, but is not limited to, removing highly correlated features. Redundant features are those that show a high degree of correlation with each other. Correlation, in general, is an attempt to characterize the strength and direction of an association between two variables. In terms of strength, the correlation coefficient varies between 0 and 1 in absolute value. When the absolute value of the correlation coefficient is close to 1, there is said to be a nearly perfect degree of association between the two variables; hence one of these features is not required and should be removed. The direction of the relationship is simply the sign of the correlation: + indicates a positive relationship between the variables, - indicates a negative relationship, and the correlation can also be null. In summary, a positive association between variables means they are proportional to each other (an increase in one leads to an increase in the other). However, a correlation near zero does not necessarily mean a weak association: in the case of Pearson correlation it would just mean a weaker linear association (a nonlinear relationship may still exist). As an example of a positive association, the severity measure variable has a positive correlation with the Autism Diagnostic Observation Schedule (ADOS) variable in an autism dataset (Al-Jabery et al., 2016). In contrast,


a negative association means each variable is proportional to the reciprocal of the other: an increase in one leads to a decrease in the other and vice versa (e.g., ADOS communication with IQ in autism) (Al-Jabery et al., 2016). However, it is very important to note that correlation does not equal causality. In other words, a null correlation does not mean the variables are independent, and a strong association does not prove that a change in one variable causes a change in the other variable. In fact, statistics such as the correlation coefficient and other more advanced statistics mainly provide clues regarding what the plausible causal relationships might be, as discussed in the statistics literature (e.g., Chen & Popovich, 2002; Glass & Hopkins, 1996; Pedhazur & Schmelkin, 1991). There are several types of correlation indices in statistics, such as the Pearson correlation, Kendall rank correlation, Spearman correlation, and point-biserial correlation (Cohen, Cohen, West, & Aiken, 2013; Correlation (Pearson, Kendall, Spearman), 2017). The Pearson and Spearman correlations are discussed below, since they are commonly used metrics for determining correlation. The reader can see Chen and Popovich (2002) or Glass and Hopkins (1996) for explicit details and discussion of additional correlation indices.

2.2.5.1 Pearson correlation

Pearson correlation is also known as the "product-moment correlation" because it is calculated by multiplying the z-scores of two variables and then averaging these products over a group of n cases. Pearson's r is the most widely used correlation statistic for measuring the degree of linear relationship between variables. It has been estimated that Pearson's r and its special cases are chosen 95% of the time in research to describe a relationship or to infer a population correlation (Glass & Hopkins, 1996). Pearson's r is calculated using the formula:

r = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}}{\sqrt{\left(\sum X_i^2 - \frac{(\sum X_i)^2}{n}\right)\left(\sum Y_i^2 - \frac{(\sum Y_i)^2}{n}\right)}}

where n is the number of values in each dataset and X_i and Y_i are the paired variables. Pearson's r is viewed as an indicator that describes a linear interdependence between variables X and Y of the form Y = a + bX, where a and b are constants. As illustrated earlier, the closer the absolute value of r is to 1, the stronger the linear interdependence between the variables. Values near 0 indicate a lack of linear relationship between the variables but do not rule out the possibility of a nonlinear relationship.
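For illustration (this is only a sketch, not the implementation presented in Chapter 9), pandas can compute the pairwise Pearson or Spearman correlation matrix, after which one feature from each highly correlated pair is dropped; the 0.9 threshold is an arbitrary example value.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df, threshold=0.9, method="pearson"):
    """Compute the pairwise correlation matrix ('pearson' or 'spearman') and
    drop one feature from every pair whose absolute correlation exceeds
    the threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# reduced_df, dropped = drop_correlated_features(features_df, threshold=0.85, method="spearman")
```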

2.2.5.2 Spearman correlation

Spearman correlation is known as Spearman's "rank correlation" coefficient since it is found by calculating the Pearson correlation on the ranked data within the two features.


Spearman’s correlation is often denoted by the Greek letter r and it provides a measure of the strength and direction of a monotonic associations between ranks of the two features. It is sometimes used as an alternative to the Pearson correlation coefficient since it is less sensitive to outliers due to the utilization of ranks rather than actual data values in the calculation and can capture the strength of monotonic relationships beyond linear associations (Conover, 1999). A threshold between 0 and 1 will be chosen such that if the absolute value of the correlation between a pair of features exceeds the threshold, one of the features in the pair will be removed. However, the threshold of correlation depends on many factors, including domain expert opinion, quality of the final results, etc. In Chapter 9, the code implementation for automatic removal of the correlated features is presented. Mono features or single value features should be removed from the dataset at this point to avoid unnecessary processing of them. These features have only one value or only few samples for which they vary (i.e. 0, move the data sample with the maximum distance difference into Cj . 5. If the number of clusters is less than the number of data samples: Repeat 2(b) and 4. Else: stop. If all the variables were used in a divisive algorithm, it is called polythetic otherwise the algorithm is monothetic (Xu & Wunsch, 2009), See for example MONA (Monothetic Analysis) algorithm with binary variables (Kaufman & Rousseeuw, 1990, pp. 199e252). However, both agglomerative and divisive clustering algorithms organize data objects into a hierarchical structure based on the proximity matrix. The results of hierarchical clustering are usually depicted by a binary tree or dendrogram, as shown in Fig. 3.2. The root node of the dendrogram represents the entire dataset, and each leaf node is regarded as a data object. The intermediate nodes thus describe the extent to which the


FIG. 3.2 Example of a dendrogram from hierarchical clustering. The clustering direction for the divisive hierarchical clustering is opposite that of the agglomerative hierarchical clustering. A specific desired number of clusters (two, shown here) are obtained by cutting the dendrogram at an appropriate level (Xu & Wunsch, 2009).

objects are proximal to each other, and the height of the dendrogram usually expresses the distance between each pair of data objects or clusters, or a data object and a cluster. The ultimate clustering results can be obtained by cutting the dendrogram at different levels (the dashed line in Fig. 3.2). This representation provides very informative descriptions and a visualization of the potential data clustering structures, especially when real hierarchical relations exist in the data, such as the data from evolutionary research on different species of organisms, or other applications in medicine, biology, and archeology.
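As a brief, generic illustration of this idea (using SciPy's agglomerative hierarchical clustering with made-up toy data), the merge tree can be plotted as a dendrogram and cut to obtain a desired number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data: six 2-dimensional objects forming visually obvious groups.
X = np.array([[1.0, 1.1], [1.2, 0.9],
              [5.0, 5.2], [5.1, 4.9],
              [9.0, 9.1], [8.8, 9.3]])

Z = linkage(X, method="average", metric="euclidean")   # agglomerative merge tree
labels = fcluster(Z, t=2, criterion="maxclust")         # cut the dendrogram into two clusters

# dendrogram(Z)  # draws the binary tree; cutting at different heights yields
#                # different numbers of clusters, as in Fig. 3.2
```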

3.3.2 Density-based clustering

Density-based clustering is a nonparametric approach that identifies clusters formed by areas of different object densities: clusters are areas of high object density separated by areas of low density. Density-based clustering methods do not require the number of clusters as an input parameter, nor do they make assumptions concerning the underlying density p(x) or the variance within the clusters that may exist in the dataset. The approach groups high-density data samples into one cluster; each cluster is a set of data objects spread over a contiguous region of high object density in the data space. Contiguous regions of low sample density identify the borders of density-based clusters, and the data samples in these low-density regions are considered noise or outliers (Ester, Kriegel, Kröger, Sander, & Zimek, 2011). Ester and his colleagues developed this method in 1998 (Xu, Ester, Kriegel, & Sander, 1998). Fig. 3.3 illustrates the effect of the density level on identifying clusters.


FIG. 3.3 Density distributions of data points and density-based clusters for different density levels. Different colors indicate different clusters or noise. This figure was taken from Ester, M., Kriegel, H.P., Kröger, P., Sander, J., & Zimek, A. (2011). Density-based clustering. Data Mining and Knowledge Discovery, 1(3), 231-240. https://doi.org/10.1002/widm.30.

Several density-based clustering algorithms have been proposed in the literature: Wushert's (1969) method, which generalized the nearest-neighbor method and reduces the chaining effect; CLUPOT (Coomans & Massart, 1981); DBSCAN (Ester, Kriegel, Sander, & Xu, 1996); Denclust (Hinneburg & Keim, 1998); and the K-NN kernel-based density method for high-dimensional data (Tran, Wehrens, & Buydens, 2006).
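A minimal DBSCAN illustration with scikit-learn and made-up data; eps and min_samples jointly set the density level that separates clusters from noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense groups plus one isolated (low-density) point.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
              [9.0, 0.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise/outliers
```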

3.3.3 Subspace clustering

Subspace clustering identifies groups of objects that are homogeneous in a subset of dimensions but not necessarily in all of them. This method is particularly effective in


high-dimensional datasets. It is possible in high-dimensional datasets to find several variables that are not relevant to clustering the data. Consequently, distance calculations using the full-dimensional space are inefficient, and the irrelevant variables distort the distance computations, which may lead to meaningless clusters. Subspace clustering algorithms search locally for relevant dimensions to find clusters that exist in multiple subspaces.

3.3.3.1 Basic subspace clustering
The approaches that solve the basic subspace clustering problem have three main characteristics. First, they handle quantitative 2D datasets. Second, their homogeneous function is distance based. Third, the significance of the cluster size is determined by user-specified thresholds. Their main difference lies in their homogeneous and support functions (Sim, Gopalkrishnan, Zimek, & Cong, 2013).

3.3.3.1.1 Grid-based subspace clustering
In grid-based subspace clustering (Road & Jose, 1998), the data space is partitioned into grids, and dense grid cells containing a significant number of objects are used to form subspace clusters. The method partitions the domain of each attribute into identical intervals. Several factors govern the cluster formation process in grid-based subspace clustering: homogeneity, density, and the number of intervals. For further discussion of this clustering method, see Agrawal, Gehrke, Gunopulos, and Raghavan (1998), Park and Lee (2007), and Sim et al. (2013). However, it is important to highlight the following interval-related weaknesses of this algorithm. If the intervals do not overlap and are positioned in the wrong locations of the grids, subspace clusters will be overlooked. In addition, the predetermined sizes of the intervals may degrade the quality of the clusters. Several techniques have been proposed to overcome these weaknesses: (1) adaptive grids, in which the interval sizes vary based on the data distribution (Nagesh, Goil, & Choudhary, 2001); (2) a nonlinear monotonically decreasing threshold, presented by Sequeira and Zaki (2004) to overcome the disadvantage of using fixed thresholds; and (3) trying a range of parameters and selecting the most stable results within a particular range of parameter settings until the desired number of clusters is obtained, which is the method suggested for addressing parameter sensitivity (Sim et al., 2013).

3.3.3.1.2 Window-based subspace clustering
Window-based subspace clustering (Liu, Li, Hu, & Chen, 2009) was primarily developed to overcome the weaknesses of grid-based subspace clustering. In window-based subspace clustering, a sliding window is slid over the domain of each attribute to obtain overlapping intervals, which are then used as building blocks for a subspace cluster. Thus, the chances of "true" subspace clusters being overlooked are greatly reduced. The homogeneity of the clusters depends on the largest distance between the objects in them, the cluster size, and the tuning parameters.


The major disadvantage of this method is its inability to mine arbitrarily shaped clusters, which has been overcome in density-based subspace clustering, as discussed in the following section.

3.3.3.1.3 Density-based subspace clustering
This subspace clustering method was proposed by Kailing and Kriegel (2004). It drops the usage of grids to overcome the problems of grid-based subspace clustering. Moreover, it is able to mine arbitrarily shaped subspace clusters in the hyperplane of the dataset. Window-based subspace clustering is unable to mine arbitrarily shaped clusters, which leads to the problem of mining elongated clusters. The clusters resulting from this method can be described as chains of objects: two objects belong to the same cluster if there is a chain of objects between them such that the closest pair of objects in the cluster satisfies the distance constraint, which simplifies the identification of arbitrarily shaped clusters.

3.3.3.2 Advanced subspace clustering
This section discusses advanced methods for handling complex data, noisy data, and categorical data, in which we explore in detail our new algorithm (k-dimensional subspace clustering). This section mostly discusses material from the survey on subspace clustering presented by Sim et al. (2013). The section also discusses the presented algorithms based on the type of data.

3.3.3.2.1 3D subspace clustering
There are three types of 3D subspace clustering, depending on the type of the targeted dataset. In 3D binary datasets, the binary 3D subspace cluster is a sub-cuboid (Sim et al., 2013). This definition is also known as a frequent closed cube (Ji, Tan, & Tung, 2006, pp. 811-822), closed 3-set (Cerf, Besson, Robardet, & Boulicaut, 2008, 2009), frequent tri-set (Jäschke, Hotho, Schmitz, Ganter, & Stumme, 2006), and cross-graph quasi-biclique subgraph (Sim, Liu, Gopalkrishnan, & Li, 2011). The cluster is sensitive to the three tuning parameters (the minimum set of objects, the minimum number of attributes, and the minimum number of timestamps). This clustering shows the concept of subspace in the time dimension. This section and the following discussion are distilled and quoted from Sim's survey on advanced subspace clustering (Sim et al., 2013). Cerf et al. (2008) proposed closed n-sets, which are n-dimensional binary subspace clusters, but we focus our attention on closed 3-sets, since 3D datasets are more common in the real world. In the dense 3D subspace cluster (Georgii, Tsuda, & Schölkopf, 2011), the homogeneity is based on the cluster density and the density control parameter q; setting q = 1 leads to the binary 3D subspace cluster. This subspace clustering does not depend on the size of the cluster, but the remaining parameters (i.e., the set of objects, attributes, or timestamps) should not be singletons. This method is only sensitive to the density parameter q. This approach can mine dense n-dimensional subspace clusters similar to closed n-sets (Cerf et al., 2008).


(The following section is quoted with modification from Sim et al., 2013.) Subspace clustering in a quantitative 3D dataset is more complex than in a binary 3D dataset, as the homogeneity of the clusters is not just a matter of '0s' and '1s'. A simple solution is proposed by Sim et al. (2011), in which the values are discretized and converted into a binary dataset, and then 3D binary subspace clusters are mined from it. However, this lossy conversion of the data has several weaknesses. Selecting the appropriate discretization method is non-trivial, and information may be lost during the discretization. Moreover, the binary dataset may increase exponentially if the discretization is too fine, as each attribute of the binary dataset corresponds to an interval of the discretized attribute values of the original dataset. Jiang, Pei, Ramanathan, Tang, and Zhang (2004) mine 3D subspace clusters, known as coherent gene clusters, directly from a quantitative 3D dataset, but they "flatten" the 3D dataset into 2D, which results in the strict requirement that the clusters must be persistent in every timestamp of the dataset. Zhao and Zaki (2005) proposed tricluster, which is a variant of the window-based subspace cluster. Unlike the coherent gene cluster, it does not "flatten" the 3D dataset into 2D, and the concept of subspace exists in the time dimension. Tricluster is a highly flexible cluster model that can be morphed into a wide variety of 3D subspace clusters, such as clusters that have similar values, clusters that exhibit shifting or scaling patterns, etc. To this end, the homogeneous function of the window-based subspace cluster is extended to the object and time dimensions, and the pScore is used to detect the shifting or scaling patterns. In a categorical subspace cluster, the set of objects has the same value for each attribute. Under the assumption that the attributes are independent and uniformly distributed, the first criterion of the support function requires the number of objects in the cluster to be more than the expected number, while the second criterion requires that the occurrences of each pair of values in the cluster are more than the expected number. The assumption of the attributes being independent and uniformly distributed can be too rigid, as it is possible that the attributes are dependent and the dataset may not be uniformly distributed. Only one tuning parameter, α, is needed, which controls the density of the cluster. Consequently, the cluster is sensitive to α. The k-dimensional clustering algorithm (Al-Jabery et al., 2016) is a promising technique in subspace clustering for mixed and complex data. Here we discuss the algorithm using an example in which it was applied to a complicated biomedical dataset collected from Guatemala (Voth-Gaeddert, Al-Jabery, Olbricht, Wunsch, & Oerther, 2019).


The k-dimensional subspace clustering method includes two phases of clustering: unidimensional and k-dimensional. The clustering method is ideal for assessing child stunting variables due to its efficiency in variable selection as well as its flexibility in handling mixed datasets (i.e., categorical and continuous variables). For unidimensional clustering, a clustering criterion is applied to each variable in the dataset depending on the type of variable. If the variable is continuous, any convenient clustering algorithm (e.g., k-means (Forgy, 1965)) is applied. If the variable is categorical, it is partitioned based on how many children belong to the same category. At the end of the unidimensional clustering, there are as many clustering criteria as there are variables in the dataset. Each unidimensional clustering result is then evaluated using the Davies-Bouldin (DB) (Davies & Bouldin, 1979) internal validation index. According to the DB index, the variables are ranked, with a lower value of the DB index receiving a higher rank (i.e., more distinguishing). Note that in Fig. 3.4, d1,1 ≤ d1,2 ≤ ... ≤ d1,m ≤ d2,1, and so on, meaning the variables are sorted in descending order based on their clusters' DB index. The DB index provides a quantitative measure of the similarity between clusters: it calculates the similarity between every cluster and all other clusters and then averages these similarities. Therefore, lower DB values indicate better clustering performance, since a low DB means the clusters are dissimilar to each other. At the end of the unidimensional clustering phase, there are as many clustering labels as there are variables in the dataset. Each clustering label results from applying a clustering criterion to the dataset using the specified variable only.

FIG. 3.4 Depiction of the unidimensional and k-dimensional clustering process (the figure shows an illustration of a three-variable dataset to simplify the concept). i, j, and p are the different variables; x1, x2, etc. are the data points (i.e., children). Groups (1, ..., N) are groups of variable clustering labels that share similar DB index values. The mean of the dividing lines for a subset of groups of variables is computed, and all variables within the multiple groups are aggregated into a multidimensional cluster/group. Note that x6 is clustered with x7 and x9 since it shares all the unidimensional clusters with them except one; this happens if the allowed difference (diff) ≥ 1.


For example, if a dataset consists of n data points and m variables, after applying the unidimensional clustering there will be m clustering labels (labels here are numerical indices that assign each data point (i.e., child) to a cluster), each of them consisting of n labels. These clustering labels will be referred to as variable clustering labels, and they are the inputs for the k-dimensional clustering phase. Each variable clustering label consists of n numerical indices representing the cluster to which each child was assigned as a result of clustering the data based on that specified variable. The k-dimensional subspace clustering process (the k here is different from that in k-means) combines the children assigned to the same clusters, in the unidimensional clustering phase, along all or most of the selected k dimensions. Algorithm I below illustrates the implemented clustering approach.

Algorithm I: k-dimensional subspace clustering
Unidimensional clustering (in this phase, each variable is treated as a complete dataset):
1. Read dataset(); the dataset has n children, and each child has m variables.
2. Classify variables into categorical and numerical.
3. For each variable i, {1 ≤ i ≤ m}, in the dataset: if i is categorical, apply simple partitioning; else, apply k-means.
4. Assign m labels to each data point (i.e., child).
5. Unidimensional clustering outputs an (n × m) matrix that represents the unidimensional labels given to each child along each single variable (i.e., dimension).
6. Evaluate each clustering criterion using the Davies-Bouldin (DB) internal validation index.
7. Sort the variables based on their DB scores.
k-dimensional clustering:
8. Divide the clustering criteria into N groups, each group having g criteria, where group(1) contains the best g variables ranked using DB and group(N) contains the lowest-ranked variable criteria, as illustrated in Fig. 3.4.
9. Generate k-dimensional clusters:
9.1. While Number_of_generated_criterion < max:
9.2. Select g1 = group(ind1), {1 ≤ ind1 ≤ N}. Note that in the first iteration, variables from one group are selected, but later, variables from more than one group are combined.
9.3. Set diff = the allowed differences, {0 ≤ diff ≤ th}; typically th < one-tenth of the selected variables.
9.4. For i = 0 to diff:
9.4.1. For all children in the dataset, compare unidimensional labels. Children with the same labels along the selected variables are assigned to the same k-dimensional label. (Note that children in the same k-dimensional cluster are allowed to have i differing unidimensional labels.) That is, if the labels of child x1 and child x2 differ in at most i positions, assign both to the same cluster.
9.5. Increment Number_of_generated_criterion.
End while.
10. Evaluate all k-dimensional clustering criteria using validation indices and regression analysis.
11. Choose the best criteria.
12. End.
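A simplified sketch of the two phases of Algorithm I (an illustration under assumptions, not the authors' implementation): X is assumed to be a numeric matrix in which categorical variables are already integer coded, and the greedy pairwise comparison in the second phase is only one possible way to combine unidimensional labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def unidimensional_phase(X, is_categorical, k=3):
    """Phase 1: cluster each variable on its own (simple partitioning for
    categorical variables, k-means for continuous ones) and rank the
    variables by the DB index of their unidimensional partitions."""
    n, m = X.shape
    labels = np.zeros((n, m), dtype=int)
    db = np.full(m, np.inf)
    for j in range(m):
        if is_categorical[j]:
            _, labels[:, j] = np.unique(X[:, j], return_inverse=True)
        else:
            labels[:, j] = KMeans(n_clusters=k, n_init=10).fit_predict(X[:, j].reshape(-1, 1))
        if len(np.unique(labels[:, j])) > 1:
            db[j] = davies_bouldin_score(X[:, j].reshape(-1, 1), labels[:, j])
    return labels, np.argsort(db)      # label matrix and variables ranked by DB (best first)

def k_dimensional_phase(labels, selected_vars, diff=0):
    """Phase 2: children whose unidimensional labels agree on the selected
    variables, allowing up to `diff` disagreements, share a k-dimensional cluster."""
    n = labels.shape[0]
    cluster = -np.ones(n, dtype=int)
    next_id = 0
    for a in range(n):
        if cluster[a] == -1:
            cluster[a] = next_id
            next_id += 1
        for b in range(a + 1, n):
            mismatches = np.sum(labels[a, selected_vars] != labels[b, selected_vars])
            if cluster[b] == -1 and mismatches <= diff:
                cluster[b] = cluster[a]
    return cluster
```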

The variable selection process is clarified in the context of this section. The process of variable selection is similar to creating a shadow copy of the groups of variable clustering labels that resulted from the unidimensional clustering phase. The shadow copy is then moved beneath the original along with all children (i.e., data points), and children that belong to the same unidimensional clusters along all or most of the selected variables (note that "most" here means all minus the allowed difference) are assigned to the same k-dimensional cluster. The question in the reader's mind is probably: what is k, and where did it come from? Here, k is the total number of selected dimensions (i.e., variables). These variables may be the variables of only one group or those of two or more groups (the algorithm even allows comparison along all the variables in the dataset). The reason for dividing the variable clustering labels into groups is to explore as many subsets of the variables as possible, which reveals the optimum subset of variables that yields scientifically meaningful clusters. Fig. 3.4 depicts the clustering process, where children that belong to the same cluster along the selected dimensions (a common thread among individual variables) are combined into the same k-dimensional cluster (denoted as a multidimensional cluster there). Due to the multiple clustering criteria implemented in this algorithm (unidimensional and k-dimensional), cluster evaluation is vital for selecting the most descriptive clusters among the applied k-dimensional clustering criteria. Internal validation indices and statistical evaluation are the common clustering evaluation tools. Internal validation indices are metrics for measuring the similarity between the objects that belong to the same cluster and the difference between children of different


clusters (see Chapter 7). The internal validation indices used with this algorithm in Voth-Gaeddert et al. (2019) are:

- Davies-Bouldin (DB) index (Davies & Bouldin, 1979): This index measures the similarity between the clusters. Therefore, the clustering criterion with the lowest DB index is the best, as the resulting clusters are more distinguishable from each other.
- Caliński-Harabasz (CH) index (Caliński & Harabasz, 1974): This method evaluates the clusters based on the sum of squares of the minimum distances within and between clusters; the higher the value, the better the criterion.
- Silhouette index (SI) (Rousseeuw, 1987): This index evaluates the clustering performance based on the pairwise distances between and within clusters. The higher the value of SI, the better the clustering criterion. However, this index is very sensitive to outliers.

The clustering criteria that were in the lower decile for Davies-Bouldin and the upper decile for CH and SI were chosen for further statistical evaluation. The statistical evaluation method, which utilizes nominal logistic regression, was performed on the reduced subset of clustering criteria with the cluster number as the response variable and the variables included in the specific clustering criterion as the explanatory variables. Cluster membership can be predicted based on the model to determine how effective the variables are in cluster classification. The model was built using 2/3 of the data as a training set and 1/3 of the data as a test set. The classification error rate (CER) and R-square value on the test dataset were calculated and used as additional criteria for cluster evaluation. Low values of CER are better, while high values of R-square are best. The remaining clustering criteria were ranked based on their DB, SI, CH, test-set R-square, and CER values. Values in the lower half for DB and CER were ranked "high," and values in the upper half for SI, CH, and R-square were deemed "high." The cluster criteria with the most "high" rankings across the five different criteria were ranked as the best. The outputs from the cluster evaluation methods (internal validation indices and the statistical evaluation) provided several groups of variables and their associated clusters of children. As the overall focus of this study was on child stunting and diarrheal prevalence (the primary and secondary outcomes, respectively), additional specific criteria were developed for the final selection of the group of variables for the output from the clustering method. The two additional criteria were as follows: (1) the child stunting and diarrheal prevalence variables must have been included in the group of variables, and (2) the difference in the mean child stunting variable (i.e., height-for-age z-score) between the clusters of children had to be the largest.

There are several clustering algorithms applied in the field of biomedical data analysis. The following sections discuss most of them and are mostly quoted with modifications from Xu and Wunsch (2005, 2011).
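For reference, the three internal indices discussed above are available in scikit-learn; a minimal sketch of scoring one clustering criterion (the variable and dictionary names are illustrative):

```python
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score, silhouette_score

def evaluate_partition(X, labels):
    """Score one clustering criterion with the three internal validation
    indices used above: lower DB is better; higher CH and SI are better."""
    return {
        "DB": davies_bouldin_score(X, labels),
        "CH": calinski_harabasz_score(X, labels),
        "SI": silhouette_score(X, labels),
    }

# scores = {name: evaluate_partition(X, crit_labels) for name, crit_labels in criteria.items()}
# Criteria in the lower decile for DB and the upper decile for CH and SI would
# then be kept for the statistical (nominal logistic regression) evaluation.
```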


3.3.4 Squared error-based clustering

The following discussions, unless otherwise indicated, will focus on partitional clustering, which assigns a set of data objects into K clusters without any hierarchical structure. Partitional clustering can be constructed as an optimization problem: given N d-dimensional data objects x_j ∈ R^d, ...

2. Update the membership matrix U by

u_{ij}^{(t+1)} =
  [ \sum_{l=1}^{c} (D_{ij}/D_{lj})^{2/(m-1)} ]^{-1},  if I_j = ∅;
  1/|I_j|,  if I_j ≠ ∅ and i ∈ I_j;
  0,  if I_j ≠ ∅ and i ∉ I_j;
for i = 1, ..., c and j = 1, ..., N,   (3.25)

where D_{ij} = ||x_j - m_i|| and I_j = {i | i ∈ [1, c], x_j = m_i};

3. Update the prototype matrix M by

m_i^{(t+1)} = \left( \sum_{j=1}^{N} (u_{ij}^{(t+1)})^m x_j \right) \Big/ \left( \sum_{j=1}^{N} (u_{ij}^{(t+1)})^m \right),  for i = 1, ..., c;   (3.26)

4. Increase t by 1, and repeat steps 2-3 until ||M^{(t+1)} - M^{(t)}|| < ε, where ε is a positive number.
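A compact NumPy sketch of the fuzzy c-means iteration in Eqs. (3.25)-(3.26); the small constant guarding zero distances stands in for the I_j bookkeeping of Eq. (3.25) and is an implementation convenience, not part of the original formulation.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Alternate the membership update (3.25) and the prototype update (3.26)
    until the prototype matrix changes by less than eps."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    M = X[rng.choice(N, size=c, replace=False)]                    # initial prototypes
    for _ in range(max_iter):
        D = np.linalg.norm(X[None, :, :] - M[:, None, :], axis=2)  # D[i, j] = ||x_j - m_i||
        D = np.fmax(D, 1e-12)                                      # avoid division by zero
        U = 1.0 / np.sum((D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
        M_new = (U ** m) @ X / np.sum(U ** m, axis=1, keepdims=True)
        converged = np.linalg.norm(M_new - M) < eps
        M = M_new
        if converged:
            break
    return U, M

# U, M = fuzzy_c_means(np.random.rand(100, 2), c=3)
```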


Possibilistic c-means clustering (PCM) is another fuzzy clustering approach that is specially designed for abating the effect of noise and outliers (Krishnapuram & Keller, 1993). PCM is based on absolute typicality, which reinterprets the memberships with a possibilistic view, i.e., "the compatibilities of the points with the class prototypes" (Krishnapuram & Keller, 1993), rather than as the degree of membership of a certain object belonging to a certain cluster. Accordingly, the probabilistic constraint on the membership coefficients in Eq. (3.26) is relaxed to:

\max_i u_{ij} > 0, \quad \forall j,   (3.27)

which simply ensures that each data object belongs to at least one cluster. Also, the possibility that a data object falls into a cluster is independent of its relations with other clusters. Such a change is reflected in the cost function, reformulated as

J(U, M) = \sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^m \|x_j - m_i\|^2 + \sum_{i=1}^{c} \eta_i \sum_{j=1}^{N} (1 - u_{ij})^m,   (3.28)

where the η_i are some positive constants used to avoid the trivial zero solution. The additional term tends to assign credit to memberships with large values, and its influence on the cost function is balanced by η_i, whose value determines the distance at which the membership of a data object in a cluster becomes 0.5.

3.3.6 Evolutionary computational technology-based clustering

As local search techniques, such as the hill-climbing approach that K-means is based on, easily get stuck in local minima, recent advancements in evolutionary computational technologies (Fogel, 2005) provide an alternate way to explore the complicated solution space more effectively and find the global, or approximately global, optimum. For example, particle swarm optimization (PSO) is a nature-inspired algorithm based on the simulation of complicated social behavior, such as bird flocking or fish schooling (Kennedy, Eberhart, & Shi, 2001). Ant colony optimization (ACO) algorithms are also nature inspired, referring to a broad category of algorithms that mimic and model the natural behaviors of real ants (Dorigo, Birattari, & Stutzle, 2006). Genetic algorithms (GAs) (Holland, 1975), inspired by natural evolutionary processes, maintain a population of "individual" solutions that evolve based on a set of evolutionary operators, such as selection, recombination, and mutation, and are also widely used in clustering.

Generally, in the context of evolutionary computational technology-based clustering, each individual in the population encodes the information for a valid data partition. These partitions are altered iteration by iteration, for example, with either the evolutionary operators in GAs or by following the search procedure of PSO or ACO. The best ones with the highest fitness function scores, which are used to evaluate the quality of partitions, are considered the resulting partitions. An example of a PSO-based clustering algorithm is described in Algorithm IV below. As such, fitness function selection and individual encoding become two important factors with regard to the application of evolutionary computational approaches in clustering. Some comparison studies on cluster validation indices can be found in Xu and Wunsch (2005, 2009); see Xu and Wunsch (2009) for a discussion of appropriate validation indices. Some experimentation is often required.

Algorithm IV: PSO-based clustering algorithm.
1. Initialize a population of particles with random positions and velocities. Set the values of the user-dependent parameters, and set the variable t = 1;
2. For each individual z_i of the population,
   a. Calculate its fitness function Fit(z_i);
   b. Compare the current fitness value Fit(z_i) with the value Fit(p_i) from the particle's own previous best position p_i. If the current value is better, reset both Fit(p_i) and p_i to the current value and location;
   c. Compare the current fitness value Fit(z_i) with the best value Fit(p_g) in the swarm. If the current value is better, reset Fit(p_g) and p_g to the current value and location;
3. Update the velocity and position of the particles with the following equations,

v_i(t) = WI \cdot v_i(t-1) + c_1 \varphi_1 (p_i - z_i(t-1)) + c_2 \varphi_2 (p_g - z_i(t-1)),   (3.29)

z_i(t) = z_i(t-1) + v_i(t),   (3.30)

where WI is the inertia weight, c_1 and c_2 are the acceleration constants, and φ_1 and φ_2 are uniform random functions in the range of [0, 1].


4. Increase t by 1 and repeat steps 2-3 until a stopping criterion is met, which usually occurs upon reaching the maximum number of iterations or discovering high-quality solutions.

Alternately, in the context of clustering N data objects into K clusters with cluster centroids (means, medoids (Xu & Wunsch, 2005), or other representative points) M = {m_i, i = 1, ..., K}, each individual could be encoded directly to consist of the K cluster centroids (Ball & Hall, 1967), denoted as z_i = (m_1, ..., m_K), which is known as centroid-based representation (Maulik & Bandyopadhyay, 2000; van der Merwe & Engelbrecht, 2003; Sheng & Liu, 2006). For example, in the case that medoids, which are exact objects of the dataset, are used to represent clusters, each individual in the population can be considered a set of medoids of selected data objects, with the length corresponding to the number of clusters. So, an individual {8, 19, 28, 266, 525, 697} encodes a partition with six clusters that are represented by the 8th, 19th, 28th, 266th, 525th, and 697th data objects, selected as medoids. Some other encoding strategies are also available, such as partition-based representation, which considers each individual in the population as a string of N data objects, where the ith element of the string indicates the cluster number assigned to the ith data object (Eddaly, Jarboui, & Siarry, 2016). For example, the representation of the clustering of nine data objects into three clusters, {x1, x5, x8}, {x3, x7}, and {x2, x4, x6, x9}, is denoted as the string "132313213". One major drawback associated with this representation is its redundancy and context insensitivity, which appears as the replication of the same clustering but with different labels attached to the clusters. For example, both the string "121233" and the string "232311" represent exactly the same clustering solution {{x1, x3}, {x2, x4}, {x5, x6}}. Additional solution parameters for cluster labels and group-oriented evolutionary operators that work with groups of individuals have been suggested to prevent such a cluster label-associated cycle (Hruschka & Ebecken, 2003; Reeves, 2001).


The major disadvantage associated with centroid-based representation is that the number of clusters must be determined in advance. A possible way to bypass this problem is to consider each individual as a binary vector with length equal to the maximum number of clusters Kmax, i.e., z_i = (z_i1, ..., z_iKmax), z_ij ∈ {0, 1} for j = 1, ..., Kmax, since, generally, it is much easier to estimate the maximum number of clusters than the actual number of clusters (Omran, Salman, & Engelbrecht, 2006). Thus, if z_ij = 1, the corresponding cluster centroid candidate is selected to be part of the clustering solution; alternately, if z_ij = 0, the corresponding cluster centroid is excluded from the current solution. Note that in this encoding strategy only one set of cluster centroids is taken into account during each iteration, which can be extended further to consider Kmax sets of cluster centroids at a time, rather than only one (Abraham, Das, & Konar, 2007). Given the maximum number of clusters Kmax and a set of d-dimensional cluster centroids {m_ij, j = 1, ..., Kmax}, particle i is encoded as a (Kmax + Kmax × d)-dimensional vector, z_i = (s_i1, ..., s_iKmax, m_i1, ..., m_iKmax), where s_ij (j = 1, ..., Kmax) is an activation threshold in the range [0, 1]. The activation threshold functions as a control parameter that determines whether the corresponding cluster centroid is selected or not: if it is greater than 0.5, the cluster is chosen; otherwise, the cluster becomes inactive. For example, assume the vector (0.67 0.21 0.30 0.95 0.88 (1.9 6.7) (1.5 5.3) (3.9 2.8) (4.5 3.6) (5.2 6.0)) is an instance of a particle encoding a clustering partition in a 2-dimensional data space with a maximum of 5 clusters. The activation thresholds 0.67, 0.95, and 0.88 then indicate that the first, fourth, and fifth clusters, with centroids (1.9 6.7), (4.5 3.6), and (5.2 6.0), become activated, and data objects are assigned to these three clusters. Conversely, the other two clusters, with corresponding activation thresholds less than 0.5 (0.21 and 0.3), will not be considered. If an activation threshold becomes negative or greater than one, it is fixed to zero or one, respectively. If all activation thresholds for a particular particle are not greater than 0.5, which indicates that no clusters are activated, two thresholds will be randomly selected and reinitialized to random values in the range of 0.5-1 to make sure that at least two clusters exist.


A large number of evolutionary computational technology-based clustering algorithms have been used for gene expression data analysis in hopes of further increasing the clustering quality, considering their more powerful search capabilities in the complicated problem space compared to local search methods (Du, Wang, & Ji, 2008; He & Hui, 2009; Ma, Chan, & Chiu, 2006). As such, one major application of evolutionary computational approaches is to bridge with K-means or fuzzy c-means, which rely on hill-climbing searches and suffer from their inherent limitations, in order to improve their performance. For example, a variant of PSO, called the particle-pair optimizer (PPO), in which two pairs of particles work in a cooperative way, was integrated with K-means to take advantage of the computational efficiency of K-means and the parallel search capability of PPO. Its performance on the yeast cell-cycle data, the sporulation data, and the lymphoma data is consistently better than either K-means or fuzzy c-means. Based on similar considerations, one step of K-means is incorporated into the regeneration steps of the niching genetic K-means algorithm. A niching method is combined with a GA to maintain the population diversity and prevent premature convergence during the evolutionary process. The niching method encourages mating among similar clustering solutions while allowing some competition among dissimilar clustering solutions. Some more recent applications of evolutionary computation in gene expression data analysis include the EvoCluster algorithm and GenClust (Di Gesú, Giancarlo, Lo Bosco, Raimondi, & Scaturro, 2005). An investigation of clustering algorithms inspired by ant behavior in gene expression data analysis was reported in He and Hui (2009), in terms of the study of the ant-based clustering algorithm and the ant-based association-rule mining algorithm.
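A schematic PSO-based clustering sketch using the centroid-based encoding and the update rules of Eqs. (3.29)-(3.30); the fitness function here (total squared distance to the nearest encoded centroid) is one common choice and is not necessarily the one used in the studies cited above.

```python
import numpy as np

def pso_clustering(X, K, n_particles=20, iters=100, WI=0.7, c1=1.5, c2=1.5, seed=0):
    """PSO-based clustering with centroid-based encoding: each particle is a
    flattened set of K candidate cluster centroids, moved with Eqs. (3.29)-(3.30)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Z = X[rng.integers(0, N, size=(n_particles, K))].reshape(n_particles, K * d)
    V = np.zeros_like(Z)

    def fitness(z):
        centroids = z.reshape(K, d)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return np.sum(dists.min(axis=1) ** 2)        # total squared error (lower is better)

    P = Z.copy()                                     # personal best positions p_i
    P_fit = np.array([fitness(z) for z in Z])
    g = P[P_fit.argmin()].copy()                     # global best position p_g

    for _ in range(iters):
        phi1 = rng.random(Z.shape)
        phi2 = rng.random(Z.shape)
        V = WI * V + c1 * phi1 * (P - Z) + c2 * phi2 * (g - Z)   # Eq. (3.29)
        Z = Z + V                                                # Eq. (3.30)
        for i, z in enumerate(Z):
            f = fitness(z)
            if f < P_fit[i]:
                P_fit[i] = f
                P[i] = z.copy()
        g = P[P_fit.argmin()].copy()

    centroids = g.reshape(K, d)
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return centroids, labels

# centroids, labels = pso_clustering(np.random.rand(200, 2), K=3)
```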

3.3.7 Neural network-based clustering

Neural network-based clustering is closely related to the concept of competitive learning, which is traced back to the early works of von der Malsburg (1973), Fukushima (1975), and Grossberg (Wen, Mukherjee, & Ray, 2013). According to Rumelhart and Zipser (1985), a competitive learning scheme consists of the following three basic components:



FIG. 3.7 A competitive learning network with excitatory connections from the input neurons to the output neurons. Each neuron in the competitive layer is associated with a weight vector wj. The neuron that is nearest to the input pattern, based on the prespecified similarity function, is fired, and its prototype is adapted to the input pattern thereafter. However, updating will not occur for the other losing neurons.

1. "Start with a set of units that are all the same except for some randomly distributed parameter which makes each of them respond slightly differently to a set of input patterns.
2. Limit the "strength" of each unit.
3. Allow the units to compete in some way for the right to respond to a given subset of inputs."

Specifically, a two-layer feedforward neural network that implements the idea of competitive learning is depicted in Fig. 3.7. The neurons in the input layer admit input patterns1 and are fully connected to the output neurons in the competitive layer. Each output neuron corresponds to a cluster and is associated with a prototype or weight vector w_j, j = 1, ..., K, where K is the number of clusters, stored in terms of synaptic weights w_ji, i = 1, ..., d, representing the connection between input neuron i and output neuron j.

¹ i.e., data objects. In the terminology of neural network-based clustering, input pattern is more commonly used.


Upon the presentation of an input pattern x = {x_1, ..., x_d}, the similarity between the weight vector w_j of the randomly initialized cluster j and x is calculated as the net activation v_j,

s(x, w_j) = v_j = w_j^T x = \sum_{i=1}^{d} w_{ji} x_i.     (3.31)

In the competitive layer, only the neuron with the largest net activation value, that is, the one that best matches the given input pattern, becomes activated or fired, written as²,

J = \arg\max_j s(x, w_j).     (3.32)

The weight vector of the winning neuron J is further moved toward the input pattern following the updating equation, known as the instar rule,

w_J(t+1) = w_J(t) + \eta \, (x(t) - w_J(t)),     (3.33)

where η is the learning rate. This is known as winner-take-all (WTA), or hard (crisp) competitive learning (Yang, 1993). In contrast, learning can also occur in a cooperative way, known as soft competitive learning or winner-take-most (WTM), in which not only the winning neuron adjusts its prototype; all the other cluster prototypes also have the opportunity to be adapted, based on how close they are to the input pattern. Another observation from the procedure above is that the competitive network processes one input pattern at a time (online or incremental mode), rather than updating the prototypes after all input patterns have been presented (batch mode). An important problem related to competitive learning-based online clustering is stability. Moore (1988) defines the stability of an incremental clustering algorithm in terms of two conditions: "(1) No prototype vector can cycle, or take on a value that it had at a previous time (provided it has changed in the meantime). (2) Only a finite number of clusters are formed with infinite presentation of the data." Instability in competitive algorithms is caused by their plasticity, which they require to adapt to important new patterns. However, this plasticity may cause the memories of prior learning to be lost, worn away by recently learned knowledge. Carpenter and Grossberg (Carpenter & Grossberg, 1987; Grossberg, 1980) refer to this problem as the stability-plasticity dilemma, i.e., how adaptable (plastic) should a learning system be so that it does not suffer from catastrophic loss of previously learned rules (stability)? Resolving this dilemma is the prerequisite to truly robust online learning. Adaptive resonance theory (ART) was developed by Carpenter and Grossberg (1987) as a solution to the stability-plasticity dilemma. ART can learn arbitrary input patterns in a stable, fast, and self-organizing way, thus overcoming the learning instability that plagues many other competitive networks. ART is not, as is popularly imagined, a neural network architecture; it is a learning theory hypothesizing that resonance in neural circuits can trigger fast learning. As such, it subsumes a large family of current and future neural network architectures with many variants. ART1 is the first member, which deals only with binary input patterns (Carpenter & Grossberg, 1987), although it can be extended to arbitrary input patterns by utilizing a variety of coding mechanisms.

² Similar conclusions can be made when dissimilarity between pairs of input patterns is considered.
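A minimal sketch of the hard competitive (WTA) learning loop of Eqs. (3.31)-(3.33) is given below in Python with NumPy; the library choice, the learning rate, the number of epochs, and the toy data are our own illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def wta_competitive_learning(X, K, eta=0.1, epochs=10, seed=0):
    """Online winner-take-all competitive learning (Eqs. 3.31-3.33).

    X : (N, d) array of input patterns, one row per pattern.
    K : number of output neurons (clusters).
    Returns the (K, d) matrix of learned prototype (weight) vectors.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialize each prototype to a randomly chosen input pattern.
    W = X[rng.choice(N, size=K, replace=False)].astype(float)
    for _ in range(epochs):
        for x in X[rng.permutation(N)]:          # online (incremental) mode
            v = W @ x                            # net activations, Eq. (3.31)
            J = int(np.argmax(v))                # winning neuron, Eq. (3.32)
            W[J] += eta * (x - W[J])             # instar rule, Eq. (3.33)
    return W

# Toy usage: two well-separated Gaussian blobs, two prototypes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
print(wta_competitive_learning(X, K=2))
```

In practice the input patterns (and often the weights) are normalized so that the inner-product similarity of Eq. (3.31) behaves like a proximity measure; distance-based winner selection, as used by LVQ and SOFM later in this section, is an equally common choice.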


ART2 extends the applications to analog input patterns (Carpenter & Grossberg, 1987), and Fuzzy ART (FA) incorporates fuzzy set theory into ART (Carpenter, Grossberg, & Rosen, 1991); FA is typically regarded as a superior alternative to ART2 (see Algorithm V for FA). The hyper-rectangular cluster representation in the feature space can be replaced geometrically with hyperellipsoidal clusters, as in Gaussian ART (Williamson, 1996), Bayesian ART (Vigdor & Lerner, 2007), and Ellipsoid ART (Xu, Anagnostopoulos, & Wunsch, 2002). The ease with which ART is used for hierarchical clustering is demonstrated in the ART tree method, a hierarchy in which the same input pattern is sent to every level (Wunsch, Caudell, Capps, & Falk, 1991); which ART units in a given level get to look at the input is determined by the winning nodes of layer F2 at a lower level. By incorporating two ART modules that receive input patterns (ARTa) and corresponding label information (ARTb), respectively, together with an inter-ART module, the resulting ARTMAP system can be used for supervised classification (Carpenter, Grossberg, & Reynolds, 1991).
The integration of ART with Q-learning also extends the capability of ART in dealing with


reinforcement signals (Brannon et al., 2009). For supervised learning, a good starting method is Fuzzy ART, of which many of the above are refinements. Its algorithm is given below.

Algorithm V: Fuzzy ART.
1. Transform each input pattern into its complement coding form (Carpenter, Grossberg, & Reynolds, 1991; Carpenter, Grossberg, & Rosen, 1991), and initialize the weights of the uncommitted neuron to 1;
2. Present a new pattern x, and calculate the input from layer F1 to layer F2 by means of the category choice function,

T_j = \frac{|x \wedge w_j|}{\alpha + |w_j|},     (3.34)

where \wedge is the fuzzy AND operator defined by

(x \wedge y)_i = \min(x_i, y_i),     (3.35)

and α > 0 is the choice parameter;
3. Activate layer F2 by choosing neuron J based on the winner-take-all rule,

T_J = \max_j \{ T_j \};     (3.36)

4. Compare the expectation from layer F2 with the input pattern by the vigilance test. If

\rho \le \frac{|x \wedge w_J|}{|x|},     (3.37)

go to step 5a; otherwise, go to step 5b.
5. a. Update the corresponding weights for the active neuron as

w_J(\mathrm{new}) = \beta \, (x \wedge w_J(\mathrm{old})) + (1 - \beta) \, w_J(\mathrm{old}),     (3.38)

where β ∈ [0, 1] is the learning rate parameter. If J is an uncommitted neuron, create a new uncommitted neuron with the initial values set as in step 1;
   b. Send a reset signal to disable the current active neuron by the orienting subsystem, and return to step 3;
6. Return to step 2 until all patterns are processed.
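To make Algorithm V concrete, here is a compact Python/NumPy sketch of Fuzzy ART; the vigilance, choice, and learning-rate values, the fast-commit handling of new categories, and the toy data are illustrative assumptions, not settings taken from the text.

```python
import numpy as np

def fuzzy_art(X, rho=0.75, alpha=0.001, beta=1.0):
    """Minimal Fuzzy ART (Algorithm V). X is an (N, d) array scaled to [0, 1].

    Returns (labels, W): a category label per pattern and the weight matrix,
    one row (in complement-coded space) per committed category.
    """
    N, d = X.shape
    I = np.hstack([X, 1.0 - X])          # step 1: complement coding, |I_j| = d
    W = np.ones((0, 2 * d))              # no committed categories yet
    labels = np.full(N, -1, dtype=int)

    for n, x in enumerate(I):
        if W.shape[0] == 0:
            W = np.vstack([W, x.copy()])                 # first pattern commits a category
            labels[n] = 0
            continue
        match = np.minimum(x, W)                         # fuzzy AND, Eq. (3.35)
        T = match.sum(axis=1) / (alpha + W.sum(axis=1))  # choice function, Eq. (3.34)
        for J in np.argsort(-T):                         # WTA with resets, Eq. (3.36)
            if match[J].sum() / x.sum() >= rho:          # vigilance test, Eq. (3.37)
                W[J] = beta * match[J] + (1 - beta) * W[J]   # learning, Eq. (3.38)
                labels[n] = J
                break
        else:                                            # every committed neuron was reset
            W = np.vstack([W, x.copy()])                 # commit a new category
            labels[n] = W.shape[0] - 1
    return labels, W

# Toy usage: two tight groups in [0, 1]^2 are separated at rho = 0.75.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.0, 0.2, (20, 2)), rng.uniform(0.7, 0.9, (20, 2))])
labels, W = fuzzy_art(X)
print(np.unique(labels))
```

With β = 1 (fast learning), each weight vector is simply the fuzzy AND of all patterns assigned to its category, i.e., the hyper-rectangle enclosing them; raising ρ produces more, tighter clusters.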


In contrast to WTA, WTM, otherwise known as soft competitive learning, allows learning to occur for other neurons in addition to the winner. The leaky learning model (Rumelhart & Zipser, 1985) addresses the underutilized, or dead, neuron problem in hard competitive learning by moving both winning and losing neurons toward the presented input pattern, but at a much faster pace for the winning neuron. Thus, even if the weight vector of a neuron is initialized farther away from any input pattern than the other weight vectors, it still has the opportunity to learn from the input patterns. Another common strategy for dealing with the dead neuron problem is to add a conscience to hard competitive learning (Moore, 1988; Wen et al., 2013), which penalizes neurons that win very often and thus provides opportunities for other neurons to be trained. Rival penalized competitive learning (RPCL) (Xu, Krzyżak, & Oja, 1993) addresses the insufficiency of conscience strategy-based approaches when the number of clusters in the competitive layer is not selected appropriately. During RPCL learning, not only is the winning neuron updated, but the weight vector of the second winner, called the rival of the winner, is "de-learned" to push it away from the input pattern. Thus, the extra weight vectors caused by an overestimate of the number of clusters are driven away from the high-density regions, avoiding the confusion caused by redundancy.
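The RPCL update itself takes only a few lines. The sketch below (Python/NumPy, with an illustrative de-learning rate much smaller than the learning rate) moves the winner toward the input and pushes the rival slightly away; the full algorithm of Xu, Krzyżak, and Oja (1993) additionally weights the competition by each neuron's relative winning frequency, which this sketch omits.

```python
import numpy as np

def rpcl_step(W, x, eta_w=0.05, eta_r=0.005):
    """One rival-penalized competitive learning step.

    W : (K, d) prototype matrix, updated in place.
    x : (d,) input pattern.
    eta_w, eta_r : learning and de-learning rates (eta_r << eta_w).
    """
    d2 = np.sum((W - x) ** 2, axis=1)      # squared distances to all prototypes
    winner, rival = np.argsort(d2)[:2]     # closest and second-closest neurons
    W[winner] += eta_w * (x - W[winner])   # the winner learns toward x
    W[rival]  -= eta_r * (x - W[rival])    # the rival is "de-learned" away from x
    return winner
```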


Learning vector quantization (LVQ) is another family of models based on competitive learning, including unsupervised LVQ (Kohonen, 1989), generalized LVQ (Pal, Tsao, & Bezdek, 1993), and fuzzy LVQ (FLVQ) (Tsao, Bezdek, & Pal, 1994), although LVQ is also used to refer to a family of supervised classifiers (Kohonen, 2001). The basic architecture of unsupervised LVQ is similar to that described in Fig. 3.7, except that the Euclidean metric is used to measure the distance between the prototype vectors and the input pattern; thus, it belongs to the category of hard competitive learning. In order to overcome the limitations of unsupervised LVQ, such as sensitivity to initialization, the generalized LVQ (GLVQ) algorithm (Pal et al., 1993) was developed to explicitly infer the learning rules from the optimization of a cost function. Soft competitive learning is adopted for the neurons in the competitive layer, and the influence of the input pattern on the other neurons depends on the degree of its match with the winning neuron. Integrating fuzzy memberships of input patterns into the cost function for data scaling invariance leads to a family of competitive learning paradigms known as GLVQ-F (Karayiannis & Pai, 1996). FLVQ also incorporates the fuzzy membership function into the learning rule and can automatically determine the size of the update neighborhood without requiring the neighborhood to be defined. Moreover, a family of batch LVQ algorithms, known as the extended FLVQ family (EFLVQ-F), was introduced by explicitly inferring the learning rules from the minimization of a cost function, defined as the average generalized mean between the prototypes and the input pattern, with gradient descent search (Karayiannis & Bezdek, 1997). EFLVQ-F generalizes the learning rules of FLVQ and FCM, which has a restricted weighting exponent under certain conditions.


Self-organizing feature maps (SOFMs), or self-organizing maps, developed from the work of von der Malsburg (1973), Grossberg (Carpenter & Grossberg, 1987, 1991), and Kohonen (Pal et al., 1993). SOFM represents high-dimensional input patterns with prototype vectors that can be visualized in, usually, a two-dimensional lattice structure, or sometimes a one-dimensional linear structure, while preserving the proximity relationships of the original data as much as possible (Pal et al., 1993). Each unit in the lattice is called a neuron, and the input patterns are fully connected to all neurons via adaptable weights. During training, neighboring input patterns are projected into the lattice, corresponding to adjacent neurons. These adjacent neurons are connected to each other, giving a clear topology of how the network fits into the input space. Therefore, regions with a high probability of occurrence of sampled patterns are represented by larger areas in the feature map. In this sense, some researchers prefer to think of SOFM as a method for displaying latent data structures in a visual way rather than as a clustering approach (Kohonen, 2001). However, SOFM can be integrated with other clustering approaches, such as K-means or hierarchical clustering, to reduce the computational cost and provide fast clustering (Vesanto & Alhoniemi, 2000). During its learning, SOFM updates a set of weight vectors within the neighborhood of the winning neuron, which is determined in a topological sense, with the neighborhood size decreasing monotonically (Kohonen, 2001). Thus, the updating neighborhood starts with a wide field and gradually shrinks with time until there are no other neurons inside. Correspondingly, the learning paradigm transitions from soft competitive learning, which updates a neighborhood of neurons, to hard competitive learning, which updates only the winner. The basic procedure of SOFM is summarized in Algorithm VI.

Algorithm VI: Self-organizing feature maps.
1. Determine the topology of the SOFM. Initialize the weight vectors w_j(0), j = 1, ..., K, randomly;
2. Present an input pattern x to the network. Choose the winning node J that has the minimum Euclidean distance to x,

J = \arg\min_j \| x - w_j \|;     (3.39)

3. Calculate the current learning rate and size of the neighborhood;


4. Update the weight vectors of all the neurons in the neighborhood of J,

w_j(t+1) = w_j(t) + h_{Jj}(t) \, (x - w_j(t)),     (3.40)

where h_{Jj}(t) is the neighborhood function, defined, for example, as

h_{Jj}(t) = \eta(t) \exp\!\left( -\frac{\| r_J - r_j \|^2}{2 \sigma^2(t)} \right),     (3.41)

where r_J and r_j represent the positions of the corresponding neurons on the lattice, and σ(t) is the monotonically decreasing Gaussian kernel width function.
5. Repeat steps 2 through 4 until the change in the neurons' positions is below a prespecified small positive number.
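A minimal NumPy sketch of Algorithm VI on a one-dimensional lattice follows; the lattice size, decay schedules, and iteration count are illustrative choices rather than values prescribed by the text.

```python
import numpy as np

def sofm(X, K=10, n_iter=2000, eta0=0.5, sigma0=None, seed=0):
    """Self-organizing feature map on a 1-D lattice of K neurons (Algorithm VI).

    X : (N, d) array of input patterns.
    Returns the (K, d) matrix of prototype vectors.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = X[rng.choice(N, size=K, replace=True)].astype(float)   # step 1
    r = np.arange(K, dtype=float)            # neuron positions on the lattice
    sigma0 = sigma0 or K / 2.0
    for t in range(n_iter):
        x = X[rng.integers(N)]                                  # step 2
        J = int(np.argmin(np.sum((W - x) ** 2, axis=1)))        # Eq. (3.39)
        frac = t / n_iter
        eta = eta0 * (1.0 - frac)                               # step 3: decaying rate
        sigma = sigma0 * np.exp(-3.0 * frac)                    # shrinking neighborhood
        h = eta * np.exp(-((r - r[J]) ** 2) / (2.0 * sigma ** 2))  # Eq. (3.41)
        W += h[:, None] * (x - W)                               # step 4: Eq. (3.40)
    return W
```

For the usual two-dimensional map, r becomes an array of grid coordinates and the squared lattice distance in Eq. (3.41) is computed per neuron; the update rule is otherwise unchanged.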


Magnetic resonance imaging (MRI) provides a visualization of internal tissues and organs for disease diagnosis (such as cancer and heart and vascular disease), treatment and surgical planning, image registration, location of pathology, and related tasks. As an important image-processing step, the goal of MRI segmentation is to partition an input image into significant anatomical areas that are uniform and homogeneous with respect to some image properties (Bezdek, Hall, & Clarke, 1993; Pham, Xu, & Prince, 2000). MRI segmentation can be formulated as a clustering problem in which a set of feature vectors, obtained through image measurement and position transformation, is grouped into a relatively small number of clusters corresponding to image segments (Bezdek et al., 1993; Karayiannis, 1997). Because the number of clusters is much smaller than the number of intensity levels in the original image, such an unsupervised clustering process effectively removes redundant information from the MR images.

An application of LVQ and FALVQ to the segmentation of MR images of the brain of a patient with meningioma was illustrated in Ruan, Moretti, Fadili, and Bloyet (2001). In this study, the feature vectors at every image location are composed of pixel values of the T1 (spin-lattice relaxation time)-weighted, T2 (spin-spin relaxation time)-weighted, and SD (spin density) images, respectively. The tumor is located in the right frontal lobe, appearing bright on the T2-weighted image and dark on the T1-weighted image. After the patient was given gadolinium, the tumor on the T1-weighted image became very bright and isolated from the surrounding tissue. There is also a large amount of edema surrounding the tumor, appearing very bright on the T2-weighted image. For all the analyses with LVQ and with the algorithms from the FALVQ1, FALVQ2, and FALVQ3 families, the number of clusters is set to 8, leading to 8 different segments. The empirical results show that LVQ can identify the edema but is incapable of discriminating the tumor from the surrounding tissue. The performance of all FALVQ families of algorithms is more promising, as both the tumor and the edema can be identified successfully. More segmentation results with different parameter selections, and further discussion, can be found in Karayiannis (2000) and Ruan et al. (2001). Research on the clustering of dynamic contrast-enhanced perfusion MRI time series with SOFM, minimal free energy vector quantization (Rose, Gurewitz, & Fox, 1992), and FCM demonstrates that these methods are useful extensions of the conventional perfusion parameter maps (Wismuller, Meyer-Baese, Lange, Reiser, & Leinsinger, 2006). A comparison of several major neural and fuzzy clustering algorithms, including SOFM, minimal free energy vector quantization, neural gas (Martinetz, Berkovich, & Schulten, 1993), FCM, and the Gath-Geva algorithm, on fMRI datasets is reported in Lange, Meyer-Baese, Hurdal, and Foo (2006).

SOFM is also one of the major clustering algorithms applied to gene expression data, partially because of its good visualization. SOFM, implemented as a software package called GENECLUSTER (Tamayo et al., 1999), was used for human hematopoietic differentiation modeled with four cell lines (HL-60, U937, Jurkat, and NB4 cells), representing 1036 genes with a 6 × 4 SOFM. Genomically relevant clusters provide new and useful


insights into hematopoietic differentiation. Using SOFM to cluster human acute leukemias (Pal et al., 1993) was one of the earliest practices of cancer class discovery and popularized other gene expression profile-based cancer research. In this case, samples from acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), further classified into the subtypes of T-lineage ALL and B-lineage ALL, were effectively discriminated with a modification of the GENECLUSTER package. A multi-SOFM, consisting of a hierarchy of SOFM grids, was used to cluster macrophage gene expression data with the aim of reducing the dependency of SOFM on the user-specified number of clusters (Ghouila et al., 2009). Approaches for further improving the visualization in SOFM for gene expression data analysis and for organizing neurons into clusters are presented in Fernandez and Balzarini (2007).

ART methods have also been successfully applied to gene expression data analysis. Xu et al. achieved better performance with Fuzzy ART on the small round blue-cell tumor dataset, comprising 83 samples from four categories, than with standard hierarchical clustering and K-means (Xu, Damelin, Nadler, & Wunsch, 2010). Gene filtering methods, which are based exclusively on statistical characteristics such as the correlation coefficient, variance, and double-hump probability density, without any prior information, were also suggested to remove noninformative genes for cancer discrimination. Another application of Fuzzy ART is to expression data collected during sporulation of Saccharomyces cerevisiae (Tomida, Hanai, Honda, & Kobayashi, 2002), focusing on the 45 genes that are involved in meiosis and sporulation. Within a partition of 5 clusters, the 14 characterized early genes, 2 mid-late genes, and 3 late genes are all correctly organized into the corresponding clusters, with only 2 middle genes incorrectly assigned. Furthermore, how genes behave during the entire process of sporulation can be observed clearly from the cluster prototypes. Ellipsoid ART with its hyperellipsoidal cluster representation also successfully separates the ALL and AML samples of the acute leukemia dataset and further identifies the subcategories of ALL (Xu et al., 2002).



3.3.8

Kernel learning-based clustering

Kernel-based learning, inspired by support vector machines (Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001; Vapnik, 1998), has become increasingly important in pattern recognition and machine learning (Schölkopf, Burges, & Smola, 1999; Vapnik, 1998). It typically uses a linear hyperplane to separate patterns, achieved by nonlinearly transforming a set of complex and nonlinearly separable patterns into a higher-dimensional feature space.³ The difficulty arising from the requirement of explicitly defining the nonlinear mapping Φ(·), which is time-consuming and sometimes infeasible, can be overcome by the kernel trick, k(x_i, x_j) = Φ(x_i) · Φ(x_j), based on Mercer's theorem (Girolami, 2002). Some commonly used kernel functions include polynomial kernels, Gaussian radial basis function (RBF) kernels, and sigmoid kernels (Müller et al., 2001). Using the kernel trick in clustering makes it possible to explore potentially nonlinear structures in the data that may go unnoticed by traditional clustering algorithms in the Euclidean space.

³ The dimension of the feature space can be infinite.
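As a small illustration of the kernel trick (a Python/NumPy sketch; the parameter values are arbitrary choices, not taken from the text), the Gram matrix with entries k(x_i, x_j) can be computed directly from the data without ever forming Φ(x):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def polynomial_kernel(X, degree=3, c=1.0):
    """Polynomial kernel matrix: K[i, j] = (x_i . x_j + c) ** degree."""
    return (X @ X.T + c) ** degree
```

All of the kernel-based clustering methods discussed below (support vector clustering, kernel-K-means, kernel fuzzy clustering) consume such a Gram matrix rather than explicit feature-space coordinates.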



The support vector clustering (SVC) algorithm has the objective of finding the smallest enclosing hypersphere (although the hypersphere can be replaced with a hyperellipsoid (Wang, Shi, Yeung, Tsang, & Ann Heng, 2007; Zafeiriou, Laskaris, & Transform, 2008)) in the transformed high-dimensional feature space that contains most of the images of the data objects (Ben-Hur, 2008). The hypersphere is then mapped back to the original data space to form a set of contours defining the cluster boundaries. The SVC algorithm includes two major steps: SVM training and cluster labeling. SVM training determines the hypersphere construction and the kernel radius function that defines the distance from the image of an object in the feature space to the hypersphere center. The objects that lie on the cluster boundaries are defined as support vectors (SVs), whose images lie on the surface of the hypersphere. Cluster labeling aims to assign each data object to its corresponding cluster according to the adjacency matrix A, which is based on the observation that any path connecting a pair of data objects belonging to different clusters must exit the hypersphere in the feature space. Clusters are then taken to be the connected components of the graph induced by the adjacency matrix (see Algorithm VII).

Algorithm VII: Support vector clustering.
1. Formulate the primal optimization problem as

\min J = R^2 + C \sum_j \xi_j,     (3.42)

subject to

\| \Phi(x_j) - a \|^2 \le R^2 + \xi_j,     (3.43)

where Φ: R^d → F is a nonlinear map, R is the minimal radius of the hypersphere, a is the center of the hypersphere, and C \sum_j \xi_j is a penalty term with slack variables ξ_j ≥ 0 allowing soft constraints and C a soft margin constant;
2. Solve the dual quadratic optimization problem

\max_{\beta_j} W = \sum_j \beta_j \Phi(x_j)^2 - \sum_i \sum_j \beta_i \beta_j \, (\Phi(x_i) \cdot \Phi(x_j)),     (3.44)

subject to the constraints

0 \le \beta_j \le C, \quad j = 1, \ldots, N,     (3.45)

\sum_j \beta_j = 1,     (3.46)

where β_j ≥ 0 are Lagrange multipliers;


3. Construct the cluster boundaries as the set of contours that enclose the objects in the original input space, {x : R(x) = R}, where the kernel radius function is written as

R^2(x) = \| \Phi(x) - a \|^2 = k(x, x) - 2 \sum_j \beta_j k(x_j, x) + \sum_i \sum_j \beta_i \beta_j k(x_i, x_j);     (3.47)

4. Calculate the adjacency matrix,

A_{ij} = \begin{cases} 1, & \text{if } R(x_i + \gamma (x_j - x_i)) \le R \text{ for all } \gamma \in [0, 1], \\ 0, & \text{otherwise}; \end{cases}     (3.48)

5. Assign a label to each data object.
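A sketch of the cluster-labeling stage (steps 3-5) is shown below in Python; it assumes the multipliers beta have already been obtained from the dual problem of Eq. (3.44) by some quadratic programming solver, uses SciPy's connected-components routine for step 5, and samples the line segment of Eq. (3.48) at a finite number of points. All of these are implementation choices of ours rather than prescriptions of the text.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def rbf(a, b, gamma=1.0):
    """Gaussian kernel on a pair of vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svc_labels(X, beta, R, kernel=rbf, n_samples=10):
    """Cluster labeling for support vector clustering (Eqs. 3.47-3.48).

    X    : (N, d) data objects.
    beta : (N,) Lagrange multipliers from the dual problem, Eq. (3.44).
    R    : hypersphere radius (the value of sqrt(R^2(x)) at any support vector).
    """
    N = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    const = beta @ K @ beta                       # third term of Eq. (3.47)

    def radius2(x):
        kx = np.array([kernel(x, xi) for xi in X])
        return kernel(x, x) - 2.0 * beta @ kx + const   # Eq. (3.47)

    A = np.zeros((N, N), dtype=bool)              # adjacency matrix, Eq. (3.48)
    gammas = np.linspace(0.0, 1.0, n_samples)
    for i in range(N):
        for j in range(i, N):
            segment = X[i] + gammas[:, None] * (X[j] - X[i])
            A[i, j] = A[j, i] = all(radius2(p) <= R ** 2 for p in segment)

    # Step 5: clusters are the connected components of the graph induced by A.
    _, labels = connected_components(A, directed=False)
    return labels
```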

The idea of SVC can be extended by representing each cluster with its own hypersphere instead of using just one hypersphere for the whole feature space. One way to achieve this one-hypersphere-per-cluster representation is to use an iterative method similar to K-means, i.e., repeatedly adjusting the cluster centroids in the feature space and assigning data objects to the nearest cluster until there is no change in the Voronoi sets (Camastra & Verri, 2005). In contrast, the multiple-sphere support vector clustering (MSVC) algorithm adopts a mechanism similar to ART's orienting subsystem (Moore, 1988) in order to create clusters dynamically (Chiang & Hao, 2003). Upon the presentation of a data object, the formed clusters, each of which corresponds to a hypersphere, compete with each other based on the distance definition in the feature space. A vigilance test is then performed to ensure the eligibility of the winning cluster to represent the data object. A winning cluster that fails the test is disabled, triggering another round of competition among the remaining clusters. If no cluster can pass the vigilance test, a new cluster is created to enclose the data object as a support vector. The concept of a rough set (Zalewski, 1996) can be introduced into support vector clustering to achieve soft clustering (Asharaf, Shevade, & Murty, 2005). In this context, the rough hypersphere has both an inner radius R_I, representing its lower approximation, and an outer radius R_O (R_O > R_I), representing its upper approximation. If the images of data objects lie within the lower approximation in the feature space, these objects are considered to belong to one cluster exclusively. However, if the images are in the upper approximation but not in the lower approximation, the corresponding objects are regarded as being associated with more than one cluster.

Introducing a nonlinear version of a variety of existing linear clustering algorithms constitutes another major category of application for kernel learning in clustering. The premise is that the scalar product can be obtained in the feature space. For example, a straightforward way to transform the K-means algorithm into its kernel version is to use the kernel trick to calculate the distance between the images of the data objects and the cluster centroids defined in the feature space (or in the original data space),

\| \Phi(x_j) - m_i^{\Phi} \|^2 = \left\| \Phi(x_j) - \frac{1}{N_i} \sum_{l=1}^{N} \gamma_{il} \Phi(x_l) \right\|^2 = k(x_j, x_j) - \frac{2}{N_i} \sum_{h=1}^{N} \gamma_{ih} k(x_j, x_h) + \frac{1}{N_i^2} \sum_{l=1}^{N} \sum_{m=1}^{N} \gamma_{il} \gamma_{im} k(x_l, x_m),     (3.49)


which avoids the need to calculate the cluster centroids directly. The basic steps of the kernel-K-means algorithm are summarized in Algorithm VIII. The last step is required because the cluster centroids in the feature space cannot be expressed explicitly. The kernel-K-means algorithm can also be constructed by manipulating the sum-of-squared-error criterion function in the feature space (Girolami, 2002). An incremental kernel-K-means algorithm is described in Schölkopf, Smola, and Müller (1998). It is also worthwhile to point out that kernel-K-means can be interpreted in terms of information theoretic quantities such as Renyi quadratic entropy and integrated squared error (Robert, Torbjorn, Deniz, & Jose, 2006). In this context, kernel-K-means corresponds to the maximization of an integrated squared error divergence measure between the Parzen-window-estimated cluster probability density functions (Jenssen & Eltoft, 2007). Further discussions on the connection between information theoretic learning and kernel methods can be found in Jenssen, Erdogmus, Hild, Principe, and Eltoft (2007) and Jenssen, Erdogmus, Principe, and Eltoft (2005). More nonlinear kernel clustering algorithms following similar procedures and ideas include kernel principal component analysis (Müller et al., 2001), kernel deterministic annealing clustering (Yang, Song, & Zhang, 2006), kernel fuzzy clustering (Liu & Xu, 2008; Zhang & Chen, 2004; Zhou & Gan, 2004), and kernel self-organizing feature maps (András, 2002; Boulet, Jouve, Rossi, & Villa, 2008).

Algorithm VIII: Kernel-K-means algorithm.
1. Initialize a K-partition in the feature space;
2. Calculate the number of data objects and the third term of Eq. (3.49) for each cluster;


3. Assign each data object x_j, j = 1, ..., N, to the cluster whose centroid is closest to its image in the feature space,

x_j \in C_i, \quad i = \arg\min_l \| \Phi(x_j) - m_l^{\Phi} \|^2;     (3.50)

4. Repeat steps 2-3 until there is no change in any cluster;
5. Finally, the data object whose image is closest to the centroid is selected as the representative of the corresponding cluster.
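A compact Python/NumPy sketch of Algorithm VIII that works entirely from a precomputed Gram matrix is given below; the random initialization and the convergence test are implementation choices of ours.

```python
import numpy as np

def kernel_kmeans(K, n_clusters, max_iter=100, seed=0):
    """Kernel-K-means (Algorithm VIII) operating on a Gram matrix K.

    K : (N, N) kernel (Gram) matrix, K[i, j] = k(x_i, x_j).
    Returns an (N,) array of cluster labels.
    """
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    labels = rng.integers(n_clusters, size=N)        # step 1: random K-partition
    for _ in range(max_iter):
        D = np.zeros((N, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            Nc = max(mask.sum(), 1)
            # Feature-space distance of every object to centroid c, Eq. (3.49);
            # the k(x_j, x_j) term is constant across clusters and can be dropped.
            D[:, c] = (-2.0 / Nc) * K[:, mask].sum(axis=1) \
                      + K[np.ix_(mask, mask)].sum() / Nc ** 2   # step 2: third term
        new_labels = np.argmin(D, axis=1)            # step 3: Eq. (3.50)
        if np.array_equal(new_labels, labels):       # step 4: converged
            break
        labels = new_labels
    return labels
```

Step 5 (choosing a representative object per cluster) amounts to selecting, within each cluster, the object with the smallest value of the same distance D used in step 3.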

Support vector clustering with a Gaussian kernel was used to cluster functional MRI (fMRI) time series, mapped into a feature space of Fourier coefficients by the Fourier transform (Wang et al., 2005) (see Section 3.3.11 for more discussion on sequential data clustering). The objective is to partition fMRI time series into a set of clusters, each consisting of voxels exhibiting similar activation patterns. Experimental results on the auditory fMRI data from the Wellcome Department of Cognitive Neurology at University College London demonstrate that SVC is capable of detecting more continuous activated regions that are consistent with physiological knowledge, compared with the irrelevant voxels identified by other clustering algorithms, such as K-means (Goutte, Toft, Rostrup, Nielsen, & Hansen, 1999). Extending SVC by replacing the hypersphere with a hyperellipsoid in the feature space, in order to enhance the descriptive capability of the kernel radius function, displays further improvement over SVC (Ben-Hur, 2008). An application of SVC to gene expression data, such as the diffuse large B-cell lymphoma benchmark (Xu et al., 2010), was reported under a systematic framework for parameter tuning in SVC (Yilmaz, Achenie, & Srivastava, 2007). Other kernel learning-based clustering applications in gene expression data analysis include kernel PCA combined with fuzzy c-means (Liu, Dechang, Bensmail, & Ying, 2005), kernel-based SOFM (Papadimitriou & Likothanassis, 2004), and kernel hierarchical clustering (Qin, Lewis, & Noble, 2003). It is worthwhile to point out that kernel hierarchical clustering does not show any improved performance over the original linear version on the expression datasets that were used (Liu et al., 2005), which raises some questions for future research in kernel learning. As for MRI segmentation, kernelized fuzzy c-means shows better performance than the standard version, especially when a spatial penalty containing spatial neighborhood information is introduced into the cost function (Chen & Zhang, 2004; Hathaway & Bezdek, 2001). This increases robustness in segmenting images corrupted by outliers and noise. Kernel functions can also be used to calculate the similarity between pairs of genes based on Gene Ontology (Gene Ontology Consortium, 2004) terms in order to create biologically plausible functional gene clusters (Frohlich et al., 2006).

3.3.9

Large-scale data clustering

The ability to deal with large-scale data at a reasonable computational cost has become mandatory for any newly developed clustering algorithm because of the exponential increase of data produced by many scientific communities and research activities.


For even an otherwise excellent algorithm, a quadratic or higher computational complexity would dramatically limit its applicability. The following discussion reviews strategies and algorithms for large-scale data clustering in the categories of random sampling, data condensation, divide and conquer, incremental learning, density-based approaches, and grid-based approaches. In fact, many algorithms combine more than one strategy to improve scalability and thus belong to at least two of these categories. Random sampling is the simplest way to process large-scale data because only a small, random sample of the original dataset is considered. The sampling process can be uniform or biased, based on the specific requirements (Kollios, Gunopulos, Koudas, & Berchtold, 2003). For example, over-sampling the sparse regions reduces the likelihood of missing the small clusters in the data. In order to find the data partitions effectively, random sampling should maintain sufficient geometrical properties with regard to each potential cluster, i.e., clustering information should not be lost simply because no data objects from some cluster are included in the sample set. Sampling also screens out a proportionate share (i.e., the majority) of outliers. In fact, the lower bound of the minimum sample size can be estimated in terms of Chernoff bounds, given a low probability that clusters are missing in the sample set (Guha, Rastogi, & Shim, 1998). The algorithms CURE (Clustering Using Representatives) (Guha et al., 1998) and ROCK (Guha, Rastogi, & Shim, 2000) operate on such random sample sets, with CURE used for continuous features and ROCK for categorical features. Even though both algorithms are built on hierarchical clustering, which has at least quadratic computational complexity, they can still scale well for large-scale data clustering. Other examples based on the random sampling strategy include CLARA (Clustering LARge Applications) (Kaufman & Rousseeuw, 2008) and CLARANS (Clustering Large Applications based on RANdomized search) (Ng & Han, 2002).

Condensation-based approaches perform clustering by using calculated summary statistics of the original data rather than operating on the entire dataset. In this sense, the requirement for storage of, and frequent operations on, the large amount of data is greatly reduced, saving both computational time and storage space. The algorithm BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) (Zhang, Ramakrishnan, & Livny, 1996) is a typical example. The idea of BIRCH is to use a CF (Clustering Feature) tree to store the statistical summaries of the original data, which


capture the important clustering information of the original data. The basic unit of the CF tree for representing a cluster C_i is a clustering feature (CF), which is a 3-tuple CF_i = (N_i, LS_i, SS_i), consisting of the number of data objects N_i in the cluster, the linear sum of the N_i data objects, LS_i = \sum_{j=1}^{N_i} x_j, and the square sum of the N_i data objects, SS_i = \sum_{j=1}^{N_i} x_j^2.
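Because a clustering feature is a small, additive summary, two sub-clusters can be merged without revisiting the raw data. The Python sketch below (an illustration of the CF idea, not BIRCH's actual CF-tree implementation) shows the tuple, its additive merge, and the centroid and radius that can be computed from it alone.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ClusteringFeature:
    """CF_i = (N_i, LS_i, SS_i) summary of a sub-cluster."""
    n: int = 0
    ls: np.ndarray = None    # linear sum of the member objects
    ss: float = 0.0          # sum of squared norms of the member objects

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.ls = x.copy() if self.ls is None else self.ls + x
        self.n += 1
        self.ss += float(x @ x)

    def merge(self, other):
        """Merging two sub-clusters is just component-wise addition of their CFs."""
        self.n += other.n
        self.ls = other.ls.copy() if self.ls is None else self.ls + other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of the members to the centroid,
        # computable from the CF alone.
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))
```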

This data summary maintains sufficient information for further clustering operations in BIRCH, such as the merging of two clusters, while largely reducing the computational burden. The inclusion of categorical features leads to a different form of the clustering feature for a cluster C_i, with one more element added, a vector NB_i = (NB_{i1}, ..., NB_{iK_B}), where NB_{ik} = (N_{ik1}, ..., N_{ik(L_k - 1)}) and N_{ikl} is the number of data objects in C_i whose kth categorical feature takes the lth category, for l = 1, ..., L_k - 1 and k = 1, ..., K_B (Chiu, Fang, Chen, Wang, & Jeris, 2001). The concept of a CF and a CF tree in BIRCH is generalized under the BIRCH* framework, in which general metric spaces are considered; this framework is further realized as the algorithms BUBBLE and BUBBLE-FM (Goutte et al., 1999). BIRCH is also combined with other algorithms that use summary statistics for sub-cluster representation, such as EMADS (EM Algorithm for Data Summaries), which provides an approximation of the aggregate behavior of sub-clusters under the Gaussian mixture model (Huidong, Man-Leung, & Leung, 2005). The scalable clustering framework of Bradley, Fayyad, and Reina (1998) classifies data objects into three categories based on their importance to the clustering model: the retained set (RS), the discard set (DS), and the compression set (CS). The RS contains data objects that must be assessed all the time and is therefore always maintained in main memory. The DS includes data objects that are unlikely to move to a different cluster and is determined in the primary phase of the algorithm; in BIRCH, all data objects are placed into the DS. The CS consists of data objects that are further compressed in the secondary phase of the framework to generate sub-clusters represented by sufficient statistics. A simplified version of this clustering framework, together with a simple single-pass K-means algorithm, is described in Farnstrom, Lewis, and Elkan (2000). The divide-and-conquer strategy (Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003; Park & Lee, 2007) first divides the entire dataset, which is too large to fit in main memory, into subsets of similar sizes. Each subset is then loaded into main memory and presented to some clustering algorithm separately to generate a set of clusters. Representatives of these clusters, which can be the centroids of the clusters or a set of well-scattered data objects (Park & Lee, 2007), are then picked for further clustering. These representatives may be weighted based on some rule, e.g., the centroids of the clusters can be weighted by the number of objects they represent (Hinneburg & Keim, 1998). The algorithm repeatedly clusters the representatives obtained from the clusters at the previous level until the highest level is reached. The data objects are then placed into the corresponding clusters formed at the highest level, based on the representatives at the different levels. An application of divide-and-conquer to


the traveling salesman problem quadrupled the speed of the best competing heuristics, with a smaller memory footprint (Mulder & Wunsch, 2003; Wunsch & Mulder, 2004).

An incremental, or online, clustering algorithm processes a dataset one object at a time, making storage of the entire dataset unnecessary while enabling the algorithm to admit new input without learning from scratch. A typical incremental learning clustering algorithm is the adaptive resonance theory (ART) family (Moore, 1988), which constructively creates new clusters to represent data objects that do not match well with the existing clusters. Because only the prototype of each cluster must be stored in memory, the space requirement is fairly low for incremental clustering algorithms. In addition to stability, another major problem for incremental clustering algorithms is that they are order dependent; in other words, different presentation orders of the input objects may lead to different partitions of the data (Carpenter & Grossberg, 1987; Moore, 1988). This is frequently an acceptable tradeoff for the computational advantages.

Density-based approaches consider clusters as regions of data objects that have a higher density than the objects outside the clusters. In addition to good scalability, a direct advantage of such a definition is that the clusters can take arbitrary shapes, rather than being restricted to certain shapes that may not be effective in describing all


different types of data. The algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Ester et al., 1996) implements the concepts of density-reachability and density-connectivity to define clusters. Each data object that lies inside a cluster, called a core point, must contain sufficiently many neighboring objects in its neighborhood. If two core points are within each other's neighborhood, they belong to the same cluster. In contrast, border points are those on the border of a cluster that fail to contain enough neighbors to be core points. DBSCAN then creates a new cluster from a core point by absorbing all objects that are density-reachable from it. The algorithm OPTICS (Ordering Points To Identify the Clustering Structure) (Ankerst, Breunig, Kriegel, & Sander, 1999) is immune from the determination of the user-dependent parameters in DBSCAN and focuses on constructing an augmented ordering of the data representing its density-based clustering structure while processing a set of neighborhood radius parameters. A different way to reduce the dependence on user-specified parameters is suggested in the algorithm DBCLASD (Distribution Based Clustering of LArge Spatial Databases) (Xu et al., 1998); this method assumes that data objects inside a cluster follow a uniform distribution and attempts to build a characteristic probability distribution of the distance to a cluster's nearest neighbors. The algorithm DENCLUE (DENsity-based CLUstEring) (Hinneburg & Keim, 1998) models the overall density function over the data space based on the influence of each data object on its neighborhood, which can be described mathematically using an influence function. A center-defined cluster is then identified through a density attractor whose overall density lies above a density threshold, together with the set of objects attracted to that density attractor. Further, if a pair of density attractors can be connected by a path and each object on the path has a density above the threshold, the corresponding clusters merge and form a cluster of arbitrary shape. DENCLUE can be considered a generalization of both the K-means and DBSCAN algorithms with an appropriate selection of the influence function and its parameters (Hinneburg & Keim, 1998). Grid-based clustering algorithms achieve the final data partitioning by operating on a space partitioning, usually in the form of a set of cells or cubes obtained by applying a grid to the data space. The algorithm STING (STatistical INformation Grid) (Wang, Yang, & Muntz, 1997) is based on a hierarchical structure for the division of the data space. Cells are constructed at different levels of the hierarchy corresponding to different resolutions, with only one cell at the root level, which corresponds to the entire dataset. Clustering is performed in a top-down manner, starting from the root level or some intermediate layer. Cells that are relevant to certain conditions are determined based on their data summaries, and only the cells that are children of the relevant cells are examined further. After the bottom level is reached, a breadth-first search can be used to find the clusters whose densities are greater than a prespecified threshold. Thus, STING combines both data condensation and density-based clustering strategies.
Other grid-based clustering algorithms include WaveCluster (Sheikholeslami, Chatterjee, & Zhang, 1998), which considers clustering from a signal processing perspective, and FC (Fractal Clustering) (Barbará & Chen, 2000), which integrates the concepts of incremental clustering with fractal dimensions.
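To make the density-based idea concrete, the following minimal sketch (in Python, using scikit-learn's DBSCAN implementation) clusters a toy two-dimensional dataset with noise; the eps (neighborhood radius) and min_samples (neighbors required for a core point) values are illustrative choices, not recommendations for any particular biomedical dataset.

import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus scattered noise points.
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(0, 0.3, (100, 2)),
                   rng.normal(4, 0.3, (100, 2))])
noise = rng.uniform(-2, 6, (20, 2))
X = np.vstack([blobs, noise])

# eps is the neighborhood radius; min_samples is the number of neighbors
# required for a point to qualify as a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points labeled as noise:", int(np.sum(labels == -1)))

Points that are neither core points nor density-reachable from one are labeled -1 (noise), which reflects the robustness to outliers described above.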


3.3.10 High-dimensional data clustering

Data complexity increases not only with the number of data objects N, but with the dimension d as well. Advances in different domains make it possible to automatically and systematically obtain a large number of measurements, but, unfortunately, without precise identification of the relevance of the measured features to the specific phenomena of interest. Data observations with thousands of features, or more, are now common in the analysis of financial, genomic, sensor, web document, and satellite image data. The term "curse of dimensionality," first used by Bellman (1961) to indicate the exponential growth of complexity of multivariate function estimation in high-dimensional settings, is now generally used to describe the problems that accompany high-dimensional spaces (Durrant & Kabán, 2009; Haykin, 2009). High dimensionality not only greatly increases the computational burden and makes a clear visual examination of the data infeasible, but it may also hamper the separation of data objects. In practice, although data are represented with a large number of features, many of them are included only as a result of subjective measurement choices and contribute nothing to the description of the real structure of the data. Alternately, the minimum number of free features that provide sufficient information to represent the data is referred to as the intrinsic dimensionality, d0, which is much lower than the original dimension d (Camastra, 2003). In this way, the data are embedded in a low-dimensional and compact subspace.

S(i, j) = \max\{\, S(i-1, j-1) + e(x_i, y_j),\; S(i-1, j) + e(x_i, \phi),\; S(i, j-1) + e(\phi, y_j),\; 0 \,\}    (3.54)

The backtracking procedure can start from anywhere in the dynamic programming matrix that has the maximum similarity score, not necessarily from the bottom-right corner of the matrix. The process continues until a zero is reached. This is known as the Smith–Waterman algorithm (Smith & Waterman, 1985). Although dynamic programming algorithms are guaranteed to generate the optimal alignment or sets of alignments based on the similarity score (Duffy & Quiroz, 1991), their high computational complexity, which is O(NM), makes them less practical for typical sequence comparisons involving sequences on the order of millions of elements in length or longer. Shortcuts to dynamic programming have been developed that trade sensitivity for speed (Xu et al., 2010). Although some of the best-scoring alignments will be missed as a result of this tradeoff, these heuristics still provide effective ways of investigating sequence similarity and are therefore widely used in sequence clustering. The basic idea of the heuristic approaches is to tentatively identify the regions that may have potentially high matches using a list of prespecified, high-scoring words at an early stage. Thus, further searches only need to focus on these small regions, avoiding expensive computation on the full sequences. The most well-known examples of these algorithms include BLAST (Basic Local Alignment Search Tool) (Altschul, Gish, Miller, Myers, & Lipman, 1990) and FASTA (Pearson & Lipman, 1988), together with many of their variants (Altschul et al., 1997; Gusfield, 1997). BLAST is based on the assumption that a statistically significant alignment is more likely to contain a high-scoring pair of aligned words (Altschul et al., 1997). Accordingly, BLAST searches all high-scoring segment pairs (HSPs) of two compared sequences whose similarity scores exceed a prespecified threshold T1.


Such a search starts with the location of short stretches, or words, among the query sequence and the sequences in the database with a score no less than a threshold T2. Typically, the length of the words is 3 for proteins and 11 for nucleic acids (Durbin et al., 1998). BLAST then extends these words as seeds in both directions until no further improvement in the score of the extensions is observed. The original BLAST does not consider gaps during the alignment, a capability that became available in a later version, known as gapped BLAST. The integration of statistically significant alignments derived from BLAST with a position-specific score matrix leads to PSI-BLAST (Position-Specific Iterated BLAST), which is more sensitive to weak but biologically relevant sequence similarities.

FASTA stands for "FAST-All" because it works for both protein (FAST-P) and nucleotide (FAST-N) alignment. FASTA implements a hash table to store all words of length T (typically, 1 or 2 for protein and 4 or 6 for DNA (Aggarwal et al., 1999)) in the query sequence. Sequences in the database are then scanned to mark all the word matches, and the 10 best diagonal regions with the highest densities of word matches are selected for rescoring in the next step, where these diagonals are further extended; this strategy is similar to that used in BLAST (Altschul et al., 1997). The third step examines whether the obtained ungapped regions could be joined by considering gap penalties. Finally, standard dynamic programming is applied to a band around the best region to optimize the alignments.

Recognizing the benefit of separating word matching from sequence alignment to reduce the computational burden, Miller, Gurd, and Brass (1999) introduced the RAPID (Rapid Analysis of Pre-Indexed Data structures) algorithm for word search and the PHAT (Probabilistic Hough Alignment Tool) and SPLAT (Smart Probabilistic Local Alignment Tool) algorithms for alignment. The implementation of this scheme for large database-versus-database comparison exhibits approximately one order of magnitude improvement in computational time compared with BLAST, while maintaining good sensitivity.

Kent and Zahler designed a three-pass algorithm, called WABA (Wobble Aware Bulk Aligner) (Kent & Zahler, 2000), for aligning large-scale genomic DNA sequences of different species, e.g., the alignment of 8 million bases of Caenorhabditis briggsae genomic DNA against the entire 97 million bases of the Caenorhabditis elegans genome (Kent & Zahler, 2000). The specific consideration of WABA is to deal with a high degree of divergence in the third, or "wobble," position of a codon. The seeds used to start an alignment take the form XXoXXoXX, where the X's indicate positions that must match and the o's indicate positions that need not, instead of requiring a perfect match of 6 consecutive bases. A seven-state pairwise hidden Markov model (Smyth, 1997) was used for more effective alignments, where the states correspond to long inserts in the target and query sequences, highly and lightly conserved regions, and three coding regions. See Morgenstern, Frech, Dress, and Werner (1998), Zhang, Schwartz, Wagner, and Miller (2000), and Sæbø, Andersen, Myrseth, Laerdahl, and Rognes (2005) for more algorithms for sequence comparison and alignment.
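As a concrete illustration of the dynamic programming recurrence in Eq. (3.54), the following minimal Python sketch fills the Smith–Waterman score matrix and backtracks from the highest-scoring cell; the match, mismatch, and gap scores are illustrative defaults rather than the parameters of any tool discussed above.

import numpy as np

def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Minimal local-alignment sketch following the recurrence in Eq. (3.54)."""
    n, m = len(x), len(y)
    S = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e = match if x[i - 1] == y[j - 1] else mismatch
            S[i, j] = max(S[i - 1, j - 1] + e,   # substitution
                          S[i - 1, j] + gap,     # gap in y
                          S[i, j - 1] + gap,     # gap in x
                          0)                     # restart: local alignment
    # Backtracking starts at the highest-scoring cell, not at the corner,
    # and stops when a zero is reached.
    i, j = np.unravel_index(np.argmax(S), S.shape)
    ax, ay = [], []
    while i > 0 and j > 0 and S[i, j] > 0:
        e = match if x[i - 1] == y[j - 1] else mismatch
        if S[i, j] == S[i - 1, j - 1] + e:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif S[i, j] == S[i - 1, j] + gap:
            ax.append(x[i - 1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1]); j -= 1
    return int(S.max()), ''.join(reversed(ax)), ''.join(reversed(ay))

print(smith_waterman("ACACACTA", "AGCACACA"))

The double loop makes the O(NM) cost of the exact method explicit, which is why the heuristic shortcuts above are preferred for database-scale comparisons.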


With the distance or similarity information available, clustering algorithms that can operate on a proximity matrix can then be applied directly to cluster sequences. For example, Somervuo and Kohonen illustrated an application of a 30-by-20 hexagonal SOFM in clustering the protein sequences of the SWISS-PROT database, release 37, containing 77,977 protein sequences (Somervuo & Kohonen, 2000). Each neuron of the SOFM is represented by a prototype sequence, which is the generalized median of the neighborhood, based on the sequence similarity calculated by FASTA. Sasson et al. used agglomerative hierarchical clustering to cluster up to one million protein sequences based on the similarity measure of gapped BLAST (Kaplan et al., 2005; Sasson, Linial, & Linial, 2002). GeneRAGE is also based on an implementation of single-linkage hierarchical clustering with BLAST (Altschul et al., 1997). A family of partitional clustering-based protein sequence clustering algorithms includes Pro-k-means, Pro-LEADER, Pro-CLARA, and Pro-CLARANS (Fayech, Essoussi, & Limam, 2009), adapted from the corresponding clustering algorithms K-means, LEADER (Spath, 1980), CLARA (Kaufman & Rousseeuw, 2008), and CLARANS (Ng & Han, 2002). SEQOPTICS was developed as a combination of the density-based OPTICS (Ankerst et al., 1999) and the Smith–Waterman algorithm for clustering protein sequences (Fayech et al., 2009).
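Once pairwise similarities (e.g., alignment scores) have been converted into distances, any proximity-based algorithm can be applied directly. The following sketch applies average-linkage agglomerative hierarchical clustering with SciPy to a small hypothetical distance matrix; it is only meant to show the mechanics of clustering on a precomputed proximity matrix, not any of the specific systems cited above.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical pairwise distances between five sequences, e.g., derived by
# transforming alignment similarity scores (higher similarity -> smaller distance).
D = np.array([[0.00, 0.20, 0.90, 0.80, 0.85],
              [0.20, 0.00, 0.85, 0.90, 0.80],
              [0.90, 0.85, 0.00, 0.15, 0.25],
              [0.80, 0.90, 0.15, 0.00, 0.20],
              [0.85, 0.80, 0.25, 0.20, 0.00]])

# Average-linkage agglomerative clustering on the condensed distance matrix.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)                                     # e.g., [1 1 2 2 2]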

3.3.11.2 Feature-based sequence clustering

Feature-based sequence clustering treats the sequential data as raw and unprocessed. Feature extraction methods are then used to map a set of sequences onto corresponding data points in a transformed feature space, where classical vector space-based clustering algorithms can be used for clustering. In this way, expensive all-against-all sequence comparison is no longer necessary, making this strategy potentially useful for large-scale sequence clustering. Features can be defined as the sequential patterns within a prespecified length range, with a minimum support constraint, which satisfy the following important characteristics: (1) effectiveness in capturing the sequential relations between the different sets of elements contained in the sequences, (2) wide presence in the sequences, and (3) completeness of the derived feature space (Guralnik & Karypis, 2001). A sequential pattern refers to an ordered list of sets of elements for which the percentage of sequences containing it, called the support, is above some threshold. Both global and local approaches have been suggested to prune the initial feature sets, or to select the independent features, in order to remove the potential dependency between two sequential patterns. The resulting clusters of sequences can be formed by using clustering algorithms, such as K-means (Sheng & Liu, 2006) and agglomerative hierarchical clustering (Xu & Wunsch, 2009), on these types of sequential patterns. Guralnik and Karypis transformed protein and DNA sequences into a feature space based on the detected sub-patterns, or motifs, treated as sequence features, and they further performed clustering with the K-means algorithm (Guralnik & Karypis, 2001). An application of this approach in clustering 43,569 sequences from 20 different protein families discloses the 13 most functional classes, and most clusters include sequences from, at most, 2 major protein families.


The independent features are identified as 21,223 motifs of length 3–6 found by the pattern discovery algorithm. In the case of sequences with continuous data, the Haar wavelet decomposition can be calculated for all sequences to make the data amenable to analysis (Vlachos, Lin, & Keogh, 2003). K-means is then applied at each level of resolution, proceeding from coarser to finer levels. At each resolution, the centroids obtained from the previous level are used as seeds for the current level of clustering to alleviate the dependence of K-means on initialization.
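A simple feature-based scheme in the spirit of this section maps each sequence to a vector of k-mer counts and then applies K-means in the resulting feature space; the k-mer length, the toy sequences, and the number of clusters below are illustrative assumptions rather than values taken from the studies cited above.

from itertools import product
import numpy as np
from sklearn.cluster import KMeans

def kmer_counts(seq, k=3, alphabet="ACGT"):
    """Map a sequence to a fixed-length vector of k-mer counts (the feature space)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    v = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:
            v[index[kmer]] += 1
    return v

# Toy DNA sequences; real applications would involve thousands of sequences.
seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTTTGGGGTTTT", "TTTTGGGTTTTT"]
X = np.array([kmer_counts(s) for s in seqs])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the two sequence groups should separate, e.g., [0 0 1 1]

Because the feature vectors have a fixed length regardless of sequence length, no pairwise alignment is required, which is the main computational appeal of the feature-based strategy.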

3.3.11.3 Model-based sequence clustering

Model-based sequence clustering algorithms consider each cluster of sequences to be generated by some underlying probabilistic model. Thus, the dynamics and characteristics of sequence clusters can be described explicitly, which makes model-based sequence clustering more powerful in disclosing the properties of the sequences. Specifically, given a set of L sequences {o_1, o_2, ..., o_L} from K clusters {C_1, C_2, ..., C_K}, the mixture probability density for the entire dataset can be expressed as

p(o \mid \theta) = \sum_{i=1}^{K} p(o \mid C_i, \theta_i) P(C_i),    (3.55)

where p(o | C_i, \theta_i) is the component density with parameters \theta_i, or the class-conditional probability density, and the prior probability P(C_i) for the cluster C_i satisfies the constraint \sum_{i=1}^{K} P(C_i) = 1. The component density can take any functional form,
for example, autoregressive moving average (ARMA) models (Warren Liao, 2005; Xiong & Yeung, 2004), Markov chains (Ramoni, Sebastiani, & Cohen, 2002; Smyth, 1999), and polynomial models (Bagnall, Janacek, Iglesia, & Zhang, 2003; Gaffney & Smyth, 1999), or it can come from the same family but with different parameters. Among these models, the hidden Markov model (HMM) (Rabiner, 1989; Smyth, 1997) is widespread, having first gained popularity in the application of speech recognition (Rabiner, 1989). A discrete HMM describes an unobservable stochastic process consisting of a set of states, each of which is related to another stochastic process that emits observable symbols. Fig. 3.8 depicts an application of HMM in genomic sequence modeling and clustering, which is implemented in a system called SAM (Sequence Alignment and Modeling) (Krogh, Brown, Mian, Sjölander, & Haussler, 1994). The system consists of match states (abbreviated with the letter M), insert states (I), and delete states (D), corresponding to substitution, insertion, and deletion in edit operations, respectively (Hughey & Krogh, 1996; Krogh et al., 1994; Smyth, 1997). Symbols in the form of the 4-letter nucleotide alphabet or 20-letter amino acid alphabet are produced from match and insert states according to emission probability distributions, but not from delete states, which are used to skip the match states. For convenience, a begin state and an end state are added to the model, denoted by the letters B and E, which do not generate any amino acid.


FIG. 3.8 An HMM architecture (Smyth, 1997). The model consists of three kinds of states, match (M), insert (I), and delete (D), represented as rectangles, diamonds, and circles, respectively. A begin (B) and an end (E) state are also introduced to represent the start and end of the process. The process moves through a series of states according to the transition probabilities and emits symbols from either the 4-letter nucleotide alphabet or the 20-letter amino acid alphabet according to the emission probabilities.

Given an HMM with completely specified parameters, a DNA or protein sequence can be obtained by following the procedure below (a sketch of this sampler appears after the list):

1. Starting at the begin state B, transition to a match, insert, or delete state based on the state transition probabilities.
2. If the state is a
   a. match or insert state, generate an amino acid or nucleotide in the sequence according to the emission probabilities;
   b. delete state, generate no amino acid or nucleotide.
3. Transition to a new state according to the state transition probabilities. If it is not the end state, return to step 2. Otherwise, terminate the procedure and output the genomic sequence.

In order to cluster a set of sequences into K clusters, or families (subfamilies), K HMMs are required, each corresponding to a component of a mixture model. The parameters are estimated with the EM algorithm (Hughey & Krogh, 1996). Krogh et al. (1994) applied 10-component HMMs, with initial lengths randomly selected between 120 and 170, to cluster subfamilies of 625 globins with an average length of 145 amino acids and 3 nonglobins. HMM clustering successfully identifies the three major globin subfamilies: alpha, beta, and myoglobin.
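The generation procedure above can be made concrete with a toy sampler; the states, transition probabilities, and emission probabilities below are illustrative assumptions and are not the parameters of SAM or of any trained model.

import numpy as np

rng = np.random.default_rng(0)
ALPHABET = "ACGT"                      # 4-letter nucleotide alphabet

TRANS = {                              # illustrative state transition probabilities
    "B": {"M": 0.8, "I": 0.1, "D": 0.1},
    "M": {"M": 0.7, "I": 0.1, "D": 0.1, "E": 0.1},
    "I": {"M": 0.6, "I": 0.3, "E": 0.1},
    "D": {"M": 0.7, "D": 0.2, "E": 0.1},
}
EMIT = {                               # emission probabilities (match and insert only)
    "M": [0.4, 0.1, 0.1, 0.4],
    "I": [0.25, 0.25, 0.25, 0.25],
}

def generate_sequence():
    """Follow the three-step procedure: transition from B, emit from M or I,
    emit nothing from D, and stop when the end state E is reached."""
    state, seq = "B", []
    while True:
        nxt = rng.choice(list(TRANS[state]), p=list(TRANS[state].values()))
        if nxt == "E":
            return "".join(seq)
        if nxt in EMIT:                # match or insert states emit a symbol
            seq.append(rng.choice(list(ALPHABET), p=EMIT[nxt]))
        state = nxt                    # delete states emit nothing

print(generate_sequence())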

3.4 Adaptive resonance theory

Carpenter and Grossberg developed ART in 1987 as a solution to the stability–plasticity dilemma. ART can learn arbitrary input patterns in a stable, fast, and self-organizing way, thus overcoming the effect of learning instability that plagues many other competitive networks. ART is not, as is popularly imagined, a neural network architecture. It is a learning theory hypothesizing that resonance in neural circuits can trigger fast learning.


As such, it subsumes a large family of current and future neural network architectures with many variants. ART1 is the first member, which deals only with binary input patterns (Carpenter & Grossberg, 1987; Moore, 1988), although it can be extended to arbitrary input patterns by utilizing a variety of coding mechanisms. ART2 extends the applications to analog input patterns (Carpenter & Grossberg, 1987), and ART3 introduces a mechanism originating from elaborate biological processes to achieve more efficient parallel searches in hierarchical structures (Carpenter & Grossberg, 1987). Fuzzy ART (FA) incorporates fuzzy set theory into ART and can work for all real datasets (Carpenter, Grossberg, & Reynolds, 1991; Carpenter, Grossberg, & Rosen, 1991). (It is typically regarded as a superior alternative to ART2.) Peláiz-Barranco, Gutiérrez-Amador, Huanosta, and Valenzuela (1998) demonstrated hardware implementations and very-large-scale integration (VLSI) designs of ART systems. Wunsch et al. (1991) and Wunsch, Caudell, Capps, Marks, and Falk (1993) discussed an optical correlator-based ART implementation as an alternative to implementing ART in electronics. The rest of this chapter discusses Fuzzy ART, Fuzzy ARTMAP, and BARTMAP.

3.4.1 Fuzzy ART

Because Fuzzy ART serves as the basic module for both Fuzzy ARTMAP and BARTMAP, this section opens with a brief introduction to Fuzzy ART. The basic Fuzzy ART architecture consists of two layers of neurons: the feature representation field F1 and the category representation field F2, as shown in Fig. 3.9. The neurons in layer F1 are activated by the input pattern, which is normalized with the complement coding rule to avoid category proliferation. The prototypes of the formed clusters are stored in layer F2. The neurons in layer F2 that are already being used as representations of input patterns are said to be committed. Correspondingly, an uncommitted neuron encodes no input patterns. The two layers are connected via adaptive weights wj, emanating from node j in layer F2, which are initially set to 1.

FIG. 3.9 Topological structure of Fuzzy ART. Fuzzy ART performs fast, online, unsupervised learning by clustering input patterns, admitted in layer F1, into hyper-rectangular clusters stored in layer F2. The two layers are connected via adaptive weights W. The orienting subsystem is controlled by the vigilance parameter ρ.


After an input pattern is presented, the neurons (including a certain number of committed neurons and one uncommitted neuron) in layer F2 compete by calculating the category choice function

T_j = \frac{|A \wedge w_j|}{\alpha + |w_j|},    (3.56)

where \wedge is the fuzzy AND operator, defined by

(A \wedge w)_i = \min(A_i, w_i),    (3.57)

and \alpha > 0 is the choice parameter, which breaks the tie when more than one prototype vector is a fuzzy subset of the input pattern. The winner is selected according to the winner-take-all rule,

T_J = \max_j \{ T_j \}.    (3.58)

The winning neuron, J, then becomes activated, and an expectation is reflected in layer F1 and compared with the input pattern. The orienting subsystem, with the prespecified vigilance parameter \rho (0 \le \rho \le 1), determines whether the expectation and the input pattern are closely matched. If the match meets the vigilance criterion,

\rho \le \frac{|A \wedge w_J|}{|A|},    (3.59)

weight adaptation occurs, where learning starts, and the weights are updated using the following learning rule,

w_J(\mathrm{new}) = \beta (A \wedge w_J(\mathrm{old})) + (1 - \beta) w_J(\mathrm{old}),    (3.60)

where \beta \in [0, 1] is the learning rate parameter, and \beta = 1 corresponds to fast learning. This procedure is called resonance, which suggests the name ART. On the other hand, if the vigilance criterion is not met, a reset signal is sent back to layer F2 to shut off the current winning neuron, which remains disabled for the duration of the presentation of this input pattern, and a new competition is performed among the remaining neurons. The new expectation is then projected onto layer F1, and this process repeats until the vigilance criterion is met. In the case where an uncommitted neuron is selected for coding, a new uncommitted neuron is created to represent a potential new cluster, maintaining a consistent supply of uncommitted neurons.
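A minimal sketch of the Fuzzy ART dynamics described above (complement coding, category choice, vigilance test, and the learning rule of Eqs. (3.56)–(3.60)) is given below. For simplicity, the uncommitted neuron is handled by committing a new node whenever no committed node passes the vigilance test, and the parameter values are illustrative.

import numpy as np

def complement_code(X):
    """Complement coding: each input a in [0, 1]^d becomes A = [a, 1 - a]."""
    return np.hstack([X, 1.0 - X])

def fuzzy_art(X, rho=0.75, alpha=0.001, beta=1.0):
    """Minimal Fuzzy ART sketch; rho is the vigilance, alpha the choice
    parameter, and beta the learning rate (beta = 1 gives fast learning)."""
    A = complement_code(X)
    W = []                                   # committed prototypes (layer F2 weights)
    labels = np.empty(len(A), dtype=int)
    for n, a in enumerate(A):
        # Category choice (Eq. 3.56) for all committed nodes.
        T = [np.minimum(a, w).sum() / (alpha + w.sum()) for w in W]
        for j in np.argsort(T)[::-1]:        # winner-take-all search (Eq. 3.58)
            # Vigilance test (Eq. 3.59); failing nodes are reset (skipped).
            if np.minimum(a, W[j]).sum() / a.sum() >= rho:
                # Resonance: apply the learning rule (Eq. 3.60).
                W[j] = beta * np.minimum(a, W[j]) + (1.0 - beta) * W[j]
                labels[n] = j
                break
        else:
            # No committed node passed vigilance: commit a new node,
            # playing the role of the previously uncommitted neuron.
            W.append(a.copy())
            labels[n] = len(W) - 1
    return labels, W

rng = np.random.default_rng(0)
X = rng.random((100, 2))                     # toy data scaled to [0, 1]
labels, W = fuzzy_art(X)
print(len(W), "clusters found")

Raising the vigilance rho toward 1 produces more, tighter clusters, while lowering it produces fewer, coarser ones, which is the main practical handle on cluster granularity in the ART family.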

3.4.2 Fuzzy ARTMAP

A Fuzzy ARTMAP network consists of two Fuzzy ART modules (ARTa and ARTb) interconnected via an inter-ART module, or map field module (Carpenter, Grossberg, & Reynolds, 1991; Carpenter, Grossberg, & Rosen, 1991). In the context of supervised classification, the input pattern is presented to the ARTa module, and the corresponding label is presented to the ARTb module. For gene expression data, the input patterns can be either genes or samples, depending on the interests of the users. The vigilance parameter of ARTb is set to 1, which leads to the representation of each label as a specific cluster. The information regarding the input–output associations is stored in the weights


w^{ab} of the inter-ART module. The jth row of the inter-ART weights, w_j^{ab}, denotes the weight vector from the jth neuron in ARTa to the map field. When the map field is activated, the output vector of the map field is

x^{ab} = y^b \wedge w_j^{ab},    (3.61)

where y^b is the binary output vector of field F2 in ARTb, and y_i^b = 1 only if the ith category wins in ARTb. Similar to the vigilance mechanism in ARTa, the map field also performs a vigilance test such that if

\rho_{ab} > \frac{|x^{ab}|}{|y^b|},    (3.62)

where \rho_{ab} (0 \le \rho_{ab} \le 1) is the map field vigilance parameter, a match tracking procedure is activated. In this case, the ARTa vigilance parameter \rho_a is increased from its baseline vigilance to a value just above the current match value. This procedure ensures the shutoff of the current winning neuron in ARTa, whose prediction does not comply with the label represented in ARTb. Another ARTa neuron is then selected, and the match-tracking mechanism again verifies whether it is appropriate. If no such neuron exists, a new ARTa category is created. Once the map field vigilance test criterion is satisfied, the weight w_J^{ab} for the neuron J in ARTa is updated by the following learning rule:

w_J^{ab}(\mathrm{new}) = \gamma (y^b \wedge w_J^{ab}(\mathrm{old})) + (1 - \gamma) w_J^{ab}(\mathrm{old}),    (3.63)

where \gamma \in [0, 1] is the learning rate parameter. Note that with fast learning (\gamma = 1), once neuron J learns to predict the ARTb category I, the association is permanent, i.e., w_{JI}^{ab} = 1 for all subsequent input pattern presentations. In a test phase, where only an input pattern is provided to ARTa without the corresponding label to ARTb, no match tracking occurs. The class prediction is obtained from the map field weights of the winning ARTa neuron. However, if that neuron is uncommitted, the input pattern cannot be classified solely on the basis of prior experience.
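A hypothetical helper illustrating the map field activation, vigilance test, and match tracking of Eqs. (3.61)–(3.63) is sketched below; the function and variable names are assumptions made for illustration, not part of any published ARTMAP implementation.

import numpy as np

def map_field_step(y_b, w_ab_J, rho_ab, rho_a_baseline, match_value, gamma=1.0):
    """Illustrative map-field logic for one training presentation.

    y_b         : binary ARTb output vector (one-hot label).
    w_ab_J      : map-field weight row of the winning ARTa neuron J.
    match_value : |A ^ w_J| / |A| for the current ARTa winner.
    Returns (updated map-field weights, new ARTa vigilance, reset_flag).
    """
    x_ab = np.minimum(y_b, w_ab_J)                 # map field output, Eq. (3.61)
    if x_ab.sum() / y_b.sum() < rho_ab:            # map-field vigilance test, Eq. (3.62)
        # Match tracking: raise the ARTa vigilance just above the current
        # match value (epsilon below is an illustrative small increment),
        # which forces a reset of the winning ARTa neuron.
        return w_ab_J, match_value + 1e-6, True
    # Eq. (3.63): with gamma = 1 (fast learning) the association is permanent.
    w_new = gamma * np.minimum(y_b, w_ab_J) + (1.0 - gamma) * w_ab_J
    return w_new, rho_a_baseline, False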

3.4.3 BARTMAP

BARTMAP is motivated by the basic theory of Fuzzy ARTMAP, but it focuses on clustering in both dimensions rather than on the supervised classification for which Fuzzy ARTMAP is generally used. Similar to Fuzzy ARTMAP, BARTMAP also consists of two Fuzzy ART modules communicating through the inter-ART module (see Fig. 6.3). However, the inputs to the ARTb module are genes (rows) instead of labels, as in most Fuzzy ARTMAP applications (Fuzzy ARTMAP itself is specified as learning maps between spaces of arbitrary dimension (Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992)). Correspondingly, the inputs to the ARTa module are samples. The inputs to the ARTa and ARTb modules can, however, be exchanged, and procedures like the ones described below can be used to identify relations between gene and sample clusters. The basic idea of BARTMAP is to integrate the clustering results on the dimensions of


columns and rows of the data matrix from Fuzzy ART to create biclusters that capture the local relations of genes and samples. BARTMAP belongs to the category of two-way clustering that is conceptually simpler than other types of biclustering algorithms (Carpenter & Gaddam, 2010). It is further elaborated on in Chapter six.

3.5 Summary

This chapter presents a review of classical and state-of-the-art clustering algorithms from the communities of computer science, computational intelligence, machine learning, and statistics, with a focus on their applications in biomedical research, particularly microarray gene expression data analysis, genomic sequence analysis, MRI data analysis, and biomedical document clustering. The integration of clustering theories and algorithms with biomedical research practices constitutes an important portion of the emerging, rapidly developing, and multidisciplinary field of bioinformatics. This integration will greatly benefit both fields and facilitate their advancement. The goal of this chapter, then, is to provide guidance for biomedical researchers in selecting the most appropriate model for their applications and to make integrating the two fields simpler.

Biomedical engineers rely heavily on several classical clustering technologies in their data analysis, such as standard agglomerative hierarchical clustering (single linkage, complete linkage, average linkage, etc.), SOFM, and standard K-means, owing to the good visualization offered by the first two methods and the linear complexity of the latter. The availability of software packages and the ease of implementation of the algorithms are other major factors contributing to their popularity. However, while these approaches have their merits, they also suffer from disadvantages that make them inappropriate for some biomedical applications. For example, the computational complexity of standard agglomerative hierarchical clustering is at least O(N^2), which makes it a poor choice for large-scale data clustering. Furthermore, its lack of robustness restricts its application in noisy environments. Alternately, applications of many state-of-the-art clustering algorithms in biomedical practice, for example, kernel learning-based clustering, nonlinear projection approaches, the ART family, and many clustering algorithms specifically designed for large-scale data clustering (BIRCH, CURE, DBSCAN, OPTICS, STING, etc.), are still rare, although biomedical researchers have gradually recognized their effectiveness. The reason could be the lack of effective guidance for algorithm implementation and parameter tuning, or simply the lack of good communication between the fields.

To help biomedical engineers gain a clearer understanding of the existing clustering algorithms and their pros and cons so that clustering can be better applied, this chapter summarizes, in the list below, properties that serve as important criteria for evaluating a clustering algorithm. The clustering algorithms are also classified with regard to these properties in Table 3.2. These properties also constitute the major challenges in clustering that must be addressed by the next generation of clustering technologies. Of course, detailed requirements for specific applications will affect these properties.

Table 3.2 Summary of major clustering algorithms.

Table 3.2 classifies the major clustering algorithms discussed in this chapter according to the criteria listed below: scalability (computational complexity), ability to detect irregular cluster shapes, suitability for high-dimensional data, robustness, reliance on a user-specified number of clusters K, overall parameter reliance, order dependency, quality of visualization, and supported data types (continuous, categorical, or both). The algorithms covered include classical agglomerative hierarchical clustering, BIRCH, CURE, ROCK, K-means, online K-means, PAM, ISODATA, K-modes, CLARA, CLARANS, FCM, PCM, PSO-based and GA-based clustering, the ART family, RPCL, GLVQ, FLVQ, SOFM, SVC, kernel K-means, KFCM, DBSCAN, OPTICS, DBCLASD, DENCLUE, STING, WaveCluster, FC, CLIQUE, MAFIA, ORCLUS, and OptiGrid. The scalability entries range from at least O(N^2) for classical agglomerative hierarchical clustering to O(N) for BIRCH and roughly O(N log N) for density-based methods such as DBSCAN and OPTICS, while the ART family achieves O(N log N), or O(N) for its one-pass variant.

a The parameter for the number of clusters is considered separately. When the number of parameters is greater than 2, the algorithm is categorized as "Many"; otherwise, it is considered "Fair." If there are no user-dependent parameters, the word "Free" is used.
b Ns denotes the number of samples; CB denotes the number of cells in the bottom layer.
c mm is the maximum number of neighbors for an object, and ma is the average number of neighbors.
d The properties of PSO-based and GA-based clustering algorithms can vary based on the encoding strategy and the definition of the fitness function.
e Weight vectors corresponding to an overestimated number of clusters will be driven away from data-dense areas.
f d' is the highest dimensionality of any dense unit in the data, c is a constant, B is the number of records that fit in the memory buffer, and g is the I/O access time for a block of B records from the disk.
g K0 is the number of initial seeds.


(1) Scalability: Scalability, in terms of both running time and memory requirements, is the top priority for a clustering algorithm because of the ever-increasing data from all different types of biomedical applications. Typical examples include the requirement of processing millions of pieces of biomedical data and handling genomic sequences growing at a rate of up to tens of billions of bases per year. It is difficult to imagine a clustering algorithm demonstrating practical utility if it takes several months to complete a run on a million-record database, or if it has an extraordinarily strict need for memory that cannot be satisfied by regular machines, even if it is perfect in terms of all the other evaluation criteria listed in this section. Linear, or near-linear, complexity is therefore highly desirable and is available for some of the algorithms described here.

(2) Dimensionality: The algorithm should be capable of handling data with a large number of features, possibly even larger than the number of objects in the dataset. This property is particularly pressing in gene expression data analysis and image analysis. The identification of relevant features, or the capture of the intrinsic dimension, is important for describing the real data structure.

(3) Robustness: It is unrealistic to count on the availability of pure and uncontaminated data. Noise and outliers are inevitably introduced during the different stages of measurement, storage, and processing of data. The clustering algorithm should be able to detect and remove possible outliers and noise.

(4) Cluster number: Correctly identifying the number of clusters is the most fundamental problem in cluster analysis, and it continues to attract attention because many current algorithms require the number of clusters as a user-specified parameter, which is difficult to set without prior knowledge. An incorrect estimate of the number of clusters will prevent investigators from learning the real clustering structure; underestimation, for example, could cause cancer subtypes to be missed. Thus, if this information is not known or reliably estimable a priori, the number of clusters should be determined by the algorithm itself based on the data properties.

(5) Parameter reliance: It is not rare for a current clustering algorithm to require users to specify three or more parameters to which the algorithm is sensitive. Thus, it is important to decrease the reliance of algorithms on user-dependent parameters, or at least to provide effective guidance for determining these parameters rather than leaving users to guess over a wide range.

(6) Arbitrary cluster shapes: Biomedical data are not always formed into clusters with regular shapes, such as hyperspheres or hyperrectangles. Clustering algorithms should have the ability to detect irregular cluster shapes rather than being confined to some particular shape.

(7) Order insensitivity: This property is particularly important for incremental or online clustering algorithms, as their clustering solutions may vary with different orders of presentation of the input patterns, making the results questionable. Decreasing, or becoming completely immune to, the effects of the order of input patterns is a challenging problem in incremental learning.


(8) Good visualization and interpretation: Good visualization will help users understand the clustering results, which should be interpretable in the problem domain, as well as extract useful information from the data to solve their problems.

(9) Mixed data types: Data might be obtained from different sources, including features in continuous or categorical form. Thus, clustering algorithms should be flexible in admitting different data types or be easily adaptable to other data types.

In conclusion, clustering has become an important tool for biomedical researchers to make use of the data that society, and the biomedical field in particular, increasingly generates. Collaboration between biomedical engineers and computer scientists, statisticians, and computational intelligence researchers is encouraged, not only to improve effectiveness in solving biomedical problems, but also to provide useful insights into developing new clustering algorithms that meet the requirements of biomedical practice.

References Abraham, A., Das, S., & Konar, A. (2007). Kernel based automatic clustering using modified particle swarm optimization algorithm. In Proceedings of the 9th annual conference on Genetic and evolutionary computation - GECCO’07 (p. 2). New York, New York, USA: ACM Press. Adamatzky, A., & Holland, O. (1998). Voronoi-like nondeterministic partition of a lattice by collectives of finite automata. Mathematical and Computer Modelling, 28(10), 73e93. https://doi.org/10.1016/ S0895-7177(98)00156-3. Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms for projected clustering. ACM SIGMOD Record, 28(2), 61e72. https://doi.org/10.1145/304181.304188. Aggarwal, C. C., & Yu, P. S. (2002). Redefining clustering for high-dimensional applications. IEEE Transactions on Knowledge and Data Engineering, 14(2), 210e225. https://doi.org/10.1109/69. 991713. Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data - SIGMOD’98 (pp. 94e105). New York, New York, USA: ACM Press. Al-Jabery, K. K., Obafemi-Ajayi, T., Olbricht, G. R., Takahashi, T. N., Kanne, S., & Wunsch, D. C. I. (2016). Ensemble statistical and subspace clustering model for analysis of autism spectrum disorder phenotypes. In Conf Proc IEEE Eng Med Biol Soc, 2016 (pp. 3329e3333). Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology. http://doi/10.1016/S0022-2836(05)80360-2. Altschul, S. F., Madden, T. L., Scha¨ffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research. http:/ doi.org/10.1093/nar/25.17.3389. Anderberg, M. R. (1973). Cluster Analysis for applications (1st ed.). Elsevier. Andra´s, P. (2002). Kernel-Kohonen networks. International Journal of Neural Systems, 12(2), 117e135. https://doi.org/10.1142/S0129065702001084. Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). Optics: Ordering points to identify the clustering structure. ACM Sigmod Record. https://doi.org/10.1145/304182.304187. Arthur, D., & Vassilvitskii, S. (2007). k-meansþþ: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. pp. 1027e1025.


Asharaf, S., Shevade, S. K., & Murty, M. N. (2005). Rough support vector clustering. Pattern Recognition, 38(10), 1779e1783. https://doi.org/10.1016/j.patcog.2004.12.016. Bagnall, Janacek, Iglesia, D. L., & Zhang. (2003). Clustering time series from mixture polynomial models with discretised data. In Proceedings of the second Australasian data mining workshop. Baldi, P., & Brunak, S. (2001). Bioinformatics - the machine learning approach. Machine Learning. Ball, G. H., & Hall, D. J. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12(2), 153e155. https://doi.org/10.1002/bs.3830120210. Barbara´, D., & Chen, P. (2000). Using the fractal dimension to cluster datasets. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD’00 (pp. 260e264). New York, New York, USA: ACM Press. Bellman, R. E. (1961). Adaptive control processes: A guided tour. PrincetonUniversity Press. Ben-Dor, A., Chor, B., Karp, R., & Yakhini, Z. (2003). Discovering local structure in gene expression data: The order-preserving submatrix problem. Journal of Computational Biology : A Journal of Computational Molecular Cell Biology. https://doi.org/10.1089/10665270360688075. Ben-Hur, A. (2008). Support vector clustering. Scholarpedia, 3(6), 5187. https://doi.org/10.4249/ scholarpedia.5187. Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers and Geosciences. https://doi.org/10.1016/0098-3004(84)90020-7. Bezdek, J. C., Hall, L. O., & Clarke, L. P. (1993). Review of MR image segmentation techniques using pattern recognition. Medical Physics, 20(4), 1033e1048. https://doi.org/10.1118/1.597000. Boulet, R., Jouve, B., Rossi, F., & Villa, N. (2008). Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing. https://doi.org/10.1016/j.neucom.2007.12.026. Bradley, P. P. S., Fayyad, U., & Reina, C. (1998). Scaling clustering algorithms to large databases. Knowledge Discovery and Data Mining. https://doi.org/10.1109/78.127962. Bragg, L. M., & Stone, G. (2009). k-link EST clustering: Evaluating error introduced by chimeric sequences under different degrees of linkage. Bioinformatics. https://doi.org/10.1093/bioinformatics/btp410. Brannon, N. G., Seiffertt, J. E., Draelos, T. J., & Wunsch, D. C., II (2009). Coordinated machine learning and decision support for situation awareness. Neural Networks, 22(3), 316e325. https://doi.org/10. 1016/j.neunet.2009.03.013. Burke, J., Davison, D., & Hide, W. (1999). d2_cluster: a validated method for clustering EST and fulllength cDNA sequences. Genome Research. https://doi.org/10.1101/gr.9.11.1135. Busygin, S., Jacobsen, G., Kramer, E., Kra¨mer, E., & Ag, C. (2002). Double conjugated clustering applied to leukemia microarray data. In Proceedings of the 2nd SIAM ICDM, Workshop on clustering high dimensional data. Busygin, S., Prokopyev, O., & Pardalos, P. M. (2008). Biclustering in data mining. Computers and Operations Research. https://doi.org/10.1016/j.cor.2007.01.005.  ski, T., & Harabasz, J. (1974). A dendrite method for cluster Analysis. Communications in Calin Statistics-theory and Methods, 3(1), 1e27. Camastra, F. (2003). Data dimensionality estimation methods: A survey. Pattern Recognition. https://doi. org/10.1016/S0031-3203(03)00176-6. Camastra, F., & Verri, A. (2005). A novel Kernel method for clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 801e805. https://doi.org/10.1109/TPAMI.2005.88. 
Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. CVGIP, 37, 54–115.


Carpenter, G. A., & Gaddam, S. C. (2010). Biased ART: A neural architecture that shifts attention toward previously disregarded features following an incorrect prediction. Neural Networks. https://doi.org/ 10.1016/j.neunet.2009.07.025. Carpenter, G. A., & Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23), 4919. https://doi.org/10.1364/AO.26.004919. Carpenter, G. A., & Grossberg, S. (1991). A massively parallel architecture for a self-organizing neural pattern recognition machine. In Pattern recognition by self-organizing neural networks (pp. 316e382). Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks. https://doi.org/10.1109/72.159059. Carpenter, G., Grossberg, S., & Reynolds, J. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. IEEE Conference on Neural Networks for Ocean Engineering, 4, 565e588. https://doi.org/10.1109/ICNN.1991.163370. Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4(6), 759e771. https://doi.org/ 10.1016/0893-6080(91)90056-B. Cerf, L., Besson, J., Robardet, C., & Boulicaut, J.-F. (2008). Data-Peeler: Constraint-Based closed pattern mining in n-ary relations. SDM37e48. Cerf, L., Besson, J., Robardet, C., & Boulicaut, J.-F. (2009). Closed patterns meet n -ary relations. ACM Transactions on Knowledge Discovery from Data, 3(1), 1e36. https://doi.org/10.1145/1497577. 1497580. Cheng, Y., & Church, G. M. (2000). Biclustering of expression data. In Proceedings/. International conference on intelligent systems for molecular Biology ; ISMB. International conference on intelligent systems for molecular biology. Cheng, C.-H., Fu, A. W., & Zhang, Y. (1999). Entropy-based subspace clustering for mining numerical data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD’99. Chen, S., & Zhang, D. (2004). Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics). https://doi.org/10.1109/TSMCB.2004.831165. Chiang, J. H., & Hao, P. Y. (2003). A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing. IEEE Transactions on Fuzzy Systems, 11(4), 518e527. https://doi.org/10. 1109/TFUZZ.2003.814839. Chiu, T., Fang, D., Chen, J., Wang, Y., & Jeris, C. (2001). A robust and scalable clustering algorithm for mixed type attributes in large database environment. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD’01. Coomans, D., & Massart, D. L. (1981). Potential methods in pattern recognition: Part 2. Clupotdan unsupervised pattern recognition technique. Analytica Chimica Acta, 133(3), 225e239. Davies, D. L., & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.1979.4766909. Di Gesu´, V., Giancarlo, R., Lo Bosco, G., Raimondi, A., & Scaturro, D. (2005). GenClust: A genetic algorithm for clustering gene expression data. BMC Bioinformatics, 6. 
https://doi.org/10.1186/14712105-6-289. Dorigo, M., Birattari, M., & Stutzle, T. (2006). Ant colony optimization. IEEE Computational Intelligence Magazine, 1(4), 28e39. https://doi.org/10.1109/MCI.2006.329691.


Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: John Wiley. Section. Duffy, D. E., & Quiroz, A. J. (1991). A permutation-based algorithm for block clustering. Journal of Classification. https://doi.org/10.1007/BF02616248. Durbin, R., Eddy, S., Krogh, a, & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Analysis. https://doi.org/10.1017/CBO9780511790492. Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact wellseparated clusters. Journal of Cybernetics, 3(3), 32e57. https://doi.org/10.1080/01969727308546046. Durrant, R. J., & Kaba´n, A. (2009). When is “nearest neighbour” meaningful: A converse theorem and implications. Journal of Complexity. https://doi.org/10.1016/j.jco.2009.02.011. Du, Z., Wang, Y., & Ji, Z. (2008). PK-means: A new algorithm for gene clustering. Computational Biology and Chemistry, 32(4), 243e247. https://doi.org/10.1016/j.compbiolchem.2008.03.020. Eddaly, M., Jarboui, B., & Siarry, P. (2016). Combinatorial particle swarm optimization for solving blocking flowshop scheduling problem. Journal of Computational Design and Engineering, 3(4), 295e311. https://doi.org/10.1016/j.jcde.2016.05.001. Edwards, A. W. F., & Cavalli-Sforza, L. L. (1965). A method for cluster analysis. Biometrics, 21(2), 362. https://doi.org/10.2307/2528096. Enright, A. J., & Ouzounis, C. A. (2000). GeneRAGE: A robust algorithm for sequence clustering and domain detection. Bioinformatics. https://doi.org/10.1093/bioinformatics/16.5.451. Ester, M., Kriegel, H. P., Kro¨ger, P., Sander, J., & Zimek, A. (2011). Density-based clustering. Data Mining and Knowledge Discovery, 1(3), 231e240. https://doi.org/10.1002/widm.30. Ester, M., Kriegel, H., Sander, J., Wimmer, M., & Xu, X. (1998). Incremental clustering for mining in a data warehousing environment. Data Base323e333. https://doi.org/10.1002/widm.30. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 96(34), 226e231. https://doi.org/10.1.1.71.1980. Farnstrom, F., Lewis, J., & Elkan, C. (2000). Scalability for clustering algorithms revisited. ACM SIGKDD Explorations Newsletter, 2(1), 51e57. https://doi.org/10.1145/360402.360419. Fayech, S., Essoussi, N., & Limam, M. (2009). Partitioning clustering algorithms for protein sequence data sets. BioData Mining. https://doi.org/10.1186/1756-0381-2-3. Fernandez, E. A., & Balzarini, M. (2007). Improving cluster visualization in self-organizing maps: Application in gene expression data analysis. Computers in Biology and Medicine, 37(12), 1677e1689. https://doi.org/10.1016/j.compbiomed.2007.04.003. Fogel, D. B. (2005). Evolutionary computation: Toward a new philosophy of machine intelligence. Evolutionary computation: Toward a new philosophy of machine intelligence. Forgy, E. W. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics. Frohlich, H., Speer, N., Spieth, C., & Zell, A. (2006). Kernel based functional gene grouping. In The 2006 IEEE international joint conference on neural network proceedings. Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20(3e4), 121e136. https://doi.org/10.1007/BF00342633. Gaffney, S., & Smyth, P. (1999). 
Trajectory clustering with mixtures of regression models. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining - KDD’99. Gene Ontology Consortium. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research. https://doi.org/10.1093/nar/gkh036.


Georgii, E., Tsuda, K., & Scho¨lkopf, B. (2011). Multi-way set enumeration in weight tensors. Machine Learning, 82(2), 123e155. https://doi.org/10.1007/s10994-010-5210-y. Getz, G., Gal, H., Kela, I., Notterman, D. A., & Domany, E. (2003). Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data. Bioinformatics. https://doi.org/10.1093/ bioinformatics/btf876. Ghouila, A., Yahia, S. B., Malouche, D., Jmel, H., Laouini, D., Guerfali, F. Z., et al. (2009). Application of Multi-SOM clustering approach to macrophage gene expression analysis. Infection, Genetics and Evolution, 9(3), 328e336. https://doi.org/10.1016/j.meegid.2008.09.009. Girolami, M. (2002). Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3), 780e784. https://doi.org/10.1109/TNN.2002.1000150. Goutte, C., Toft, P., Rostrup, E., Nielsen, F., & Hansen, L. K. (1999). On clustering fMRI time series. NeuroImage. https://doi.org/10.1006/nimg.1998.0391. Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review, 87(1), 1e51. https:// doi.org/10.1037/0033-295X.87.1.1. Guan, X., & Du, L. (1998). Domain identification by clustering sequence alignments. Bioinformatics. https://doi.org/10.1093/bioinformatics/14.9.783. Guha, S., Meyerson, A., Mishra, N., Motwani, R., & OCallaghan, L. (2003). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE. 2003.1198387. Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD international conference on management of data - SIGMOD’98. Guha, S., Rastogi, R., & Shim, K. (2000). Rock: A robust clustering algorithm for categorical attributes. Information Systems. https://doi.org/10.1016/S0306-4379(00)00022-3. Guralnik, V., & Karypis, G. (2001). A scalable algorithm for clustering sequential data. In IEEE international conference on data mining (pp. 179e186). IEEE. Gusfield, D. (1997). Algorithms on strings, trees, and sequences: Computer science and computational biology. Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.1972.10481214. Hathaway, R. J., & Bezdek, J. C. (2001). Fuzzy c-means clustering of incomplete data. In IEEE Transactions on systems, man, and cybernetics. Part B, cybernetics : A publication of the IEEE systems, man, and cybernetics society (Vol. 31, pp. 735e744), 5. Haykin, S. (2009). Neural networks and learning machines (3rd ed.) arXiv preprint. Hazelhurst, S., Hide, W., Lipta´k, Z., Nogueira, R., & Starfield, R. (2008). An overview of the wcd EST clustering tool. Bioinformatics. https://doi.org/10.1093/bioinformatics/btn203. He, Y., & Hui, S. C. (2009). Exploring ant-based algorithms for gene expression data analysis. Artificial Intelligence in Medicine, 47(2), 105e119. https://doi.org/10.1016/j.artmed.2009.03.004. Heger, A., & Holm, L. (2001). Towards a covering set of protein family profiles. Bioinformatics, 17(3), 272e279. https://doi.org/10.1093/bioinformatics/17.3.272. Hinneburg, A., & Keim, D. A. (1998). DENCLUE : An efficient approach to clustering in large multimedia databases with noise. In Proceedings of 4th international conference on knowledge discovery and data mining (KDD-98), (c) (pp. 58e65). Association for the Advancement of Artificial Intelligence (AAAI). Hinneburg, A., & Keim, D. A. (1999). 
Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In International conference on very large databases (VLDB). Hogg, C. J., Vickers, E. R., & RogLers, T. L. (2005). Determination of testosterone in saliva and blow of bottlenose dolphins (Tursiops truncatus) using liquid chromatography-mass spectrometry. Journal


of Chromatography B: Analytical Technologies in the Biomedical and Life Sciences. https://doi.org/10. 1016/j.jchromb.2004.10.058. Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: Ann Arbor MI University of Michigan Press, 183. Holm, L., & Sander, C. (1998). Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. https://doi.org/10.1093/bioinformatics/14.5.423. Hruschka, E., & Ebecken, N. (2003). A genetic algorithm for cluster analysis. Intelligent Data Analysis, 7(1), 15e25. Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283e304. https://doi.org/10.1023/A: 1009769707641. Hubbard, T. J. P., Ailey, B., Brenner, S. E., Murzin, A. G., & Chothia, C. (1999). SCOP: A structural classification of proteins database. Nucleic Acids Research. https://doi.org/10.1093/nar/27.1.254. Hughey, R., & Krogh, A. (1996). Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Bioinformatics. https://doi.org/10.1093/bioinformatics/12.2.95. Huidong, J., Man-Leung, W., & Leung, K. S. (2005). Scalable model-based clustering for large databases based on data summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(11), 1710e1719. Ilango, M., & Mohan, V. (2010). A survey of grid based clustering algorithms. International Journal of Engineering Science and Technology. Ja¨schke, R., Hotho, A., Schmitz, C., Ganter, B., & Stumme, G. (2006). TRIAS - an algorithm for mining iceberg tri-lattices. In Proceedings - IEEE international conference on data mining, ICDM (pp. 907e911). Jenssen, R., & Eltoft, T. (2007). An information theoretic perspective to kernel K-means. In Proceedings of the 2006 16th IEEE signal processing society workshop on machine learning for signal processing, MLSP 2006 (pp. 161e166). Jenssen, R., Erdogmus, D., Hild, K. E., Principe, J. C., & Eltoft, T. (2007). Information cut for clustering using a gradient descent approach. Pattern Recognition, 40(3), 796e806. https://doi.org/10.1016/j. patcog.2006.06.028. Jenssen, R., Erdogmus, D., Principe, J., & Eltoft, T. (2005). The Laplacian PDF distance: A cost function for clustering in a kernel feature space. Advances in Neural Information Processing Systems, 17, 625e632. Jiang, D., Pei, J., Ramanathan, M., Tang, C., & Zhang, A. (2004). Mining coherent gene clusters from genesample-time microarray data. In Proceedings of the 2004 ACMSIGKDD international conference on Knowledge discovery and data mining - KDD’04 (p. 430). Ji, L., Tan, K., & Tung, A. K. H. (2006). Mining frequent closed cubes in 3D datasets. VLDB. Kailing, K., & Kriegel, H. (2004). Density-connected subspace clustering for high-dimensional data. In Society for industrial and applied mathematics (pp. 246e256). Kaplan, N., Sasson, O., Inbar, U., Friedlinch, M., Fromer, M., Fleischer, H., et al. (2005). ProtoNet 4.0: A hierarchical classification of one million protein sequences. Nucleic Acids Research. https://doi.org/ 10.1093/nar/gki007. Karayiannis, N. B. (1997). A methodology for constructing fuzzy algorithms for learning vector quantization. IEEE Transactions on Neural Networks, 8(3), 505e518. https://doi.org/10.1109/72.572091. Karayiannis, N. B. (2000). Soft learning vector quantization and clustering algorithms based on ordered weighted aggregation operators. IEEE Transactions on Neural Networks, 11(5), 1093e1105. https:// doi.org/10.1109/72.870042.


Karayiannis, N. B., & Bezdek, J. C. (1997). An integrated approach to fuzzy learning vector quantization and fuzzy c-means clustering. IEEE Transactions on Fuzzy Systems. https://doi.org/10.1109/91. 649915. Karayiannis, N. B., & Pai, P. I. (1996). Fuzzy algorithms for learning vector quantization. IEEE Transactions on Neural Networks, 7(5), 1196e1211. https://doi.org/10.1109/72.536314. Kasif, S. (1999). Datascope : Mining biological. IEEE Intelligent Systems, 6, 38e43. Kaufman, L., & Rousseeuw, P. J. (1990). Agglomerative nesting (program AGNES).Finding groups in data. Kaufman, L., & Rousseeuw, P. J. (2008). Clustering large applications (program CLARA). Finding Groups in Data: An Introduction to Cluster Analysis, 126e146. Kennedy, J., Eberhart, R. C., & Shi, Y. (2001). Swarm intelligence. Scholarpedia, 2(9), 1462. https://doi. org/10.4249/scholarpedia.1462. Kent, W. J., & Zahler, A. M. (2000). Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C.elegans genomic alignment. Genome Research. https://doi.org/10.1101/gr.10.8.1115. Kim, N., Shin, S., & Lee, S. (2005). ECgene : Genome-based EST clustering and gene modeling for alternative splicing ECgene : Genome-based EST clustering and gene modeling for alternative splicing (pp. 566e576). Kohonen, T. (1989). A self learning musical grammar, or “Associative memory of the second kind”. International Joint Conference on Neural Networks, 1, 1e5. https://doi.org/10.1109/IJCNN.1989. 118552. Kohonen, T. (2001). The self-organizing map. Self-Organizing Maps. Kollios, G., Gunopulos, D., Koudas, N., & Berchtold, S. (2003). Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2003.1232271. Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1(2), 98e110. https://doi.org/10.1109/91.227387. Krogh, A., Brown, M., Mian, I. S., Sjo¨lander, K., & Haussler, D. (1994). Hidden Markov models in computational biology. Applications to protein modeling. Journal of Molecular Biology. https://doi. org/10.1006/jmbi.1994.1104. Kaufman, L., & Rousseeuw, P. J. (1987). Clustering by means of medoids. In Y. Dodge (Ed.), Statistical data analysis based on the L1 norm and related methods (pp. 405e416). Lam, D., Wei, M., & Wunsch, D. (2015). Clustering data of mixed categorical and numerical type withunsupervised feature learning. IEEE Access, 3, 1605e1616. https://doi.org/10.1109/ACCESS.2015. 2477216. Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems. The Computer Journal, 9(4), 373e380. https://doi.org/10.1093/comjnl/9.4.373. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., et al. (2001). Initial sequencing and analysis of the human genome. Nature. https://doi.org/10.1038/35057062. Lange, O., Meyer-Baese, A., Hurdal, M., & Foo, S. (2006). A comparison between neural and fuzzy cluster analysis techniques for functional MRI. Biomedical Signal Processing and Control, 1(3), 243e252. https://doi.org/10.1016/j.bspc.2006.11.002. Lazzeroni, L., & Owen, A. (2002). Plaid models for gene expression data. Statistica Sinica. https://doi.org/ 10.1017/CBO9781107415324.004. Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. https://doi.org/10.1093/bioinformatics/17.3.282.

Chapter 3  Clustering algorithms

95

Li, W., Jaroszewski, L., & Godzik, A. (2002). Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18(1), 77e82. https://doi.org/10.1093/ bioinformatics/18.1.77. Linde, Y., Buzo, A., & Gray, R. M. (1980). An algorithm for vector quantization. IEEE Trans. Communication, 28(1), 84e95. Liu, C. L. (1968). Introduction to combinatorial mathmatics. New Jersy, NJ: McGraw Hill. Liu, J., Li, Z., Hu, X., & Chen, Y. (2009). Biclustering of microarray data with MOSPO based on crowding distance. BMC Bioinformatics. https://doi.org/10.1186/1471-2105-10-S1-S9. Liu, J., & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics. https://doi.org/ 10.1093/bioinformatics/18.7.922. Liu, J., & Xu, M. (2008). Kernelized fuzzy attribute C-means clustering algorithm. Fuzzy Sets and Systems, 159(18), 2428e2445. https://doi.org/10.1016/j.fss.2008.03.018. Liu, Z., Dechang, C., Bensmail, H., & Ying, X. (2005). Clustering gene expression data with kernel principal components. Journal of Bioinformatics and Computational Biology, 3(2), 303e316. Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129e137. https://doi.org/10.1109/TIT.1982.1056489. Ma, P. C. H., Chan, K. C. C., & Chiu, D. (2006). An evolutionary clustering algorithm for gene expression microarray data analysis. IEEE Transactions on Evolutionary Computation, 10(3), 296e314. https:// doi.org/10.1109/TEVC.2005.859371. Madeira, S. C., & Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE/ ACM Transactions on Computational Biology and Bioinformatics. https://doi.org/10.1109/TCBB. 2004.2. Madeira, S. C., Teixeira, M. C., Sa´-Correia, I., & Oliveira, A. L. (2010). Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics. https://doi.org/10.1109/TCBB.2008.34. Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14(2), 85e100. https://doi.org/10.1007/BF00288907. Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., & Eisenberg, D. (1999). A combined algorithm for genome-wide prediction of protein function. Nature. https://doi.org/10.1038/47048. Martinetz, T. M., Berkovich, S. G., & Schulten, K. J. (1993). “Neural-Gas”network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4), 558e569. https://doi.org/10.1109/72.238311. Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33(9), 1455e1465. https://doi.org/10.1016/S0031-3203(99)00137-5. van der Merwe, D. W., & Engelbrecht, A. P. (2003). Data clustering using particle swarm optimization. In The 2003 congress on evolutionary computation, 2003 (pp. 215e220). CEC’03. Miller, C., Gurd, J., & Brass, A. (1999). A RAPID algorithm for sequence database comparisons: Application to the identification of vector contamination in the EMBL databases. Bioinformatics, 15(2), 111e121. https://doi.org/10.1093/bioinformatics/15.2.111. Miller, R. T., Christoffels, A. G., Gopalakrishnan, C., Burke, J., Ptitsyn, A. A., Broveak, T. R., et al. (1999). A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base. Genome Research. https://doi.org/10.1101/gr.9.11.1143. Moore, B. (1988). 
Art I and pattern clustering algorithms. Neural Networks, 1(1 Suppl.), 116. https://doi. org/10.1016/0893-6080(88)90155-4. Morgenstern, B., Frech, K., Dress, A., & Werner, T. (1998). DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics. https://doi.org/10.1093/bioinformatics/14.3.290.

96 Computational Learning Approaches to Data Analytics in Biomedical Applications

Mulder, S. A., & Wunsch, D. C. (2003). Million city traveling salesman problem solution by divide and conquer clustering with adaptive resonance neural networks. Neural Networks. https://doi.org/10. 1016/S0893-6080(03)00130-8. Mu¨ller, K. R., Mika, S., Ra¨tsch, G., Tsuda, K., & Scho¨lkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks. https://doi.org/10.1109/72.914517. Murali, T. M., & Kasif, S. (2003). Extracting conserved gene expression motifs from gene expression data. Pacific Symposium on Biocomputing. https://doi.org/10.1142/9789812776303_0008. Nagesh, H., Goil, S., & Choudhary, A. (2001). Adaptive grids for clustering massive data sets. In Proceedings of the 1 st SIAM ICDM Chicago IL (Vol. 477, pp. 1e17). Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. https://doi.org/10.1016/ 0022-2836(70)90057-4. Ng, E. K. K., Fu, A. W. C., & Wong, R. C. W. (2005). Projective clustering by histograms. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2005.47. Ng, R. T., & Han, J. (2002). Clarans: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2002.1033770. NIH. (2019). Federal interagency traumatic brain injury research (FITBIR) Informatics system. Retrieved from: https://fitbir.nih.gov/. Omran, M. G. H., Salman, A., & Engelbrecht, A. P. (2006). Dynamic clustering using particle swarm optimization with application in image segmentation. Pattern Analysis and Applications, 8(4), 332e344. https://doi.org/10.1007/s10044-005-0015-5. Pal, N. R., Tsao, E. C. K., & Bezdek, J. C. (1993). Generalized clustering networks and Kohonen’s selforganizing scheme. IEEE Transactions on Neural Networks, 4(4), 549e557. https://doi.org/10.1109/ 72.238310. Papadimitriou, S., & Likothanassis, S. D. (2004). Kernel-based self-organized maps trained with supervised bias for gene expression data analysis. Journal of Bioinformatics and Computational Biology. https://doi.org/10.1007/s10270-003-0031-0. Park, N. H., & Lee, W. S. (2007). Grid-based subspace clustering over data streams. In Proceedings of the sixteenth ACMconference on conference on information and knowledge management - CIKM’07, 801. Pearson, W. R., & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America. https://doi.org/10.1073/pnas.85. 8.2444. Pela´iz-Barranco, A., Gutie´rrez-Amador, M. P., Huanosta, A., & Valenzuela, R. (1998). Phase transitions in ferrimagnetic and ferroelectric ceramics by ac measurements. Applied Physics Letters. https://doi. org/10.1063/1.122360. Pham, D. L., Xu, C., & Prince, J. L. (2000). Current methods in medical image segmentation. Annual Review of Biomedical Engineering, 2(1), 315e337. https://doi.org/10.1146/annurev.bioeng.2.1.315. Pham, D. T., Dimov, S. S., & Nguyen, C. D. (2004). An incremental K-means algorithm. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science. https://doi. org/10.1243/0954406041319509. Procopiuc, C. M., Jones, M., Agarwal, P. K., & Murali, T. M. (2002). A Monte Carlo algorithm for fast projective clustering. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data - SIGMOD’02. Qin, J., Lewis, D. P., & Noble, W. S. (2003). 
Kernel hierarchical gene clustering from microarray expression data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btg288. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. https://doi.org/10.1109/5.18626.

Chapter 3  Clustering algorithms

97

Ramoni, M., Sebastiani, P., & Cohen, P. (2002). Bayesian clustering by dynamics. Machine Learning. https://doi.org/10.1023/A:1013635829250. Reeves, C. R. (2001). Genetic algorithms and grouping problems. IEEE Transactions on Evolutionary Computation. https://doi.org/10.1109/TEVC.2001.930319. Renyi, A. (1961). On measures of entropy and information. In Fourth Berkeley symposium on mathematical statistics and probability. Road, H., & Jose, S. (1998). Automatic subspace clustering mining of high dimensional applications for data. In Proceedings of the 1998 ACMSIGMOD international conference on Management of data (Vol. 27, pp. 94e105). Robert, J., Torbjorn, E., Deniz, E., & Jose, P. C. (2006). Some equivalences between kernel methods and information theoretic methods. Journal of VLSI Signal Processing Systems, 45(1e2), 49e65. https:// doi.org/10.1007/s11265-006-9771-8. Rose, K., Gurewitz, E., & Fox, G. C. (1992). Vector quantization by deterministic annealing. IEEE Transactions on Information Theory, 38(4), 1249e1257. https://doi.org/10.1109/18.144705. Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. https://doi.org/10.1016/0377-0427(87) 90125-7. Ruan, S., Moretti, B., Fadili, J., & Bloyet, D. (2001). Segmentation of magnetic resonance images using fuzzy Markov random fields. In IEEE international conference on image processing (Vol. 3). Rumelhart, D. E., & Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9(1), 75e112. https://doi.org/10.1016/S0364-0213(85)80010-0. Sarle, W. S., Jain, A. K., & Dubes, R. C. (1990). Algorithms for clustering data. Technometrics, 32(2), 227. https://doi.org/10.2307/1268876. Sasson, O., Linial, N., & Linial, M. (2002). The metric space of proteins-comparative study of clustering algorithms. Bioinformatics, 18(suppl_1), S14eS21. Sato, A., & Yamada, K. (1995). Generalized learning vector quantization. In Advances in neural information processing systems 8, NIPS, Denver, CO, November 27-30, 1995. Scho¨lkopf, B., Smola, A., & Mu¨ller, K. R. (1998). Nonlinear component analysis as a kernel Eigen value problem. Neural Computation, 10(5), 1299e1319. https://doi.org/10.1162/089976698300017467. Scho¨lkopf, B., Mika, S., Burges, C. J., Knirsch, P., Mu¨ller, K. R., Ra¨tsch, G., & Smola, A. J. (1999). Input space versus feature space in kernel-based methods. IEEE transactions on neural networks, 10(5), 1000e1017. Segal, E., Taskar, B., Gasch, A., Friedman, N., & Koller, D. (2001). Rich probabilistic models for gene expression. Bioinformatics. https://doi.org/10.1093/bioinformatics/17.suppl_1.S243. Sequeira, K., & Zaki, M. (2004). Schism: A new approach for interesting subspace mining. In Proceedings fourth IEEE international conference on data mining, ICDM 2004 (pp. 186e193). Shaik, J., & Yeasin, M. (2009). Fuzzy-adaptive-subspace-iteration-based two-way clustering of microarray data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. https://doi.org/10.1109/ TCBB.2008.15. Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). Wavecluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the international conference on very large data bases (pp. 428e439). Sheng, W., & Liu, X. (2006). A genetic k-medoids clustering algorithm. Journal of Heuristics, 12(6), 447e466. https://doi.org/10.1007/s10732-006-7284-z. Sim, K., Gopalkrishnan, V., Zimek, A., & Cong, G. (2013). 
A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery, 26(2), 332e397. https://doi.org/10.1007/s10618-012-0258-x.

98 Computational Learning Approaches to Data Analytics in Biomedical Applications

Sim, K., Liu, G., Gopalkrishnan, V., & Li, J. (2011). A case study on financial ratios via cross-graph quasibicliques. Information Sciences, 181(1), 201e216. https://doi.org/10.1016/j.ins.2010.08.035. Siriteerakul, T., & Boonjing, V. (2013). Support Vector Machine accuracy improvement with k-means clustering. International Conference on Computer Science and Engineering, 2013(ICSEC2013), 218e221. https://doi.org/10.1109/ICSEC.2013.6694782. Smith, T., & Waterman, M. (1985). New stratigraphic correlation techniques. The Journal of Geology, 32(3), 404e409. Smyth, P. (1997). Clustering sequences with hidden Markov models. In Advances in neural information processing systems. Smyth, P. (1999). Probabilistic model-based clustering of multivariate and sequential data. In Proceedings of artificial intelligence and statistics. Somervuo, P., & Kohonen, T. (2000). Clustering and visualization of large protein sequence databases by means of an extension of the self-organizing maptle. In International conference on discovery science (pp. 67e85). Springer. Spath, H. (1980). Cluster analysis algorithms for data reduction and classification of objects. Sun, R. (2003). In R. Sun, & C. L. Giles (Eds.), Sequence learning: Paradigms, algorithms, and applications (1st ed.). Sæbø, P. E., Andersen, S. M., Myrseth, J., Laerdahl, J. K., & Rognes, T. (2005). PARALIGN: Rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Research. https://doi.org/10.1093/nar/gki423. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences of the United States of America, 96(6), 2907e2912. https://doi.org/10.1073/PNAS.96.6.2907. Tanay, A., Sharan, R., & Shamir, R. (2002). Discovering statistically significant biclusters in gene expression data. Bioinformatics. https://doi.org/10.1093/bioinformatics/18.suppl_1.S136. Tomida, S., Hanai, T., Honda, H., & Kobayashi, T. (2002). Analysis of expression profile using fuzzy adaptive resonance theory. Bioinformatics, 18(8), 1073e1083. https://doi.org/10.1093/ bioinformatics/18.8.1073. Tran, T. N., Wehrens, R., & Buydens, L. M. C. (2006). KNN-kernel density-based clustering for highdimensional multivariate data. Computational Statistics and Data Analysis, 51(2), 513e525. https://doi.org/10.1016/j.csda.2005.10.001. Tsao, E. C. K., Bezdek, J. C., & Pal, N. R. (1994). Fuzzy Kohonen clustering networks. Pattern Recognition, 27(5), 757e764. https://doi.org/10.1016/0031-3203(94)90052-3. Vapnik, V. N. (1998). Statistical learning theory. Interpreting. Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586e600. https://doi.org/10.1109/72.846731. Vigdor, B., & Lerner, B. (2007). The Bayesian ARTMAP. IEEE Transactions on Neural Networks, 18(6), 1628e1644. https://doi.org/10.1109/TNN.2007.900234. Vlachos, M., Lin, J., & Keogh, E. (2003). A wavelet-based anytime algorithm for k-means clustering of time series. In Proc. workshop on clustering. Voth-Gaeddert, L. E., Al-Jabery, K. K., Olbricht, G. R., Wunsch, D. C., & Oerther, D. B. (2019). Complex associations between environmental factors and child growth: Novel mixed-methods approach. Journal of Environmental Engineering. https://doi.org/10.1061/(asce)ee.1943-7870.0001533. Wang, D., Shi, L., Yeung, D. 
S., Heng, P.-A., Wong, T.-T., & Tsang, E. C. C. (2005). Support vector clustering for brain activation detection. In Medical image computing and computer-assisted

Chapter 3  Clustering algorithms

99

intervention : MICCAI. International conference on medical image computing and computer-assisted intervention. Wang, D., Shi, L., Yeung, D. S., Tsang, E. C. C., & Ann Heng, P. (2007). Ellipsoidal support vector clustering for functional MRI analysis. Pattern Recognition, 40(10), 2685e2695. https://doi.org/10.1016/j. patcog.2007.01.017. Wang, H., Wang, W., Yang, J., & Yu, P. S. (2002). Clustering by pattern similarity in large data sets. In 2002 ACMSIGMODinternational conference on management of data. Wang, J. P. Z., Lindsay, B. G., Leebens-Mack, J., Cui, L., Wall, K., Miller, W. C., et al. (2004). EST clustering error evaluation and correction. Bioinformatics. https://doi.org/10.1093/bioinformatics/bth342. Wang, W., Yang, J., & Muntz, R. (1997). Sting: A statistical information grid approach to spatial data mining. In Proceedings of international conference on very large data. Warren Liao, T. (2005). Clustering of time series data - a survey. Pattern Recognition. https://doi.org/10. 1016/j.patcog.2005.01.025. Wen, Y., Mukherjee, K., & Ray, A. (2013). Adaptive pattern classification for symbolic dynamic systems. Signal Processing, 93(1), 252e260. https://doi.org/10.1016/j.sigpro.2012.08.002. Williamson, J. R. (1996). Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks, 9(5), 881e897. https://doi.org/10.1016/0893-6080(95) 00115-8. Wismuller, A., Meyer-Baese, A., Lange, O., Reiser, M. F., & Leinsinger, G. (2006). Cluster analysis of dynamic cerebral contrast-enhanced perfusion MRI time-series. IEEE Trans Med Imaging, 25(1), 62e73. https://doi.org/10.1109/TMI.2005.861002. Wu, Z., Gao, X., Xie, W., & Yu, J. (2005). Kernel method-based fuzzy clustering algorithm. Journal of Systems Engineering and Electronics, 16(1). Wunsch, D. C., Caudell, T. P., Capps, C. D., & Falk, R. A. (1991). An optoelectronic adaptive resonance unit. In Proceedings. IJCNN-91-Seattle: International joint conference on neural networks (pp. 541e549). Wunsch, D. C., Caudell, T. P., Capps, C. D., Marks, R. J., & Falk, R. A. (1993). An optoelectronic implementation of the adaptive resonance neural network. In IEEE transactions on neural networks/a publication of the IEEE neural networks council. Wunsch, D. C., & Mulder, S. (2004). Evolutionary algorithms, Markov decision processes, adaptive critic designs, and clustering: Commonalities, hybridization and performance. In Proceedings of international conference on intelligent sensing and information processing, 2004. Wushert, D. (1969). Mode analysis: A generalization of nearest neighbour which reduces chaining effects (with discussion). Numerical Taxonomy282e311. Xiong, Y., & Yeung, D. Y. (2004). Time series clustering with ARMA mixtures. Pattern Recognition. https:// doi.org/10.1016/j.patcog.2003.12.018. Xu, R., Anagnostopoulos, G. C., & Wunsch, D. C. (2002). Tissue classification through analysis of gene expression data using a new family of ART architectures. In Proceedings of the international joint conference on neural networks (Vol. 1, pp. 300e304). Xu, R., Damelin, S., Nadler, B., & Wunsch, D. C. (2010). Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps. Artificial Intelligence in Medicine. https://doi. org/10.1016/j.artmed.2009.06.001. Xu, X., Ester, M., Kriegel, H., & Sander, J. (1998). A distribution-based clustering algorithm for mining in large spatial databases. In 14th international conference on data engineering ( ICDE’ 98 (pp. 324e331). 
_ Xu, L., Krzyzak, A., & Oja, E. (1993). Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Transactions on Neural Networks, 4(4), 636e649. https://doi.org/10.1109/ 72.238318.

100

Computational Learning Approaches to Data Analytics in Biomedical Applications

Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks. https://doi.org/10.1109/TNN.2005.845141. Xu, R., & Wunsch, D. C. (2009). Clustering. Computational intelligence. Xu, R., & Wunsch, D. C. (2010). Clustering algorithms in biomedical research: A review. IEEE Reviews in Biomedical Engineering. https://doi.org/10.1109/RBME.2010.2083647. Xu, R., & Wunsch, D. C. (2011). Bartmap: A viable structure for biclustering. Neural Networks, 24(7), 709e716. https://doi.org/10.1016/j.neunet.2011.03.020. Yang, J., Wang, H., Wang, W., & Yu, P. (2003). Enhanced biclustering on expression data. In Proceedings 3rd IEEE symposium on BioInformaticsand BioEngineering, BIBE 2003. Yang, M. S. (1993). A survey of fuzzy clustering. Mathematical and Computer Modelling, 18(11), 1e16. https://doi.org/10.1016/0895-7177(93)90202-A. Yang, X. L., Song, Q., & Zhang, W. B. (2006). Kernel-based deterministic annealing algorithm for data clustering. IEE Proceedings: Vision, Image and Signal Processing, 153(5). https://doi.org/10.1049/ipvis:20050366. ¨ ., Achenie, L. E. K., & Srivastava, R. (2007). Systematic tuning of parameters in support vector Yilmaz, O clustering. Mathematical Biosciences. https://doi.org/10.1016/j.mbs.2006.09.013. Yip, K. Y., Cheung, D. W., & Ng, M. K. (2004). Harp: A practical projected clustering algorithm. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2004.74. Yona, G., Linial, N., & Linial, M. (2000). ProtoMap: Automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Research, 28(1), 49e55. https://doi.org/10.1093/nar/28.1.49. Yu, Q., Miche, Y., Eirola, E., Van Heeswijk, M., Se´Verin, E., & Lendasse, A. (2013). Regularized extreme learning machine for regression with missing data. Neurocomputing, 102, 45e51. https://doi.org/10. 1016/j.neucom.2012.02.040. Zafeiriou, S., Laskaris, N., & Transform, A. W. (2008). On the improvement of support vector techniques for clustering by means of whitening transform (Vol. 15, pp. 198e201). Zafonte, R. D., Bagiella, E., Ansel, B. M., Novack, T. A., Friedewald, W. T., Hesdorffer, D. C., et al. (2012). Effect of citicoline on functional and cognitive status amongpatients with traumatic brain injury. JAMA, 308(19), 1993e2000. https://doi.org/10.1001/jama.2012.13256. Zalewski, J. (1996). Rough sets: Theoretical aspects of reasoning about data. Control Engineering Practice. https://doi.org/10.1016/S0967-0661(96)90021-0. Zhang, D.-Q., & Chen, S.-C. (2004). A novel kernelized fuzzy C-means algorithm with application in medical image segmentation. Artificial Intelligence in Medicine, 32(1), 37e50. https://doi.org/10. 1016/j.artmed.2004.01.012. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering databases method for very large. ACM SIGMOD International Conference on Management of Data. https://doi.org/10. 1145/233269.233324. Zhang, Z., Schwartz, S., Wagner, L., & Miller, W. (2000). A greedy algorithm for aligning DNA sequences. Journal of Computational Biology. https://doi.org/10.1089/10665270050081478. Zhao, L., & Zaki, M. J. (2005). triCluster: An effective algorithm for mining coherent clusters in 3D microarray data. Sigmod, 694e705. https://doi.org/10.1145/1066157.1066236. Zhou, S.-M., & Gan, J. (2004). An unsupervised kernel based fuzzy C-means clustering algorithm with kernel normalisation. International Journal of Computational Intelligence and Applications(4), 355e373.

4 Selected approaches to supervised learning

4.1 Backpropagation and related approaches

4.1.1 Backpropagation

This chapter begins with backpropagation (BP) and backpropagation through time (BPTT). The former technique is the foundation for most (but not all) nonlinear training methods. At the heart of backpropagation is a more careful definition of the chain rule from calculus. This redefinition is called the chain rule for ordered derivatives, which makes it possible to account for many layers of a neural network. The technique was developed by Werbos in his 1974 Ph.D. dissertation (Werbos, 1974); this and some related publications are republished in (Werbos, 1994). The same technique grew in popularity in the mid-1980s, and a particularly popular formulation was published in (Hinton, McClelland, & Rumelhart, 1986; Rumelhart & McClelland, 1986). To summarize, consider a simple multilayer perceptron with one or more hidden layers, as in Fig. 4.1. This can be described as:

p^{n+1}(i) = \sum_{j=1}^{m} w^{n+1}(i, j)\, x^{n}(j) + b^{n+1}(i)    (4.1)

where p is the net input to the neuron, m is the number of inputs to the neuron, n is the index of the current layer (time step), w is the weight of the corresponding input, b is the neuron bias, i is the index of the neuron, and j is the index of the input. The output of neuron i is:

y^{n+1}(i) = f^{n+1}\left(p^{n+1}(i)\right)    (4.2)

where y is the neuron output and f is the activation function of the ith neuron. The total error E is calculated by comparing the output of the perceptron y(t) with the desired output d(t), as in (4.3):

E = \frac{1}{2}\,\left(y(t) - d(t)\right)^{2}    (4.3)

Error minimization using gradient descent is achieved by calculating the partial derivatives of E with respect to each weight in the network. The partial derivatives of the error are calculated in two stages: forward, as described in (4.1) and (4.2), and backward, in which the derivatives are backpropagated from the output layers back toward the input layers.

FIG. 4.1 Backpropagation in a multilayer perceptron with two hidden layers, showing the synaptic connections between neurons in the different layers.

The backward stage starts by computing ∂E/∂y for each of the output units. Differentiating (4.3) for a specific input pattern n gives (4.4) (Rumelhart & McClelland, 1986):

\frac{\partial E}{\partial y_n} = y_n - d_n    (4.4)

Applying the chain rule to compute the error derivatives with respect to the inputs, ∂E/∂x_n:

\frac{\partial E}{\partial x_n} = \frac{\partial E}{\partial y_n} \cdot \frac{dy_n}{dx_n}    (4.5)

The value of dy/dx is obtained by differentiating (4.2). Substituting in (4.5) gives:

\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_i} \cdot f'(p_i)    (4.6)

This shows how a change in the input x affects the error. Since the total input is a linear function of the weights on the connections, it is easy to calculate how the error is affected by changes in the input states and weights. The partial derivative of the error with respect to a weight is:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial x_i} \cdot \frac{\partial x_i}{\partial w_{ij}} = \frac{\partial E}{\partial x_i} \cdot y_j    (4.7)

The output of each neuron contributes to ∂E/∂y_j; the contribution resulting from neuron i on j is:

\frac{\partial E}{\partial x_i} \cdot \frac{\partial x_i}{\partial y_j} = \frac{\partial E}{\partial x_i} \cdot w_{ij}    (4.8)

From (4.8), a general formula for all connections to unit j can be generated:

\frac{\partial E}{\partial y_j} = \sum_i \frac{\partial E}{\partial x_i} \cdot w_{ij}    (4.9)


The partial derivatives of the error with respect to the weights are used to change the weights after every input-output pattern, which does not require dedicated memory for the derivatives themselves. This is called online training (Haykin, 2018). An alternative approach is to accumulate the error derivatives ∂E/∂w over all the input-output pairs in the training set before updating the weights accordingly; this is known as offline (or batch) training. The simplest version of gradient descent changes each weight by an amount proportional to the accumulated error derivative, as described in (4.10):

\Delta w = -\varepsilon \frac{\partial E}{\partial w}    (4.10)

As an improvement to this method, (Rumelhart & McClelland, 1986) used an acceleration method in which the current gradient is used to modify the velocity of the point in w space instead of its position:

\Delta w(n) = -\varepsilon \frac{\partial E}{\partial w} + \alpha \Delta w(n-1)    (4.11)

where n is the epoch's integer index and α is a momentum (decay) factor that defines the relative contribution of the training history and the current gradient to the weight change, α ∈ (0, 1]. For details about weight initialization techniques and the challenges that may arise in BP training, readers are advised to see (Goodfellow, 2015; Haykin, 2018).
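As a concrete illustration of (4.1)-(4.11), the following is a minimal NumPy sketch of batch gradient descent with momentum for a one-hidden-layer perceptron with sigmoid units. It is not the code from the book's repository (Al-jabery, 2019); the layer sizes, learning rate, momentum value, and random data are placeholder assumptions.

```python
import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 training patterns, 3 inputs
D = rng.random(size=(8, 1))          # desired outputs d(t)

# Weights and biases for one hidden layer and one output layer, cf. Eq. (4.1)
W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
eps, alpha = 0.1, 0.9                # learning rate and momentum, Eq. (4.11)
vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)

for epoch in range(200):
    # Forward stage, Eqs. (4.1)-(4.2)
    p1 = X @ W1 + b1;  y1 = sigmoid(p1)
    p2 = y1 @ W2 + b2; y2 = sigmoid(p2)

    # Total error, Eq. (4.3), summed over the batch
    E = 0.5 * np.sum((y2 - D) ** 2)

    # Backward stage, Eqs. (4.4)-(4.9)
    dE_dy2 = y2 - D                          # Eq. (4.4)
    dE_dp2 = dE_dy2 * y2 * (1 - y2)          # Eq. (4.6), f'(p) for the sigmoid
    dE_dy1 = dE_dp2 @ W2.T                   # Eq. (4.9)
    dE_dp1 = dE_dy1 * y1 * (1 - y1)

    # Gradients with respect to the weights, Eq. (4.7)
    gW2 = y1.T @ dE_dp2; gb2 = dE_dp2.sum(axis=0)
    gW1 = X.T @ dE_dp1;  gb1 = dE_dp1.sum(axis=0)

    # Momentum update, Eqs. (4.10)-(4.11)
    vW2 = -eps * gW2 + alpha * vW2; W2 += vW2
    vb2 = -eps * gb2 + alpha * vb2; b2 += vb2
    vW1 = -eps * gW1 + alpha * vW1; W1 += vW1
    vb1 = -eps * gb1 + alpha * vb1; b1 += vb1

print("final batch error:", E)
```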

4.1.2 Backpropagation through time

When dealing with time series data, a few approaches dominate. The classic approach, backpropagation through time (Werbos, 1990), is discussed in this section. Section 4.2 will discuss another popular approach, the training of recurrent neural networks, and Section 4.3 will discuss Long Short-Term Memory (LSTM). In BPTT, time (i.e., memory) is important, since the learning or classification process becomes more accurate if it considers what the network has seen previously. This is analogous to experience in biological intelligence, where memory is important to the organism. Generally, BPTT works by unrolling the recurrent neural network (RNN) in time, creating a copy of the network for each time step. Errors are then calculated and accumulated for each time step, the network is rolled back up, and the weights are updated. Each time step of the unrolled recurrent neural network is treated as an additional layer, given the order dependence of the problem, and the internal state from the previous time step is taken as an input for the subsequent time step. Fig. 4.2 shows a generalized form of BPTT on a recurrent neural network. The BPTT algorithm can be summarized in the following steps:

1. Present input-output pairs at specific time steps.
2. Unfold the network and calculate the total network error at each time step.
3. Fold the network back up and update the weights.
4. Repeat.


FIG. 4.2 General diagram for a BPTT network, showing copies of the input, hidden, and output layers unrolled over time steps t, t-1, ..., t-n.

The strength of BP is that it is applicable to any system, even those that depend on past calculations within the network itself. In a traditional NN, parameters are not shared across layers, so nothing needs to be summed. The key difference between BPTT and the standard BP described in the previous section is that the gradients for W are summed over the time steps. The BPTT algorithm can be computationally expensive if it goes back too many steps in time, since this greatly increases the number of computations required for a single weight update. In Fig. 4.2, every neuron can take input values from any other neuron at previous time steps (Werbos, 1974, 1990). However, assuming n = 2 in Fig. 4.2, (4.1) can simply be replaced with:

p_i(t) = \sum_{j=1}^{i-1} w_{ij}\, x_j(t) + \sum_{j=1}^{N+n} w'_{ij}\, x_j(t-1) + \sum_{j=1}^{N+n} w''_{ij}\, x_j(t-2)    (4.12)

The network described by (4.12) can be significantly simplified by fixing some weights to zero, particularly the w''. It can be simplified even further if all w' are set to zero as well, except those for w'_{ii}. Werbos provided two reasons for this: "parsimony" and historical reasons (Werbos, 1990). In Werbos' paper (Werbos, 1990), all neurons other than input-layer neurons can take as input the output of any other neuron through a time-lagged connection; the w' and w'' are the weights on these time-lagged connections between neurons. Code for BPTT in Python and Matlab can be found in the code repository (Al-jabery, 2019). Readers who wish to code this algorithm themselves will also find (Brownlee, 2017; Werbos, 1990) helpful.

To calculate the derivatives, backward time calculations are required. In forward calculations, an exact result requires the calculation of an entire Jacobian matrix, which is computationally expensive and sometimes infeasible in large networks. The derivative calculation (network adaptation) for a network that goes back two steps in time is described by:

\nabla x_i(t) = \nabla Y_i^N(t) + \sum_{j=i+1}^{N+n} W_{ji}\, \nabla p_j(t) + \sum_{j=m+1}^{N+n} W'_{ji}\, \nabla p_j(t+1) + \sum_{j=m+1}^{N+n} W''_{ji}\, \nabla p_j(t+2)    (4.13)

where ∇ stands for the gradient. Setting W'' to zero eliminates the last term in (4.13).


In order to adapt this network, ∇W'_{ij} and ∇W''_{ij} should be calculated:

\nabla W'_{ij} = \sum_{t=1}^{T} \nabla p_i(t+1)\, x_j(t)    (4.14)

\nabla W''_{ij} = \sum_{t=1}^{T} \nabla p_i(t+2)\, x_j(t)    (4.15)

The learning rate in BPTT is usually much smaller than that of basic BP. The network weights can be initialized to zeros, ones, or random values; there are also systematic approaches for initializing weights (Haykin, 2018; Werbos, 1990).
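The following is a minimal NumPy sketch of the unfold-accumulate-fold procedure described above, applied to a small single-layer recurrent network with a tanh hidden state and a linear output. The architecture, sequence length, and learning rate are placeholder assumptions, not the repository code of (Al-jabery, 2019).

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_h = 10, 2, 4                     # sequence length, input size, hidden size
X = rng.normal(size=(T, n_in))              # input sequence
D = rng.normal(size=(T, 1))                 # desired outputs at every time step

Wx = rng.normal(scale=0.3, size=(n_in, n_h))   # input-to-hidden weights
Wh = rng.normal(scale=0.3, size=(n_h, n_h))    # time-lagged (recurrent) weights, cf. w' in Eq. (4.12)
Wy = rng.normal(scale=0.3, size=(n_h, 1))      # hidden-to-output weights
lr = 0.01                                      # BPTT typically uses a small learning rate

for epoch in range(100):
    # 1. Unfold: forward pass, storing the state at every time step
    h = np.zeros((T + 1, n_h))              # h[t] is the state after step t (h[0] is the initial state)
    y = np.zeros((T, 1))
    for t in range(T):
        h[t + 1] = np.tanh(X[t] @ Wx + h[t] @ Wh)
        y[t] = h[t + 1] @ Wy

    # 2. Backward pass: accumulate the gradients over all time steps
    gWx, gWh, gWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dh_next = np.zeros(n_h)                 # error flowing back from step t+1
    for t in reversed(range(T)):
        dy = y[t] - D[t]                    # output error at step t
        gWy += np.outer(h[t + 1], dy)
        dh = Wy @ dy + dh_next              # error reaching the hidden state
        dp = (1.0 - h[t + 1] ** 2) * dh     # back through the tanh nonlinearity
        gWx += np.outer(X[t], dp)
        gWh += np.outer(h[t], dp)
        dh_next = Wh @ dp                   # pass the error one more step back in time

    # 3. Fold: a single weight update from the summed gradients
    Wx -= lr * gWx; Wh -= lr * gWh; Wy -= lr * gWy

print("final error:", 0.5 * np.sum((y - D) ** 2))
```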

4.2 Recurrent neural networks

Recurrent Neural Networks (RNNs) are important when analyzing time series data because they are designed to adaptively look backward over varying lengths of time. A simple RNN is shown in Fig. 4.3. The design is similar to the neural networks described in Section 4.1, with the key difference of feedback connections. Although engineers use the term "feedback," in neuroscience this is known as recurrence, so the field of neural networks has long adopted the latter term. Recurrent connections can be inputs from a node to itself or an input from a higher-level node back to a lower one; either of these creates a feedback loop. Such systems have many challenges. Analyzing the behavior of systems with feedback is more complex, so stability theorems have been developed, particularly in the case of real-time applications. An equally prominent issue is the increased demands of training such systems.


FIG. 4.3 In this figure, recurrent connections are enabled from higher to lower levels. A node sending its output back to its own input is a special case. The Z^{-1} notation, adopted from electrical engineering, indicates a single-step time delay. This architecture can learn the number of steps back to use for a time series analysis. Taken from Hu, X., Prokhorov, D. V., & Wunsch, D. C. (2007). Time series prediction with a weighted bidirectional multi-stream extended Kalman filter. Neurocomputing. https://doi.org/10.1016/j.neucom.2005.12.135.


The training methods of Section 4.1 can be applied with longer training times. Another successful approach is known as the Extended Kalman Filter (EKF). This technique treats the optimization of neural network weights as a control problem and has been particularly useful in RNNs and related models. The method can be computationally complex, but heuristics such as the node-decoupled EKF can reduce this considerably. Advances in computing power have also significantly reduced barriers to training RNNs. The rest of this section is a slightly modified and condensed version of the explanation in (Hu et al., 2007). In addition to this and the papers and books cited below, see (Haykin, 2001) for a thorough discussion.

Multi-stream EKF consists of the following: (1) gradient calculation by backpropagation through time (BPTT) (Werbos, 1990); (2) weight updates based on the extended Kalman filter; and (3) data presentation using multi-stream mechanics (Feldkamp, Prokhorov, Eagen, & Yuan, 2011). See also (Anderson & Moore, 1979; Haykin, 1991; Singhal & Wu, 2003). Weights are interpreted as states of a dynamic system (Anderson & Moore, 1979), which allows for efficient Kalman training. Given a network with M weights and N_L output nodes, the weight update for a training instance at time step n of the extended Kalman filter is given by:

A(n) = [R(n) + H'(n) P(n) H(n)]^{-1}    (4.16)

K(n) = P(n) H(n) A(n)    (4.17)

W(n+1) = W(n) + K(n) \xi(n)    (4.18)

P(n+1) = P(n) - K(n) H'(n) P(n) + Q(n)    (4.19)

P(0) = I/\eta_p, \quad R(0) = \eta_r I, \quad Q(0) = \eta_q I    (4.20)

where R(n) is a diagonal N_L-by-N_L matrix whose diagonal components are equal to or slightly less than 1; H(n) is an M-by-N_L matrix containing the partial derivatives of the output node signals with respect to the weights; P(n) is an M-by-M approximate conditional error covariance matrix; A(n) is an N_L-by-N_L global scaling matrix; K(n) is an M-by-N_L Kalman gain matrix; W(n) is a vector of M weights; and ξ(n) is the error vector of the output layer. The use of artificial process noise Q(n) avoids numerical difficulties and significantly enhances performance. Decoupled EKF (DEKF) (Puskorius & Feldkamp, 1991, 1994) was implemented in (Hu, Vian, Choi, Carlson, & Wunsch, 2002) as a natural simplification of EKF that ignores the interdependence of mutually exclusive groups of weights. The advantage of EKF over backpropagation is that EKF often requires significantly fewer presentations of training data and fewer overall training epochs (Puskorius & Feldkamp, 1991). Fig. 4.4 gives the flowchart.


FIG. 4.4 Training a neural network using EKF. The flowchart initializes the network weights W and the matrices P, Q, and R, then repeats the following until the end of the learning iterations: feed all the data forward through the network, compute H by BPTT, and compute A and K and update the network weights W and P (GEKF).

The multi-stream procedure (Feldkamp & Puskorius, 1994) was devised to cope with the conflicting requirements of training (Kolen & Kremer, 2001). It mitigates the risk that the currently presented training data could be learned at the expense of performance on previous data, which is called the recency effect (Puskorius & Feldkamp, 1997). Multi-stream training is based on the principle that each weight update should attempt to simultaneously satisfy the demands from multiple input-output pairs. In each cycle of training, a specified number N_S of starting points are randomly selected in a chosen set of files. Each starting point is the beginning of a stream. The multi-stream procedure consists of progressing in sequence through each stream, carrying out weight updates according to the current points. A consistent EKF update routine was devised for the multi-stream procedure: the training problem is treated as a single shared-weight network, in which the number of original outputs is multiplied by the number of streams. In multi-stream training, the number of columns in H(n) is correspondingly increased to N_S × N_L. Multi-streaming is useful whether using EKF training or some other approach.
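The sketch below implements one global EKF weight update of Eqs. (4.16)-(4.20) in isolation, assuming H(n) (the BPTT-derived derivatives) and the output-error vector ξ(n) are already available; here they are filled with random placeholders, and the η initialization values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M, NL = 12, 3                 # number of weights and output nodes

# Initialization, Eq. (4.20): P(0) = I/eta_p, R(0) = eta_r*I, Q(0) = eta_q*I
eta_p, eta_r, eta_q = 0.01, 1.0, 1e-4
P = np.eye(M) / eta_p
R = eta_r * np.eye(NL)
Q = eta_q * np.eye(M)
W = rng.normal(scale=0.1, size=M)

def ekf_step(W, P, H, xi):
    """One EKF weight update. H is M-by-NL (derivatives of the outputs with
    respect to the weights, from BPTT) and xi is the output-layer error vector."""
    A = np.linalg.inv(R + H.T @ P @ H)    # Eq. (4.16)
    K = P @ H @ A                         # Eq. (4.17)
    W = W + K @ xi                        # Eq. (4.18)
    P = P - K @ H.T @ P + Q               # Eq. (4.19)
    return W, P

# Placeholder H and xi standing in for a single training instance
H = rng.normal(size=(M, NL))
xi = rng.normal(size=NL)
W, P = ekf_step(W, P, H, xi)
print("updated weight vector:", W)
```

In the decoupled EKF mentioned above, P would instead be maintained as separate blocks for mutually exclusive groups of weights, which reduces the cost of these matrix operations.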

4.3 Long short-term memory

Long Short-Term Memory (LSTM) is a type of recurrent neural network with a strong ability to learn and predict sequential data. Research shows that a plain RNN is limited in maintaining long-term memory. The LSTM was invented to overcome this limitation by adding a memory structure, which can maintain its state over time, with gates that decide what to remember, what to forget, and what to output. The LSTM shows effective results in many applications that are inherently sequential, such as speech recognition, speech synthesis, language modeling, translation, and handwriting recognition. Several LSTM architectures offer major and minor changes to the standard one. The vanilla LSTM, described in (Greff, Srivastava, Koutnik, Steunebrink, & Schmidhuber, 2017), is the most commonly used LSTM variant in the literature and is considered a reference for comparisons. Fig. 4.5 shows a schematic of the vanilla LSTM block, which includes three gates (input, forget, and output), an input block, an output activation function, peephole connections, and a memory block called the cell.

(A) Input Gate: The input gate receives the new information and the prior predictions as inputs and provides a vector of information that represents the possibilities. The input information is regulated using an activation function; the logistic sigmoid function is commonly used here, so the values in the vector are between 0 and 1, and the highest value is the most likely to be predicted next. The generated vector is then combined with the viable possibilities previously stored in the cell to produce a collection of possibilities, whose values may range from less than -1 to more than 1. The information is held by the cell and manipulated by the gates. There are three gates; each has its own neural network and is trained to do its designated task. The input gate is described by:

\bar{i}_t = W_i x_t + R_i y_{t-1} + p_i \odot c_{t-1} + b_i, \qquad i_t = \sigma(\bar{i}_t)    (4.21)

where W_i are the input weights, R_i are the recurrent weights, p_i are the peephole weights, b_i are the bias weights, and σ is the logistic sigmoid.

FIG. 4.5 Schematic diagram of the LSTM (Greff et al., 2017).

\sigma(x) = \frac{1}{1 + e^{-x}}    (4.22)

The symbol ⊙ denotes element-wise (point-wise) multiplication of two vectors.

(B) Forget Gate: This gate removes the information that is no longer useful from the cell state. The two inputs, the new information at a particular time and the previous prediction, are fed to the gate and multiplied by weight matrices. The result is then passed through an activation function that gives 0 when the information should be forgotten and 1 when the information must be retained for future use. The result is multiplied with the possibilities collected from the input gate and the cell, and the useful possibilities are stored in the cell:

\bar{f}_t = W_f x_t + R_f y_{t-1} + p_f \odot c_{t-1} + b_f, \qquad f_t = \sigma(\bar{f}_t)    (4.23)

The cell formula is represented as follows:

c_t = z_t \odot i_t + c_{t-1} \odot f_t

(C) Output Gate (selection gate): The output gate makes a selection based on the new information and the previous predictions. The final prediction is the result of multiplying the output of the output gate with the normalized possibilities provided by the cell and the input gate. Since the collected possibilities have values that may range from more than 1 to less than -1, the tanh activation function is used for normalization.

\bar{o}_t = W_o x_t + R_o y_{t-1} + p_o \odot c_t + b_o, \qquad o_t = \sigma(\bar{o}_t)    (4.24)

The following formula represents the input block:

\bar{z}_t = W_z x_t + R_z y_{t-1} + b_z, \qquad z_t = g(\bar{z}_t)    (4.25)

where g is the hyperbolic tangent activation function, g(x) = tanh(x). The output block can be represented by the following formula:

y_t = h(c_t) \odot o_t    (4.26)
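As a minimal sketch of Eqs. (4.21)-(4.26), the code below runs one vanilla LSTM block forward over a short random sequence, with peephole connections included. The dimensions and random weights are placeholder assumptions, and the output activation h is taken to be tanh, as in the vanilla LSTM of (Greff et al., 2017).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, y_prev, c_prev, p):
    """One forward step of a vanilla LSTM block. p is a dict of weights:
    W* (input), R* (recurrent), p* (peephole), b* (bias)."""
    z = np.tanh(p["Wz"] @ x_t + p["Rz"] @ y_prev + p["bz"])                      # block input, Eq. (4.25)
    i = sigmoid(p["Wi"] @ x_t + p["Ri"] @ y_prev + p["pi"] * c_prev + p["bi"])   # input gate, Eq. (4.21)
    f = sigmoid(p["Wf"] @ x_t + p["Rf"] @ y_prev + p["pf"] * c_prev + p["bf"])   # forget gate, Eq. (4.23)
    c = z * i + c_prev * f                                                       # cell state update
    o = sigmoid(p["Wo"] @ x_t + p["Ro"] @ y_prev + p["po"] * c + p["bo"])        # output gate, Eq. (4.24)
    y = np.tanh(c) * o                                                           # block output, Eq. (4.26), h = tanh
    return y, c

rng = np.random.default_rng(3)
n_in, n_cell = 4, 5
p = {}
for g in "zifo":
    p["W" + g] = rng.normal(scale=0.3, size=(n_cell, n_in))
    p["R" + g] = rng.normal(scale=0.3, size=(n_cell, n_cell))
    p["b" + g] = np.zeros(n_cell)
    if g != "z":
        p["p" + g] = rng.normal(scale=0.3, size=n_cell)      # peephole weights

y, c = np.zeros(n_cell), np.zeros(n_cell)
for x_t in rng.normal(size=(10, n_in)):      # run a 10-step input sequence
    y, c = lstm_step(x_t, y, c, p)
print("final block output:", y)
```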

In some LSTM architectures, the peephole connections are omitted, as in the simplified variant called the Gated Recurrent Unit (GRU) (Cho et al., 2014); however, (Gers & Schmidhuber, 2000) argued that adding them to the architecture can make precise timing easier to learn. (Greff et al., 2017) concluded that the forget gate and the output activation function can significantly impact LSTM performance; removing either of them affects performance negatively. They claim that the output activation function is necessary to prevent the unrestrained cell state from propagating through the network and upsetting the stability of learning. As shown in Fig. 4.6, there are two directions for the arrows: forward and back.

FIG. 4.6 A typical LSTM cell structure showing its main three components: the input gate (new information), the forget gate (forgetting memory), and the output gate (selection of the collected possibilities).

4.4 Convolutional neural networks and deep learning

Convolutional neural networks, or CNNs (LeCun & Bengio, 1998), are multilayer perceptrons designed specifically for pattern recognition in one-dimensional data (e.g., time series) or two-dimensional data matrices (e.g., images), with a high degree of invariance to translation, scaling, skewing, or other distortions. The idea behind this branch of neural networks is motivated by biology and goes back to (Hubel & Wiesel, 1962). Some give credit to Fukushima, who developed a convolutional network in 1980 and called it the Neocognitron (Fukushima, 1980). Richard Bellman, the founder of dynamic programming, noted that high-dimensional data arise in many applications; the main difficulty, particularly in the context of pattern classification applications, is that the learning complexity grows exponentially with a linear increase in the dimensionality of the data (Bellman, 1954). Regardless of these historical arguments, the name "convolutional neural network" indicates that the network employs a mathematical operation called convolution, described in the following section.

This chapter provides an overview of two of the most important types of deep learning models: convolutional neural networks and deep belief networks (DBNs), along with their respective variations. These approaches have different strengths and weaknesses depending on the type of application. The following section begins by showing the structure of a CNN, specifically its training algorithms.

4.4.1 Structure of convolutional neural network

These neural networks use convolution in place of general matrix multiplication in at least one of their layers. They are very successful in many applications (Arbib et al., 2015; Haykin, 2018; LeCun & Bengio, 1998). Research into convolutional network architectures is growing so rapidly that a new architecture is announced every month, if not every week. However, CNN structures consist of the following patches of layers or phases, according to the form of constraints that govern their structures (a short code sketch of these phases follows the list):

1. Feature extraction. The mainstream approach to overcoming "the curse" has been to pre-process the data in a manner that reduces its dimensionality to one that can be effectively processed, for example by a classification engine. This dimensionality reduction scheme is often referred to as feature extraction. In a CNN, each neuron receives its synaptic input from a local receptive field in the previous layer, which leads to extracting local features. The exact position of the extracted features loses its importance after this process as long as its relative position to the other features is preserved. This process is implemented in the input layer, where the input is usually a multidimensional array of data such as image pixels, image transformations, patterns, time series, or video signals (Ferreira & Giraldi, 2017). Some researchers used Gabor filters as an initial pre-processing step to mimic the retinal response to visual excitation (Tivive & Bouzerdoum, 2003). Others have applied CNNs to various machine learning problems including face detection (Tivive & Bouzerdoum, 2004), document analysis (Simard, Steinkraus, & Platt, 2003), and speech detection (Sukittanon, Surendran, Platt, Burges, & Look, 2004).

2. Feature mapping. Each computational layer of the network consists of multiple feature maps, with each map composed of a plane of individual neurons that share the same set of synaptic weights. This form of structural constraint has two advantages over the other forms: shift invariance and fewer free parameters (Haykin, 2018). The convolution process is performed in this part of the network; it is the main building block of a CNN. These layers are comprised of a series of filters or kernels with nonlinear functions which extract local features from the input, and each kernel is used to calculate a feature map or kernel map. The first convolutional layer extracts low-level meaningful features such as edges, corners, textures, and lines (Krig, 2014). Each layer in this phase extracts features at a higher level than those extracted by the previous layers, and the highest-level features are extracted in the last layer (Chen, Han, Wang, Jeng, & Fan, 2006). Kernel size refers to the size of the filter that convolves around the feature map, while the amount by which the filter slides (the sliding process) is the step size, or stride. It controls how the filter convolves around the feature map; the filter convolves around the different layers of the input feature map by sliding one unit each time (Arbib et al., 2015). Another essential feature of CNNs is padding, which allows the input data to expand. For example, if there is a need to control the size of the output and the kernel width W independently, then zero padding is used for the input. The following equations describe how the convolution process is performed. In one dimension, the convolution between two functions is defined as follows:

g(x) = f(x) * h(x) = \int_{-\infty}^{\infty} f(s)\, h(x - s)\, ds    (4.27)

where f(x) and h(x) are two functions, * is the convolution symbol, and s is the variable of integration. In two dimensions, the convolution between two functions is defined as follows:

g(x, y) = f(x, y) * h(x, y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(s, t)\, h(x - s, y - t)\, ds\, dt    (4.28)

3. Detection or non-linearity. Assuming robust deep learning is achieved, a hierarchical network could be trained on a large set of observations, and signals from this network could later be fed to a relatively simple classification engine for robust pattern recognition. Robustness here refers to the ability to exhibit classification invariance to a diverse range of transformations and distortions, including noise, scale, rotation, various lighting conditions, displacement, etc. The prime purpose of convolution is to extract distinct features from the input. In this phase, the network learns complex models by passing the linear activations through nonlinear activation functions (Zheng, Liu, Chen, Ge, & Zhao, 2014). Examples of these activation functions are tanh(x), sigmoid(x), and the rectified linear unit (ReLU) (Albelwi & Mahmood, 2017). The last function increases the nonlinearity without affecting the receptive field of the convolutional layer: ReLU(x) = max(x, 0). This function accelerates the learning process of CNNs by reducing the gradient oscillation at all layers. In this stage, each layer consists of a generic multilayer network.

4. Subsampling (feature pooling). Capturing spatiotemporal dependencies, based on regularities in the observations, is viewed as a fundamental goal for deep learning systems. In this phase, the resolution and the network's computational complexity are reduced from the previous stages by exclusively choosing features that are robust to noise and distortion. The output of this phase is a filtered subset of the features that carry the most important, or core, information from the input data (Bengio, 2009; Ferreira & Giraldi, 2017). The pooling or subsampling is "tuned" by parameters in the learning process, but the basic mechanism is set by the network designer. The convolution process in this phase can be described by:

X_j^c = f\left( \sum_{i \in M_j} X_i^{c-1} * k_{ij}^c + b_j^c \right)    (4.29)

where c is the convolution layer, X is the input feature, k is the kernel map, b is the bias, M_j is the subset selected from the features, and i and j index the inputs and outputs, respectively.


5. Fully connected layers. The last part of a CNN topology consists of one-dimensional layers that are fully connected to all activations in the previous layers (Bengio, 2009). These layers are usually used to train another classifier, typically a feedforward neural network. Training is performed using cost functions such as softmax, sigmoid cross-entropy, or Euclidean loss in order to penalize the network when it deviates from the true labels (i.e., targets) (Namatēvs, 2018; Schmidhuber, 2015; Haykin, 2018). CNNs have also been trained with a temporal coherence objective (Robinson & Fallside, 1991) to leverage the frame-to-frame coherence found in videos, though this objective need not be specific to CNNs.
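The sketch below strings the phases above together in plain NumPy for a single small grayscale image: a discrete version of the convolution in (4.28)/(4.29), a ReLU nonlinearity, max-pooling for subsampling, and a fully connected softmax layer. The image, kernels, and class count are arbitrary placeholder assumptions; a practical CNN would be built and trained with a deep learning framework rather than hand-coded this way.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Discrete 2D convolution over the valid region, a discrete analogue of Eq. (4.28)."""
    kh, kw = kernel.shape
    H, W = img.shape
    flipped = kernel[::-1, ::-1]                 # true convolution flips the kernel
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * flipped)
    return out

def relu(x):                                     # detection / non-linearity phase
    return np.maximum(x, 0.0)

def max_pool(x, size=2):                         # subsampling (pooling) phase
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(4)
image = rng.random((12, 12))                     # placeholder input image
kernels = rng.normal(scale=0.5, size=(3, 3, 3))  # three 3x3 feature-map kernels, cf. Eq. (4.29)

# Feature extraction / mapping: one feature map per kernel, then ReLU and pooling
maps = [max_pool(relu(conv2d_valid(image, k))) for k in kernels]
features = np.concatenate([m.ravel() for m in maps])

# Fully connected layer with a softmax output over (say) 4 classes
Wfc = rng.normal(scale=0.1, size=(4, features.size))
print("class probabilities:", softmax(Wfc @ features))
```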

4.4.2 Deep belief networks

Deep Belief Networks (DBNs) were invented as a solution to the problems encountered when using traditional neural network training in deeply layered networks, such as slow learning, becoming stuck in local minima due to poor parameter selection, and requiring a lot of training data. DBNs were initially introduced in (Larochelle, Erhan, Courville, Bergstra, & Bengio, 2007) as probabilistic generative models to provide an alternative to the discriminative nature of traditional neural nets. Generative models provide a joint probability distribution over input data and labels, facilitating the estimation of both P(x|y) and P(y|x), while discriminative models only use the latter, P(y|x). As illustrated in Fig. 4.7, DBNs consist of several layers of neural networks known as Boltzmann machines, each of them restricted to a single visible layer and a single hidden layer.

FIG. 4.7 A deep belief network: an input layer of visible units, several layers of hidden units forming stacked restricted Boltzmann machines, and top-level units with label units acting as an associative memory. Each connection is a two-way line, carrying both generative and detective weights.


Restricted Boltzmann Machines (RBMs) can be considered a binary version of factor analysis: instead of many continuous factors, a set of binary variables determines the network output. RBMs allow for more efficient training of the generative weights of their hidden units. These hidden units are trained to capture higher-order data correlations that are observed in the visible units. The generative weights are obtained using an unsupervised, greedy, layer-by-layer method enabled by contrastive divergence (Hinton, 2002). The RBM training process, based on Gibbs sampling, starts by presenting a vector v to the visible units, which forward values to the hidden units. In the reverse direction, the visible unit inputs are stochastically chosen to reconstruct the original input. Finally, these new visible neuron activations are forwarded so that single-step reconstruction hidden unit activations, h, can be attained.
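Below is a minimal sketch of the Gibbs-sampling step and a single-step contrastive-divergence (CD-1) weight update for a binary RBM, following the description above; the layer sizes, learning rate, and random binary data are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_vis, n_hid = 6, 3
W = rng.normal(scale=0.1, size=(n_vis, n_hid))   # visible-to-hidden weights
b_v = np.zeros(n_vis)                            # visible biases
b_h = np.zeros(n_hid)                            # hidden biases
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

data = (rng.random((20, n_vis)) > 0.5).astype(float)   # placeholder binary training vectors

for epoch in range(50):
    for v0 in data:
        # Positive phase: present v to the visible units and sample the hidden units
        ph0 = sigmoid(v0 @ W + b_h)
        h0 = sample(ph0)
        # Negative phase: stochastically reconstruct the visible units,
        # then recompute the single-step hidden activations
        pv1 = sigmoid(h0 @ W.T + b_v)
        v1 = sample(pv1)
        ph1 = sigmoid(v1 @ W + b_h)
        # CD-1 update (Hinton, 2002): data correlations minus reconstruction correlations
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b_v += lr * (v0 - v1)
        b_h += lr * (ph0 - ph1)

print("learned weights:\n", W)
```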

4.4.3 Variational autoencoders

One of the applications of deep learning is using a multilayer neural network to convert high-dimensional data to a low-dimensional representation. This neural network structure is called an autoencoder. This entire Section 4.4.3 is a paraphrased excerpt from the blog (Wu, 2019), used with permission. It, in turn, is a condensed explanation of the original contribution in (Kingma & Welling, 2013), and also benefitted from the synopsis in (Doersch, 2016). A Variational Autoencoder (VAE) can be defined as a stochastic version of a conventional autoencoder which imposes some constraints on the distribution of the latent variables. The upper portion of Fig. 4.8 shows the basic concept of an autoencoder: an input is mapped to itself through several layers of a neural network, where the middle layer has a chokepoint of fewer nodes, forcing a compressed representation of the data.

FIG. 4.8 Variational Autoencoder Diagram. The target output is the same as the input. Due to the compressed hidden layer, this is necessarily an estimate rather than an exact mapping. The chokepoint created by the smaller hidden layer in the center creates a set of latent mappings. Figure adapted from (Kan, 2018; Tschannen, Bachem, & Lucic, 2018).


Certain constraints on the transfer functions of the hidden units (discussed below) constitute the "variational" adjective. This section discusses the derivation of a variational autoencoder and how to implement it in TensorFlow (TensorFlow, 2015). The VAE aims to learn the underlying distribution of a dataset, which is unknown and usually complicated. To set this up, we need the common measure of the discrepancy between two distributions p(x) and q(x) (it is not symmetric, and thus strictly not a distance metric): the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951), which is defined by:

D_{KL}[p(x)\,\|\,q(x)] = E_{p(x)}\left[\log \frac{p(x)}{q(x)}\right]    (4.30)
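As a quick numeric illustration of (4.30) and its asymmetry, the snippet below evaluates the KL divergence between two small, arbitrary discrete distributions in both directions.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[p || q] = E_p[log p(x) - log q(x)] for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print("D_KL[p||q] =", kl_divergence(p, q))   # not equal to ...
print("D_KL[q||p] =", kl_divergence(q, p))   # ... the reverse direction
```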

To model the true data distribution, the KL divergence between the true data distribution q(x) and the model distribution p_θ(x) should be minimized, where θ is the optimization parameter of the model, as described by (4.31):

D_{KL}[q(x)\,\|\,p_\theta(x)] = E_{q(x)}[\log q(x) - \log p_\theta(x)] = -H[q(x)] - E_{q(x)}[\log p_\theta(x)]    (4.31)

where q(x) is the underlying and unchanging distribution of the dataset, so its entropy H[q(x)] is a constant. Therefore,

\min_\theta D_{KL}[q(x)\,\|\,p_\theta(x)] = \max_\theta E_{q(x)}[\log p_\theta(x)]    (4.32)

From (4.32), minimizing the KL divergence between the data distribution and the model distribution is equivalent to the maximum likelihood method. VAE is a latent variable generative model which learns the distribution of the data space x ∈ X from a latent space z ∈ Z. We can define a prior over the latent space, p(z), which is usually a standard normal distribution; then we can model the data distribution with a complex conditional distribution p_θ(x|z), so the model data likelihood can be computed as described by (4.33):

p_\theta(x) = \int_z p_\theta(x|z)\, p(z)\, dz    (4.33)

However, direct maximization of the likelihood is intractable because of the integration. VAE instead optimizes a lower bound on log p_θ(x), which can be derived using Jensen's inequality (Jensen, 1906): if f is a convex function and X is a random variable, then

E[f(X)] \ge f(E[X])    (4.34)

and the equality holds only when X = E[X]. (Because log is concave, the inequality is reversed for it: log E[X] ≥ E[log X].) In our case, (4.34) and (4.33) can be combined in (4.35)-(4.37):

\log p_\theta(x) = \log \int_z p_\theta(x, z)\, dz    (4.35)

= \log \int_z q_\phi(z|x)\, \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz    (4.36)

= \log E_{q_\phi(z|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \ \ge\ E_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]    (4.37)


The last line of the derivation is called the Evidence Lower Bound (ELBO), which is used frequently in Variational Inference. The term qf ðzjxÞ is an approximate distribution of the true posterior pq ðzjxÞ of the latent variable z given datapoint x. qf ðzjxÞ; which is an instance of the Variational Inference family, it is used to perform inference of the data in the first place. For example, given a raw datapoint x, specify how to learn its representations z like shape, size, or category. The posterior of latent variables pq ðzjxÞ ¼ pq ðxjzÞpðzÞ=pq ðxÞ is also intractable because pq ðxÞ is intractable. VAE introduces a recognition model qf ðzjxÞ to approximate the true posterior pq ðzjxÞ. Similarly, to minimize the KL divergence between them as described by (4.38):



$$\begin{aligned}
D_{KL}[q_\phi(z|x)\,\|\,p_\theta(z|x)] &= E_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p_\theta(z|x)] \\
&= E_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p_\theta(x|z) - \log p(z)] + \log p_\theta(x) \\
&= -\mathrm{ELBO} + \log p_\theta(x). \tag{4.38}
\end{aligned}$$

Taking $\log p_\theta(x)$ out of the expectation (because it does not depend on $z$) and rearranging (4.38) leads to (4.39):

$$\mathrm{ELBO} = \log p_\theta(x) - D_{KL}[q_\phi(z|x)\,\|\,p_\theta(z|x)]. \tag{4.39}$$

Maximizing the ELBO therefore simultaneously minimizes the KL divergence between $q_\phi(z|x)$ and $p_\theta(z|x)$ and maximizes $\log p_\theta(x)$. To compute it, the ELBO can be rewritten as in (4.40)-(4.41):

$$\mathrm{ELBO} = E_{q_\phi(z|x)}[\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)] \tag{4.40}$$

$$= E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x)\,\|\,p(z)]. \tag{4.41}$$

The first term on the right-hand side is the reconstruction error, which is the mean-square error for real-valued data or the cross-entropy for binary data. The second term is the KL divergence between the approximate posterior and the prior of the latent variable $z$, which can be computed analytically. Two conclusions follow from (4.41):

1. The distribution of $z$ given $x$ can be computed using $q_\phi(z|x)$, and the distribution of $x$ given $z$ can be computed using $p_\theta(x|z)$. If both are implemented with neural networks, they are the encoder and decoder of an autoencoder, respectively.
2. A VAE can generate new data, while conventional autoencoders cannot. The first term in (4.41) matches the deterministic objective of a conventional autoencoder, but the second term forces the mapping from data to latent variables to stay close to the prior. Any time a latent variable is sampled from the prior, the decoder therefore knows what to generate, whereas a conventional autoencoder scatters its latent variables with many gaps between them, and samples drawn from those gaps may not correspond to anything the encoder intended.


The implementation of VAE requires the following steps:

1. Compute the $D_{KL}[q_\phi(z|x)\,\|\,p(z)]$ term. We assume the prior of $z$ is standard Gaussian, $p(z) = \mathcal{N}(0, I)$. This is suitable when implementing a VAE with neural networks, because the decoder network can transform the standard Gaussian distribution into whatever is required at some layer, regardless of the true prior. The approximate posterior $q_\phi(z|x)$ is also taken to be Gaussian, $\mathcal{N}(z; \mu, \sigma^2)$, with parameters $\mu$ and $\sigma$ computed by the encoder. Then $D_{KL}[q_\phi(z|x)\,\|\,p(z)]$ is computed using simple calculus:

$$\begin{aligned}
D_{KL}[q_\phi(z|x)\,\|\,p(z)] &= E_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p(z)] \\
&= \int \mathcal{N}(z; \mu, \sigma^2)\left[\log \mathcal{N}(z; \mu, \sigma^2) - \log \mathcal{N}(z; 0, I)\right] dz \\
&= \frac{1}{2}\sum_{j=1}^{J}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right), \tag{4.42}
\end{aligned}$$

where $j$ indexes the dimensions of the vector $z$, and $\mu_j$ and $\sigma_j$ denote the $j$th elements of the mean and standard deviation of $z$, respectively.

2. Compute the gradient of the ELBO. The ELBO contains encoder parameters $\phi$ and decoder parameters $\theta$. The gradient with respect to $\theta$ is easily computed, as described by (4.43):

$$\nabla_\theta \mathrm{ELBO} = \nabla_\theta E_{q_\phi(z|x)}[\log p_\theta(x|z)] = E_{q_\phi(z|x)}[\nabla_\theta \log p_\theta(x|z)] \approx \frac{1}{L}\sum_{l=1}^{L}\nabla_\theta \log p_\theta(x|z_l), \tag{4.43}$$

where the last step is a Monte Carlo estimate with $z_l \sim q_\phi(z|x)$. However, because a common gradient estimator such as the score function estimator,

$$\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[f(z)\nabla_\phi \log q_\phi(z)] \approx \frac{1}{L}\sum_{l=1}^{L} f(z_l)\nabla_\phi \log q_\phi(z_l), \tag{4.44}$$

is impractical due to its high variance, the ELBO gradient with respect to $\phi$ needs special handling.

The VAE uses a "reparameterization trick" to derive an unbiased gradient estimator. Instead of sampling $z \sim q_\phi(z|x)$ directly, it reparameterizes the random variable using a differentiable transformation $g_\phi(\varepsilon, x)$ of an auxiliary noise variable $\varepsilon$, as described in (4.45):

$$\tilde{z} = g_\phi(\varepsilon, x) \quad \text{with} \quad \varepsilon \sim p(\varepsilon). \tag{4.45}$$


In the univariate Gaussian case, $z \sim \mathcal{N}(\mu, \sigma^2)$, we can sample $\varepsilon \sim \mathcal{N}(0, 1)$ and use the transformation $z = \mu + \sigma\varepsilon$. In this way, the gradient with respect to $\phi$ can be computed as

$$\nabla_\phi \mathrm{ELBO} = \nabla_\phi\left(E_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x)\,\|\,p(z)]\right) \approx \nabla_\phi\left(\frac{1}{L}\sum_{l=1}^{L}\log p_\theta\big(x|z^{(l)}\big) - D_{KL}[q_\phi(z|x)\,\|\,p(z)]\right), \tag{4.46}$$

where $z^{(l)} = g_\phi(x, \varepsilon_l) = \mu + \sigma \odot \varepsilon_l$ with $\varepsilon_l \sim \mathcal{N}(0, I)$. The implementation code for the VAE described above is listed in (Wu, 2019).
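To connect the derivation to code, the following minimal NumPy sketch computes the two ELBO terms of (4.41), assuming binary data (so the reconstruction term is a cross-entropy) and a diagonal Gaussian posterior, together with the reparameterized sampling of (4.45). The function names and shapes are illustrative assumptions, not part of the reference implementation; a complete TensorFlow version is available in (Wu, 2019).

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss_terms(x, x_recon, mu, log_var):
    """Return the two ELBO terms of Eq. (4.41), averaged over a batch.

    x, x_recon : arrays of shape (batch, data_dim); binary data assumed,
                 so the reconstruction term is a cross-entropy.
    mu, log_var: encoder outputs of shape (batch, latent_dim).
    """
    eps = 1e-7
    # Reconstruction term: negative Bernoulli log-likelihood (cross-entropy).
    recon = -np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps), axis=1)
    # Closed-form KL between N(mu, sigma^2) and N(0, I), Eq. (4.42).
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=1)
    return recon.mean(), kl.mean()

def reparameterize(mu, log_var):
    """Reparameterization trick, Eq. (4.45): z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

In a full implementation the encoder would produce `mu` and `log_var`, `reparameterize` would draw the latent sample fed to the decoder, and the sum of the two loss terms would be minimized by automatic differentiation.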

4.5 Random forest, Classification and Regression Trees, and related approaches

The techniques in this section build on well-established methods (Breiman, Friedman, Stone, & Olshen, 1984). Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Duda, Hart, & Stork, 2000). The algorithm combines multiple decision trees to provide more accurate and stable predictions. The generalization error for forests converges to a limit as the number of trees in the forest becomes large, and it depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably (Schapire, 2013) to AdaBoost (Freund & Schapire, 1997) but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting; internal estimates are also used to measure variable importance. These ideas are applicable to regression as well (Breiman, Friedman, Olshen, & Stone, 2017), and the methods continue to be popular due to their theoretical foundations, simplicity, and performance. As discussed in (Breiman et al., 1984) and elsewhere, these techniques are consistent with methods from Bayesian inference. While finding an optimal decision tree is NP-hard, a reasonable initial tree can often be found very quickly. Classification and Regression Trees are an important set of techniques for analyzing large datasets. They tend to be fast and easily understood. However, they can grow into excessively complex trees, and small changes in the data, or the introduction of new data, can produce significantly different results. Regardless, they have been, and remain, an important class of approaches; see (Loh, 2014) for a survey article that also includes some comments about software tools.


For an introduction to several related approaches, see (Duda et al., 2000), particularly the reviews of Bayesian and maximum-likelihood methods and nonparametric techniques. The foundation of Bayesian inference is Bayes' theorem: given events A and B with P(B) ≠ 0, the conditional probability P(A|B) is given by

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}. \tag{4.47}$$
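As a small, hypothetical worked example of (4.47) in a biomedical setting, the sketch below computes the posterior probability of disease given a positive screening test; the prevalence and accuracy figures are invented for illustration only.

```python
# Hypothetical screening-test example of Eq. (4.47); all numbers are made up.
p_disease = 0.01              # prior P(A): disease prevalence
p_pos_given_disease = 0.95    # P(B|A): test sensitivity
p_pos_given_healthy = 0.05    # false positive rate

# Total probability of a positive test, P(B).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B) = P(B|A) P(A) / P(B).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # about 0.16
```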

Usually one applies this theorem to many A's and B's. It can be very useful for automated inference, but it requires a priori estimates of P(A), called Bayesian priors, and one usually also makes assumptions about the underlying distributions. Optimality theorems exist when certain assumptions are satisfied (Brownlee, 2016; Donges, 2018; Loh, 2014). The random forest algorithm can be summarized by the following pseudocode (a brief scikit-learn sketch follows the pseudocode):

1. Define the problem type Pt.
2. Select an initial subset of features.
   2.1. Create a decision tree based on the selected subset of features.
   2.2. Update the decision tree.
   2.3. Increment the tree counter: n = n + 1.
3. Repeat (2) for all trees in the random forest (i.e., n times).
4. If Pt == "Prediction" then go to (4.1); else if Pt == "Classification" then go to (5).
   4.1. For i = 1 to n: calculate the tree prediction yi.
   4.2. Calculate the total output of the random forest: Yt = (Σ yi) / n.
   4.3. Return Yt.
5. For all input patterns:
   5.1. Read input pattern: for j = 1 to the number of input patterns, read Xj.
   5.2. For i = 1 to n (for each tree):
        5.2.1. Assign label Ci to Xj using tree i.
   5.3. Find Cj, the most frequent Ci (majority vote).
   5.4. Assign Xj to class Cj.
   5.5. Repeat for the next pattern.
6. Return C' = [Cj] (label vector).
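The following scikit-learn sketch mirrors the pseudocode above; the dataset generators and parameter choices (e.g., 100 trees) are placeholder assumptions used only so the example runs end to end.

```python
# Minimal random forest sketch for both classification (majority vote) and
# prediction (average of tree outputs); data are synthetic placeholders.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: n_estimators is the number of trees n in the pseudocode (step 5).
Xc, yc = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(Xc, yc, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Prediction (regression): the forest output is the average of the tree outputs (step 4).
Xr, yr = make_regression(n_samples=500, n_features=20, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(Xr, yr)
print("first regression prediction:", reg.predict(Xr[:1]))
```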


For random forest code in Python, see (Al-jabery, 2019). As with any algorithm, there are advantages and disadvantages to using it. The strengths and weaknesses of this algorithm are listed below.

Strengths:
1. Unbiased algorithm: It has the power of mathematical democracy, since it relies on decisions made by multiple random decision trees and uses majority voting to determine the final output.
2. Multi-purpose algorithm: The random forest algorithm can be used for both prediction and classification, as discussed previously (Donges, 2018).
3. Stability: The appearance of a new data point in the dataset is unlikely to affect the entire algorithm; instead, it affects only one tree.
4. Resistance to overfitting: If there are enough trees in the forest, the classifier is unlikely to overfit the model.
5. Robustness: The algorithm works well with both categorical and numerical datasets, and it tolerates and handles missing values efficiently (Malik, 2018).

Weaknesses:
1. Complexity: Random forests require considerably more computational resources because of the large number of decision trees joined together. That amount is still less, however, than the computational power required for a single unified decision tree that compensates for all of them.
2. Slow learning: Like many other algorithms, random forest can suffer from lengthy training times while it adapts to and learns the given patterns, although this mainly occurs when a large number of decision trees is used. In general, these algorithms train quickly but create predictions slowly. In most real-world applications the random forest algorithm is fast enough, but there are certainly situations where run-time performance is important and other approaches would be preferred (Brownlee, 2016; Donges, 2018; Loh, 2014).

4.6 Summary

This chapter reviews some of the most popular algorithms in supervised learning. It discusses recurrent neural networks and their training algorithms, such as backpropagation and backpropagation through time, as well as long short-term memory cells, including their architecture, concepts, and applications. Representative concepts, types, and applications of deep learning algorithms were explained, and two of the most popular network structures were discussed: convolutional neural networks and deep belief networks. The chapter also provides an overview of the random forest algorithm, detailing how it works, showing pseudocode, and listing the strengths and weaknesses of this class of algorithms.


References Al-jabery, K. (2019). ACIL group/Computational_Learning_Approaches_to_Data_Analytics_in_Biomedical_ Applications GitLab. Albelwi, S., & Mahmood, A. (2017). A framework for designing the architectures of deep Convolutional Neural Networks. Entropy, 19(6), 242. https://doi.org/10.3390/e19060242. Anderson, B. D. O., & Moore, J. B. (1979). Rcommended 2) optimal filtering. Dover Publications. https:// doi.org/10.1109/TSMC.1982.4308806. Arbib, M. A., Stephen, G., Hertz, J., Jeannerod, M., Jenkins, B. K., Kawato, M., et al. (2015). Hierarchical recurrent neural encoder for video representation with application to captioning. Compute, 0(1), 1029e1038. abs/1503 https://doi.org/10.1016/j.ins.2016.01.039. Bellman, R. (1954). The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6), 503e516. https://doi.org/10.1090/S0002-9904-1954-09848-8. Bengio, Y. (2009). Learning deep architectures for AI. In Foundations and Trends in machine learning (Vol. 2). https://doi.org/10.1561/2200000006. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017). Classification and regression trees. In Classification and regression trees. https://doi.org/10.1201/9781315139470. Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and regression trees (wadsworth statistics/probability). New York: CRC Press. Brownlee, J. (2016). Master machine learning algorithms discover how they work and implement them from scratch. Machine Learning Mastery With Python. Brownlee, J. (2017). Machine learning mastery. Book. Chen, Y. N., Han, C. C., Wang, C. T., Jeng, B. S., & Fan, K. C. (2006). The application of a convolution neural network on face and license plate detection. In Proceedings - international conference on pattern recognition. https://doi.org/10.1109/ICPR.2006.1115. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Doersch, C. (2016). Tutorial on variational autoencoders. Donges, N. (2018). The random forest algorithm e towards data science. Retrieved May 29, 2019, from https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Wiley. Feldkamp, L. A., Prokhorov, D. V., Eagen, C. F., & Yuan, F. (2011). Enhanced multi-stream kalman filter training for recurrent networks. In Nonlinear modeling. https://doi.org/10.1007/978-1-4615-5703-6_ 2. Feldkamp, L. A., & Puskorius, G. V. (1994). Training controllers for robustness: Multi-stream DEKF. Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), 4, 2377e2382. https://doi.org/10.1109/ICNN.1994.374591. Ferreira, A., & Giraldi, G. (2017). Convolutional Neural Network approaches to granite tiles classification. Expert Systems with Applications, 84, 1e11. https://doi.org/10.1016/j.eswa.2017.04.053. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119e139. https://doi.org/10. 1006/jcss.1997.1504. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193e202. https://doi.org/10. 1007/BF00344251.

122

Computational Learning Approaches to Data Analytics in Biomedical Applications

Gers, F. A., & Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the IEEEINNS-ENNS international joint conference on neural networks. IJCNN 2000. Neural computing: New challenges and perspectives for the new millennium (pp. 189e194). https://doi.org/10.1109/ IJCNN.2000.861302. Goodfellow, I. (2015). Deep learning. In Nature methods (Vol. 13). https://doi.org/10.1038/nmeth.3707. Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., & Schmidhuber, J. (2017). Lstm: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222e2232. https:// doi.org/10.1109/TNNLS.2016.2582924. Haykin, S. (1991). Adaptive filter theory. Englewood Clilfs, NJ: Prentice Hall. Haykin, Simon (2001). In Simon Haykin (Ed.), Kalman filtering and neural networks (first ed.) https:// doi.org/10.1002/0471221546. Haykin, S. (2018). Neural networks and learning machines (3rd ed.). Pearson India. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771e1800. https://doi.org/10.1162/089976602760128018. Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1)Foundations. https:// doi.org/10.1146/annurev-psych-120710-100344. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160(1), 106e154. https://doi.org/10.1113/jphysiol. 1962.sp006837. Hu, X., Prokhorov, D. V., & Wunsch, D. C. (2007). Time series prediction with a weighted bidirectional multi-stream extended Kalman filter. Neurocomputing, 70(13e15), 2392e2399. https://doi.org/10. 1016/j.neucom.2005.12.135. Hu, X., Vian, J., Choi, J., Carlson, D., & Wunsch, D. C. (2002). Propulsion vibration analysis using neural network inverse modeling. In Proceedings of the 2002 international joint conference on neural networks. IJCNN’02 (cat. No.02CH37290) (pp. 2866e2871). https://doi.org/10.1109/IJCNN.2002.1007603. Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les ine´galite´s entre les valeurs moyennes. Acta Mathematica, 30, 175e193. https://doi.org/10.1007/BF02418571. Kan, C. E. (2018). What the heck are VAE-GANs?. Retrieved June 1, 2019, from Towards Data Science website: https://towardsdatascience.com/what-the-heck-are-vae-gans-17b86023588a. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. Kolen, J. F., & Kremer, S. C. (2001). A field guide to dynamical recurrent networks (1st ed.). Wiley-IEEE Press. Krig, S. (2014). Computer vision metrics: Survey, taxonomy, and analysis. In Computer vision metrics: Survey, taxonomy, and analysis. https://doi.org/10.1007/978-1-4302-5930-5. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79e86. https://doi.org/10.1214/aoms/1177729694. Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on machine learning - ICML ’07. https://doi.org/10.1145/1273496.1273556. LeCun, Y., & Bengio, Y. (1998). Convolution networks for images, speech, and time-series. Igarss 2014. https://doi.org/10.1007/s13398-014-0173-7.2. Loh, W. Y. (2014). Fifty years of classification and regression trees. International Statistical Review. 
https:// doi.org/10.1111/insr.12016. Malik, U. (2018). Random forest algorithm with Python and scikit-learn. Retrieved May 29, 2019, from https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/.


vs, I. (2018). Deep convolutional neural networks: Structure, feature extraction and training. Namate Information Technology and Management Science, 20(1), 40e47. https://doi.org/10.1515/itms-2017-0007. Puskorius, G. V., & Feldkamp, L. A. (1991). Decoupled extended Kalman filter training of feedforward layered networks. IJCNN-91-Seattle International Joint Conference on Neural Networks, i, 771e777. https://doi.org/10.1109/IJCNN.1991.155276. Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279e297. https://doi. org/10.1109/72.279191. Puskorius, G. V., & Feldkamp, L. A. (1997). Multi-stream extended Kalman filter training for static and dynamic neural networks. In 1997 IEEE international conference on systems, man, and cybernetics. Computational cybernetics and simulation, 3, 2006e2011. https://doi.org/10.1109/ICSMC.1997.635150. Robinson, T., & Fallside, F. (1991). A recurrent error propagation network speech recognition system. Computer Speech & Language, 5(3), 259e274. https://doi.org/10.1016/0885-2308(91)90010-N. Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing. Cambridge, Mass: MIT Press. https://doi.org/10.1037//0021-9010.76.4.578. Schapire, R. E. (2013). The boosting approach to machine learning: An overview. https://doi.org/10.1007/ 978-0-387-21579-2_9. Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85e117. https://doi.org/10.1016/j.neunet.2014.09.003. Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the international conference on document analysis and recognition. ICDAR. https://doi.org/10.1109/ICDAR.2003.1227801. Singhal, S., & Wu, L. (2003). Training feed-forward networks with the extended Kalman algorithm. https://doi.org/10.1109/icassp.1989.266646. Sukittanon, S., Surendran, A. C., Platt, J. C., Burges, C. J. C., & Look, B. (2004). Convolutional networks for speech detection. International Speech Communication Association (Interspeech). TensorFlow. (2015). TensorBoard: Visualizing learning. Retrieved June 1, 2019, from TensorFlow website: https://www.tensorflow.org/guide/summaries_and_tensorboard. Tivive, F. H. C., & Bouzerdoum, A. (2003). A new class of convolutional neural networks (SICoNNets) and their application of face detection. In Proceedings of the International Joint Conference on Neural Networks, 3, 2157e2162, IEEE. https://doi.org/10.1007/978-0-387-21579-2_9 Tivive, F. H. C., & Bouzerdoum, A. (2004). A new class of convolutional neural networks (SICoNNets) and their application of face detection. https://doi.org/10.1109/ijcnn.2003.1223742. Tschannen, M., Bachem, O., & Lucic, M. (2018). Recent advances in autoencoder-based representation learning. Retrieved from http://arxiv.org/abs/1812.05069. Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences (Harvard). Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550e1560. https://doi.org/10.1109/5.58337. Werbos, P. J. (1994). The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience. Wu, T. (2019). Variational autoencoder. Retrieved June 19, 2019, from GitHub website: https:// hustwutao.github.io/2019/06/19/variational-autoencoder/. 
Zheng, Y., Liu, Q., Chen, E., Ge, Y., & Zhao, J. L. (2014). Time series classification using multi-channels deep convolutional neural networks. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). https://doi.org/10.1007/978-3-319-08010-9_33.

5 Statistical analysis tools

5.1 Introduction

Statistics involves the collection, analysis, and interpretation of data. Often, it also involves the study of population characteristics by inference from sampling. In many fields of research, it is of interest to better understand some property of a large population (e.g., the average income of all residents of a state). The population value is called a parameter, and it typically remains unknown since it is difficult to collect data on all individuals in the population. Data are then collected on a smaller subset of the population, called a sample, and the population parameter is estimated by calculating a statistic from the sample. A statistical inference is a conclusion that patterns learned from sample data can be extended to a broader context, such as a population, through a probability model (Ramsey & Schafer, 2012). There are many considerations to make when designing a study that affect the inferences that can be made. Two main types of statistical inferences that are commonly of interest are population inferences and causal inferences (Ramsey & Schafer, 2012). Population inferences involve drawing conclusions about a population parameter from a sample, whereas causal inferences involve trying to establish a cause-and-effect relationship between variables. The focus of the discussion here will be on population inferences, since causal inferences typically involve a designed experiment. Designed experiments are very important in biomedical studies when answering hypothesis-driven questions; for example, randomized controlled studies are foundational in clinical trials (Sibbald & Roland, 1998). However, the focus here is on connections to cluster analysis in biomedical studies, which is exploratory and hypothesis generating in nature. To make population inferences, it is important to clearly define the population of interest and obtain a sample that is representative of the population. In order to obtain a representative sample, random sampling should be conducted, where a chance mechanism is used to select subjects. This helps prevent bias, which could result in over- or underestimated values (Ramsey & Schafer, 2012). It is also important to determine how many people should be included in the study in order to draw generalizations about the entire population. There are different types of population inferences that are often of interest. A point estimate is a single statistic that is calculated from the data to estimate the population parameter. For example, the point estimate for the population average of the verbal IQ among 10- to 12-year-old boys with autism spectrum disorder (ASD) would be the sample average. Frequently, instead of making inferences based on a single value, a confidence


interval is constructed that provides a range of plausible values for the population parameter with a certain level of confidence. For example, an interval could be obtained such that, with 95% confidence, the population average verbal IQ among 10- to 12-year-old boys with ASD lies between the lower bound and the upper bound of the interval. Finally, another common type of inference is to conduct a hypothesis test, which determines whether a population parameter is equal to some pre-specified or theoretical value. For example, a hypothesis test could be conducted to determine whether the difference in the population average verbal IQ between 10- to 12-year-old boys diagnosed and not diagnosed with ASD is equal to zero or not (i.e., is there a difference in the population averages between boys with and without ASD).
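As a brief illustration of a point estimate and a confidence interval, the following SciPy sketch computes a 95% t-based interval for a population mean from a simulated sample; the numbers are placeholders, not real verbal IQ data.

```python
# Point estimate and 95% confidence interval for a population mean (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=40)    # hypothetical sample of 40 scores

point_estimate = sample.mean()                     # sample mean estimates the population mean
sem = stats.sem(sample)                            # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1,
                                   loc=point_estimate, scale=sem)
print(f"point estimate = {point_estimate:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```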

5.2 Tools for determining an appropriate analysis

There are many types of statistical analyses that can be used to make statistical inferences. The appropriate statistical model should be determined by the study design and the questions of primary interest to the researcher. In this section, some basic statistical analysis methods are introduced, and a discussion is provided about how to select the appropriate tool. Asking a few questions about the variables (or features) being studied and how the data were collected can often lead to an appropriate method. Once the method is applied, further diagnostics can help evaluate whether the model assumptions hold or if an alternative analysis is needed. First, it is best to distinguish between a few different types of variables. The first distinction is between independent variables (IV) and dependent variables (DV). Independent or explanatory variables are variables that are thought to "explain" something in another variable. In an experiment, the IV is under the control of the researcher (e.g., randomly assigning patients to the different treatments being studied), whereas in an observational study, the IV is not under the experimenter's control but has to be observed (e.g., demographic information such as age, ethnicity, or education level). Causal inferences are more feasibly made from randomized experiments than observational studies, which have the drawback of potentially confounding variables. The presence of confounding variables may reveal an association between two variables that is actually driven by their relationship to a third variable not included in the analysis (the confounder) (Ramsey & Schafer, 2012). It is also important to know whether a variable is quantitative (Q) or categorical (C) when determining the appropriate statistical analysis. Categorical variables involve classifying individuals into groups or categories, whereas quantitative variables are numerical quantities. Both types of variables can be further categorized in a way that could affect the choice of statistical analysis. Quantitative variables can be discrete or continuous in nature. A discrete variable is one that takes on a finite or countably infinite number of values, whereas a continuous variable is one that takes on an uncountably infinite number of values (Devore, 2015). For example, the number of patients arriving at


an emergency room during a 1-hour period could be {0, 1, 2, ...}, which would be a discrete random variable since it takes on a countably infinite number of values. However, the weight of a person could be any real number greater than zero and would be a continuous random variable. For categorical variables, one primary distinction is whether or not the categories have any inherent ordering. Categorical variables with no inherent ordering are called nominal variables, and those with ordering are called ordinal variables. For example, ethnicity would be nominal, whereas a Likert scale rating would be ordinal. In addition to knowing what types of variables are being studied, it is also helpful to distinguish the number of variables being investigated. Univariate data involve only one variable (or an analysis is conducted one variable at a time). Bivariate data involve two variables, and multivariate data involve more than two variables. The following questions provide a starting point to help guide a researcher to an appropriate statistical analysis:

- What is the main research question(s)?
- What variables are being investigated to answer this question(s)?
- How many variables are there? What relationships are being investigated?
- What is the independent variable(s)? What is the dependent variable(s)?
- What type of variables (Q or C) are the IVs and DVs?

For the purposes of this chapter and introducing some common types of statistical methods that may be of use to biomedical researchers, these will be the main questions that are addressed. However, there are many other questions that may also help determine the type of analysis and conclusions that can be made. The following questions are just a few examples:

- Was there any random sampling or randomization into groups?
- Are data collected over time or via some other type of sequential ordering?
- Are there any variables for which individuals in the study are measured more than once?

Fig. 5.1 illustrates a flowchart of common types of statistical analyses that are selected based on the type of data. In this chart, only continuous quantitative variables and nominal categorical variables are considered. Specific details for some of these analyses used in clustering applications will be provided in later subsections of this chapter, but a description of these methods can be found in many statistical textbooks (Bremer & Doerge, 2009; Devore, 2015; Samuels & Witmer, 2015). Each of these analyses has a certain set of model assumptions that need to be checked before doing statistical inferences. The models may be robust to certain assumptions and not robust to others, so it is important to know how the analyses are affected if an assumption is not met. If assumptions are not met, a remedy may be needed or an alternative analysis applied. For


[Fig. 5.1 flowchart content, reproduced here in text form by type of data:]

Bivariate data:
- 1(Q) DV, 1(Q) IV: simple linear regression / correlation analysis
- 1(Q) DV, 1(C) IV, #C = 2: 2-sample t-test
- 1(Q) DV, 1(C) IV, #C > 2: one-way ANOVA
- 1(C) DV, 1(Q) IV, #C = 2: logistic regression
- 1(C) DV, 1(Q) IV, #C > 2: multinomial regression
- 1(C) DV, 1(C) IV: chi-square or Fisher's exact test

Multivariate data:
- 1(Q) DV, >1(Q) IV or mix of (Q/C) IVs: multiple linear regression or ANCOVA
- 1(Q) DV, >1(C) IV: multi-factor ANOVA
- >1(Q) DV, (Q) IVs or mix of (Q/C) IVs: multivariate multiple linear regression or MANOVA
- >1(Q) DV, (C) IVs: MANOVA
- 1(C) DV, >1(Q) IV or mix of (Q/C) IVs: linear discriminant analysis or logistic regression

FIG. 5.1 Statistical analysis flowchart for different parametric statistical methods.

example, most of the analyses listed in Fig. 5.1 are parametric, in that the response variables or error terms of the model are assumed to follow a specific probability distribution. However, there are nonparametric methods that do not require this distributional assumption and may be more appropriate if the parametric model assumptions are found not to hold. When the parametric assumptions do hold, nonparametric methods are often not as statistically powerful, but they are a good alternative when those assumptions fail. For a review of nonparametric statistical methods, see (Conover, 1999; Pett, 2015).

5.3 Statistical applications in cluster analysis

There are many ways that statistical analysis can be used to aid in clustering applications. For example, statistical methods could be used to compare the performances of internal cluster validation indices, which are summary metrics that quantify the separation and compactness of clusters generated from different clustering settings (Arbelaitz, Gurrutxaga, Muguerza, Pérez, & Perona, 2013). Statistical ideas also underlie many cross validation and data imputation methods (James, Witten, Hastie, & Tibshirani, 2013; van Buuren, 2018). In this chapter, the focus will be placed on one way statistics can be incorporated well into a cluster analysis workflow (Fig. 5.2). As described in Chapter 2, during data pre-processing and prior to clustering, there are often many features available in the data that may not all be needed. Correlation analysis can aid in identifying these redundant features for removal prior to performing clustering [Chapter 2]. After an appropriate clustering method is selected and performed for a particular application, it is important to evaluate the results and better understand the cluster composition. One way to investigate this is to understand the importance of different features in the final clustering results. Features can be analyzed individually to determine if there are statistically


FIG. 5.2 An example of statistical applications in the clustering workflow. See (Al-Jabery et al., 2016) for a detailed example of this workflow used with a specific type of clustering.

significant differences between clusters, or a multivariate analysis can be conducted, which incorporates all features into a single analysis. Further details are provided below for the goal of using statistical methods for cluster evaluation:

- Identify redundant features
- Enhance cluster evaluation
- Better understand feature importance
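As a minimal illustration of the correlation-based screening step mentioned above, the pandas sketch below flags features whose pairwise correlation exceeds an assumed 0.9 cutoff; the DataFrame contents and the threshold are placeholders chosen for illustration.

```python
# Flag highly correlated (redundant) features before clustering (illustrative data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"f1": rng.normal(size=100)})
df["f2"] = df["f1"] * 0.95 + rng.normal(scale=0.1, size=100)  # nearly redundant with f1
df["f3"] = rng.normal(size=100)

corr = df.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("candidate redundant features:", to_drop)   # e.g., ['f2']
```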

5.3.1 Cluster evaluation tools: analyzing individual features

One approach to cluster evaluation is to better understand the differences in cluster composition by investigating how individual features differ between clusters. This can help determine the importance of individual features and help subject matter experts better interpret the meaning of different clusters. To illustrate how the flowchart in Fig. 5.1 can be used to identify an appropriate analysis to accomplish this goal, consider the subset of analyses corresponding to bivariate data. In this case, the cluster label is the categorical independent variable, and an individual feature is the dependent variable. If the feature is quantitative and there are only two clusters [1(Q) DV, 1(C) IV, #C = 2], a two-sample t-test can be conducted to test for significant differences in the mean value of the feature between clusters. If the feature is quantitative and there are more than two clusters [1(Q) DV, 1(C) IV, #C > 2], a one-way analysis of variance (ANOVA) can be conducted to test for a significant difference in means among the clusters and determine which clusters have statistically different means for the feature. However, if the feature is categorical [1(C) DV, 1(C) IV], then a χ² test or a Fisher's Exact Test (FET) can be used to investigate whether there is an association between the feature and the cluster label. Each of these analyses is briefly introduced below along with several hypothesis testing concepts that are fundamental to all of the methods.


5.3.1.1 Hypothesis testing and the 2-sample t-test

Consider the case when there are k = 2 clusters (i.e., the number of categories #C = 2). Individuals in the two clusters are thought to be samples drawn from two different populations. It is of interest to test whether the population mean of a quantitative feature differs significantly between clusters. A two-sample t-test (Ramsey & Schafer, 2012; Rosner, 2015) can be employed to accomplish this goal.

Defining the Hypotheses. A hypothesis test will be conducted that will result in a decision between two competing hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha). The null hypothesis is usually designed to represent the status quo or to signify that nothing of interest is happening. The alternative hypothesis, however, typically represents the researcher's theory that something noteworthy and different from the status quo is occurring. For the 2-sample t-test, the null and alternative hypotheses are:

$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_a: \mu_1 \neq \mu_2,$$

where $\mu_i$ represents the population mean of the feature in cluster $i$ for $i = 1, 2$. Under the alternative hypothesis, the population mean of the feature differs between the clusters, indicating that the feature is useful for providing insight into the differences in cluster composition. One of the main ideas behind hypothesis tests is to assume the status quo (H0) unless there is enough evidence in the data collected to indicate that H0 is untrue and thus Ha should be concluded. Since the data are obtained from samples of the population, it is very rare to observe the means being exactly equal in the sample. So, it is necessary to determine how different these sample means can be, while considering variation in the data, and still be compatible with the null hypothesis. If the sample means are different enough while accounting for the inherent variation, it would suggest the null hypothesis may be untrue and would provide evidence for the alternative. Making a decision between these hypotheses requires setting a criterion that can be compared to a threshold and tied to a probability model, so that uncertainty can be quantified (Ramsey & Schafer, 2012).

Test Statistic. To accomplish this, data are collected by obtaining samples from the underlying populations, and a summary statistic, called a test statistic, is calculated. The test statistic is a single value calculated from the data designed to gauge the plausibility of the alternative compared to the null hypothesis. For the 2-sample t-test, the test statistic is as follows:

$$t^*_{uv} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$

where $\bar{x}_i$, $s_i^2$, and $n_i$ are the feature sample mean, sample variance, and sample size of cluster $i = 1, 2$, respectively. This test statistic is often referred to as Welch's t-test, and it is designed for cases with two populations that may have different variances. There is an


alternative version of the test statistic that is designed for when the two population variances are equal, but many researchers have recommended Welch's version since the equal variance assumption can be difficult to assess and inference issues can arise when it is not met (Delacre, Lakens, & Leys, 2017; Ruxton, 2006).

Defining a Decision Rule. Since the test statistic is a random variable whose value will likely change if a different sample is collected, the distribution of the test statistic, called the sampling distribution, can be determined. Specifically, when the null hypothesis is true, the sampling distribution of the test statistic can be specified exactly or approximately so that it is known how the test statistic will behave under H0. The sampling distribution of the test statistic when H0 is true can be used to establish a threshold for whether to reject H0 or not using either the rejection region approach or by calculating a p-value. The rejection region approach involves establishing a threshold (called a critical value) for the test statistic that would lead to a decision to reject H0 or not. The critical value is based on the sampling distribution of the test statistic under the null distribution. If the test statistic value is located in one of the tails of the null distribution (far away from where the distribution is centered), then it is unlikely that the data were obtained under the assumption that the null hypothesis is true. Therefore, H0 can be rejected. The critical value establishes the threshold for how far away the test statistic should be from where the sampling distribution under the null is centered in order to reject H0. The decision to reject H0 or not can equivalently be made by calculating a p-value and setting a threshold for it. The p-value for a hypothesis test is the probability that the test statistic is at least as extreme (contradictory to H0) as the one observed, given the null hypothesis is true. It is a measure of the compatibility between the data and the null hypothesis, with smaller p-values indicating greater evidence against H0. Since it is a probability, it will be a value between 0 and 1. The closer the p-value is to zero, the less likely it is to obtain the observed results just due to chance alone when the null is actually true. Making the testing decision based on the p-value requires establishing a threshold called the significance level (usually denoted as α) such that p-values smaller than α lead to rejecting H0 and concluding evidence for the alternative. For p-values greater than or equal to α, H0 is not rejected. The critical value and significance level can be chosen such that both the rejection region and p-value approaches lead to the same decision. The advantage of the p-value approach is that it provides quantitative information about the strength of evidence against the null as compared to the rejection region approach, which just yields a yes or no decision of whether to reject H0 or not. For the 2-sample t-test, the sampling distributions and rejection rules are given in Table 5.1.

Errors in Testing and Statistical Power. Since the data represent a sample from two larger populations, the underlying truth about which of the two hypotheses is true is unknown. Thus, the decision that is made could be correct or incorrect. There are two possible errors that can be made in a hypothesis test. The first is called a type I error, or a


Table 5.1 Sampling distribution and rejection rules for the 2-sample t-test. Note that $t(df)$ represents a t-distribution with $df$ degrees of freedom and $t_p(df)$ is the $p$th percentile of the distribution.

Type of 2-sample t-test: Unequal variance (Welch's t-test).

Sampling distribution under H0:
$$t^*_{uv} \sim t(df), \quad \text{where } df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$

Critical value and rejection rule: Reject H0 if $|t^*_{uv}| > t_{1-\alpha/2}(df)$; else fail to reject H0.

p-value and rejection rule: Reject H0 if p-value $= P(|T| \geq |t^*_{uv}|) < \alpha$, where $T \sim t(df)$; else fail to reject H0.

false positive. This error occurs when the null hypothesis is rejected even though it is true. It is called a false positive, since the researcher has claimed something beyond the status quo was occurring (a "positive") even though it was not. In other words, it is claiming an effect when there is none. For the clustering scenario, this would mean saying the mean values of a particular feature differ between clusters when they really do not in the larger population of individuals. The probability of a false positive is bounded above by the significance level α, which is typically why α is set to be a small value such as 0.01, 0.05, or 0.10. The second type of error is called a type II error, or a false negative. This error occurs when the null hypothesis is not rejected even though it is false. It is called a false negative since it would mean that an existing effect is overlooked. For the clustering scenario, this would indicate the mean values of a particular feature differ between clusters, but the test indicated there was no difference. The probability of a false negative is defined as the quantity β. There are also two ways a correct decision could be made, and it is sometimes useful to think about the probabilities associated with the correct decisions. The probability of correctly rejecting a null hypothesis (true positive) is called the power and is calculated as 1 - β. The probability of correctly failing to reject a true null hypothesis (true negative) is called the confidence level and is calculated as 1 - α. Table 5.2 illustrates the different types of outcomes in a hypothesis test. While it would be ideal to keep both α and β small, they are inversely related. In fact, there are four components that may be specified by the experimenter that are all related: the sample size (number of individuals per group), a practical effect size (magnitude of the study effect relative to the variation that would be practically meaningful), the significance level, and the power. If three of these components are specified, the fourth can be calculated. If planning can be done in advance, the sample size needed to achieve a low false positive and false negative rate for a meaningful effect size can be calculated. For cluster evaluation, this is not usually feasible, since the sample size of each cluster is

Table 5.2 Different outcomes from a hypothesis test.

                       H0 true                                      H0 false
Reject H0              Type I error (false positive); α             Correct decision (true positive); power = 1 - β
Fail to reject H0      Correct decision (true negative); 1 - α      Type II error (false negative); β

unknown beforehand. However, the overall sample size could be controlled. Other texts provide a more detailed discussion on sample size and power calculations (e.g., Ryan, 2013). Assumptions. The 2-sample t-test has a set of assumptions that are important to understand and check. First, it is assumed that the individuals in the sample are independent both within a group and between groups. This means that the value of the response variable for one individual does not depend on and is not correlated with the response value of any other individual in the study. It is very important that this assumption is met. Otherwise, the stated type I error rate that is set by the significance level may not be accurate (Lissitz & Chardos, 1975). However, independence is difficult to check graphically or with a test. Rather, thought must be given to how the data were collected to determine whether there may be an inherent dependency in between individual data points. For example, if multiple data points are collected on the same individuals at different time points, these data points would not be independent. The second assumption is that the data are random samples from two normally distributed populations. Normality can be checked by creating histograms or normal probability plots of the quantitative response variable within each of the two groups. The test is more robust to departures from normality, especially for large sample sizes. Robust means that certain properties of the test, such as the stated type I error rate, are still reasonably accurate even if the assumption is not met. Note that the t-test is sensitive to outliers. If influential outliers are present, a nonparametric test such as the Mann-Whitney test (Mann & Whitney, 1947) may be more useful since it is more resistant to being heavily affected by outliers. While the Mann-Whitney test does not require normality, it still relies on the independence assumption. Thus, a more sophisticated analysis is needed for either the parametric or nonparametric approaches when independence is not met.
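A hedged SciPy sketch of the 2-sample comparison described in this subsection is shown below; the two cluster feature vectors are simulated placeholders, and the Mann-Whitney test is included as the nonparametric alternative mentioned above.

```python
# Welch's t-test and Mann-Whitney test for one feature across two clusters (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cluster1 = rng.normal(loc=0.0, scale=1.0, size=40)   # feature values in cluster 1
cluster2 = rng.normal(loc=0.6, scale=1.5, size=55)   # feature values in cluster 2

# Welch's t-test (unequal variances), matching the test statistic above.
t_stat, p_val = stats.ttest_ind(cluster1, cluster2, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p-value = {p_val:.4f}")

# Nonparametric alternative when normality is doubtful or outliers are influential.
u_stat, p_mw = stats.mannwhitneyu(cluster1, cluster2, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p-value = {p_mw:.4f}")
```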

5.3.1.2 Summary of hypothesis testing steps and application to clustering

In summary, the basic steps for conducting a hypothesis test are as follows:

1. Formulate the question of interest and translate it into null (H0) and alternative (Ha) hypotheses.
2. Decide on a data collection protocol that will address the question of interest. This includes determining an appropriate statistical test, setting the significance level α, and doing a sample size calculation to also control β, if possible.
3. Once data have been collected, check the assumptions.
4. Calculate the test statistic.
5. Make the testing decision (reject H0 or not) using either the rejection region or p-value approach.
6. Write conclusions for publication in the context of the question of interest (including reporting the p-value).

In conclusion, the 2-sample t-test can be used to test whether or not there is a statistically significant difference in the true means of an individual feature between two clusters. Features that differ significantly between clusters indicate that they may be an important factor in helping distinguish the groups and can be examined for clinical relevance in a biomedical setting. However, if there are many features and multiple 2-sample t-tests are conducted, it will be important to control the familywise false positive rate across the set of tests. While not discussed here, there are many methods that can be used to do this, such as Bonferroni (Bland & Altman, 1995) and the False Discovery Rate approach (Benjamini & Hochberg, 1995). More details about methods that enable controlling the overall false positive rate can be found in (Bender & Lange, 2001; Rice, Schork, & Rao, 2008).
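As a brief illustration of the multiple-testing corrections just cited, the statsmodels sketch below adjusts a set of placeholder p-values with the Benjamini-Hochberg false discovery rate procedure; the Bonferroni correction is available through the same function.

```python
# Adjust per-feature p-values for multiple testing (placeholder p-values).
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.020, 0.041, 0.300, 0.750]   # one p-value per feature
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  significant: {sig}")
# method="bonferroni" gives the more conservative familywise correction.
```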

5.3.1.3 One-way ANOVA

Consider the case where there are k ≥ 3 clusters (i.e., the number of categories #C ≥ 3). It is important to test whether the population mean of a quantitative feature differs significantly between clusters and, if so, which clusters have significantly different means. A one-way analysis of variance (ANOVA) can be used to test these questions (Kutner, Nachtsheim, Neter, & Li, 2004; Ramsey & Schafer, 2012; Samuels & Witmer, 2015). The idea behind the one-way ANOVA is that it compares variation between groups to variation within groups (Fig. 5.3). If "enough" variation is occurring between groups relative to the within-group variation, then there is a difference somewhere among the k population means. Determining how much is "enough" involves establishing the test statistic and decision rule as described below.

ANOVA Model and Assumptions. Suppose there are k populations with means $\mu_1, \mu_2, \ldots, \mu_k$. Consider the model:

$$y_{ij} = \mu_i + \varepsilon_{ij},$$

where $y_{ij}$ is the value of the feature for the $j$th individual within the $i$th cluster. The model breaks these values of $y_{ij}$ into two components: the population mean of the $i$th cluster, $\mu_i$, and an error term $\varepsilon_{ij}$ that represents random deviations of individuals from the cluster mean. The assumptions of the model are that individuals in the study are independent and that individuals within each group represent a random sample from a normally distributed population with mean $\mu_i$ and variance $\sigma^2$. This inherently implies that the population variances are the same for all clusters. Thus, the main assumptions are


FIG. 5.3 The y-axis represents a quantitative feature plotted against the cluster number (k = 3). The red boxes represent the cluster means. ANOVA testing utilizes the ratio of variation between the cluster means to within-cluster variation.

(1) independence, (2) normality within groups, and (3) constant variance. The Kruskal-Wallis test (Kruskal & Wallis, 1952) can be used as a nonparametric alternative analysis.

Global Test. There are often two types of hypothesis tests that are of interest for a one-way ANOVA, and these tests are typically performed sequentially. The first test performed is called a global test, which determines whether there is a difference anywhere in the means among a set of k populations. The hypotheses for the global test are:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \quad \text{versus} \quad H_a: \text{not all of the } \mu_i\text{'s are equal},$$

where $\mu_i$ represents the population mean of the feature in cluster $i$ for $i = 1, 2, \ldots, k$. The test statistic is called an F-test and is as follows:

$$F^* = \frac{\text{between-group variation}}{\text{within-group variation}} = \frac{\dfrac{1}{k-1}\displaystyle\sum_{i=1}^{k} n_i\left(\bar{y}_i - \bar{y}\right)^2}{\dfrac{1}{n-k}\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2},$$

where $y_{ij}$ is the value of the feature for the $j$th individual within the $i$th cluster, $\bar{y}_i$ is the sample mean of cluster $i$, $\bar{y}$ is the overall mean of the feature across all individuals, $n_i$ is the sample size of cluster $i$, and $n$ is the total number of individuals in the study. The numerator of the test statistic represents the between-group variation of individual cluster means from the overall mean. The numerator is also often referred to as the mean square between groups, MS(between), since it is calculated by dividing the between-group sum of squares, SS(between), by its degrees of freedom (k - 1). That is:

$$MS(\text{between}) = \frac{SS(\text{between})}{df(\text{between})} = \frac{\displaystyle\sum_{i=1}^{k} n_i\left(\bar{y}_i - \bar{y}\right)^2}{k-1}.$$


The denominator of the test statistic represents the within-group variation of individual values from the group mean. Similarly, it is often referred to as the mean square within groups, MS(within), or more simply the mean square error (MSE), since it represents the variation that is unexplained by group differences. It is calculated by dividing the within-group error sum of squares, SS(within) or SSE, by its degrees of freedom (n - k). That is:

$$MSE = \frac{SSE}{df_E} = \frac{\displaystyle\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2}{n-k}.$$

If the between-group variation is larger than the within-group variation, the F-test statistic will be large, and this will indicate that a difference exists among the group means (Ha). The sampling distribution of $F^*$ when the null is true is an F-distribution with $k-1$ numerator and $n-k$ denominator degrees of freedom, that is, $F(k-1, n-k)$. The p-value, critical value, and rejection rule are as follows:

$$\text{p-value} = P(F > F^*) \text{ where } F \sim F(k-1, n-k); \quad \text{reject } H_0 \text{ if } F^* > F_{1-\alpha}(k-1, n-k) \text{ or p-value} < \alpha,$$

where $F_{1-\alpha}(k-1, n-k)$ is the $1-\alpha$ percentile of the F-distribution.

Tukey Pairwise Comparisons. When the global test indicates there is a difference among the cluster means, it is sometimes useful to determine which clusters have statistically different means. This involves conducting a hypothesis test for a difference in means between all $k(k-1)/2$ pairs of clusters. There are several different methods that can be used to compare group means or some combination of them (Bender & Lange, 2001; Kutner et al., 2004), but Tukey's method is preferred when it is of interest to conduct all of the pairwise comparisons. Tukey's method (Braun, 1994; Kutner et al., 2004) is a multiple testing method that controls the false positive rate across all the tests. The method is referred to as the Tukey-Kramer method (Braun, 1994; Kramer, 1956) when the number of individuals in each group (cluster) is not the same (unbalanced). In addition to controlling the familywise error rate, the Tukey procedure differs from conducting individual t-tests between the groups since it utilizes the MSE as a pooled variance that estimates the common variance across the k populations. The testing procedure is given in Table 5.3.

Table 5.3 Testing procedure for Tukey's pairwise comparisons method, where $q(k, n-k)$ represents the studentized range distribution with $k$ and $n-k$ degrees of freedom and $q_{1-\alpha}(k, n-k)$ is the $1-\alpha$ percentile of the distribution.

Hypotheses: $H_0: \mu_i = \mu_j$ vs. $H_a: \mu_i \neq \mu_j$, where $i \neq j$.

Test statistic and sampling distribution under H0:
$$q^* = \frac{\sqrt{2}\left(\bar{y}_i - \bar{y}_j\right)}{\sqrt{MSE\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}} \sim q(k, n-k).$$

Critical value and rejection rule: Reject H0 if $|q^*| > q_{1-\alpha}(k, n-k)$; else fail to reject H0.
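The one-way ANOVA and Tukey-Kramer comparisons above can be carried out with SciPy and statsmodels, as in the sketch below; the three cluster feature vectors are simulated placeholders.

```python
# Global F-test and Tukey-Kramer pairwise comparisons for k = 3 clusters (simulated data).
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)
c1 = rng.normal(0.2, 0.2, size=30)
c2 = rng.normal(0.5, 0.2, size=35)
c3 = rng.normal(0.6, 0.2, size=50)

# Global F-test for any difference among the k = 3 cluster means.
f_stat, p_val = stats.f_oneway(c1, c2, c3)
print(f"F = {f_stat:.2f}, p-value = {p_val:.4g}")

# Tukey-Kramer pairwise comparisons (unequal cluster sizes are handled).
values = np.concatenate([c1, c2, c3])
labels = np.repeat(["cluster1", "cluster2", "cluster3"], [len(c1), len(c2), len(c3)])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```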


5.3.1.4 χ² test for independence

Thus far only quantitative features have been considered, but some features may be categorical in nature. For categorical features, a χ² test of independence (Bremer & Doerge, 2009; Conover, 1999; Samuels & Witmer, 2015) can be conducted to determine whether or not there is an association between the feature categories and the cluster labels. Data are typically organized into a contingency table where the rows and columns represent different categories for the two categorical variables. Counts for the number of occurrences of each category combination are given inside the table, with totals on the final row and column. In clustering applications, the rows of the contingency table would be the different clusters and the columns would be the different values of the categorical feature. Table 5.4 illustrates a contingency table for a data set with k = 3 clusters and a categorical feature with 3 groups. Data in a contingency table can be visualized with a mosaic plot (Fig. 5.4). The width of the bars is relative to the size of each cluster, so it is easy to see that cluster 3 is the largest while clusters 1 and 2 are similar in size. The height of the bars represents the percentage of observations that fall in the three groups within each cluster (i.e., row percentages). For the data in Table 5.4, cluster 1 has a high percentage of observations in group 1, whereas clusters 2 and 3 have a high percentage in group 3.

Table 5.4 Example of a contingency table for k = 3 clusters and a categorical feature with 3 groups.

                Group 1    Group 2    Group 3    Total
Cluster 1         28         13          2         43
Cluster 2          1          2         46         49
Cluster 3         15         24         77        116
Total             44         39        125        208

FIG. 5.4 Mosaic plot of the data from the contingency table (x-axis: cluster; bar heights: within-cluster group proportions).


Table 5.5 Testing procedure for the χ² test of independence, where r = number of rows, c = number of columns, χ²((r−1)(c−1)) represents the χ² distribution with (r−1)(c−1) degrees of freedom, and χ²_{1−α}((r−1)(c−1)) is the (1−α) percentile of the distribution.

Hypotheses: H0: the two categorical variables are independent (no association) versus Ha: the two categorical variables are not independent (there is an association)

Test statistic and sampling distribution under H0:
$$\chi^{2} = \sum \frac{(\text{observed count} - \text{expected count})^{2}}{\text{expected count}} \sim \chi^{2}\big((r-1)(c-1)\big), \quad \text{where expected count} = \frac{\text{row total} \times \text{column total}}{\text{overall total}}$$

Critical value and rejection rule: Reject H0 if χ² > χ²_{1−α}((r−1)(c−1)); else fail to reject H0.

The χ² test of independence offers a formal way to test whether there is a statistically significant association between the cluster label and the feature categories. The testing procedure is given in Table 5.5. The test statistic is calculated by finding the expected count under independence for each "cell" in the contingency table. There are a total of r × c "cells" or combinations of categories, where r is the number of row categories and c is the number of column categories. Within each cell, the difference between the expected and observed counts is squared and then scaled by the expected count. These values are added together across all cells to obtain the test statistic value. Large values of the test statistic are indicative of a deviation from independence, and the formal rejection rule to establish significance is given in Table 5.5. The p-value can be calculated using a χ² distribution. However, it should be noted that the use of the χ² distribution is an approximation based on large sample theory (Conover, 1999). For small samples, an alternative test called Fisher's Exact Test (FET) (Bremer & Doerge, 2009; Conover, 1999; Samuels & Witmer, 2015) can be used, which utilizes the exact distribution. The rule of thumb is to use the χ² test when less than 20% of the expected cell counts are less than 5; otherwise the FET is more appropriate (Kim, 2017).
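As a hedged illustration (not part of the original analysis), the test can be reproduced in Python with scipy on the counts in Table 5.4; the variable names below are placeholders.

```python
# Hedged sketch: chi-square test of independence for the Table 5.4 counts.
import numpy as np
from scipy.stats import chi2_contingency

# Rows are clusters 1-3, columns are groups 1-3.
table = np.array([[28, 13,  2],
                  [ 1,  2, 46],
                  [15, 24, 77]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.3g}")

# Rule of thumb from the text: if more than 20% of expected counts are below 5,
# an exact test such as Fisher's exact test is more appropriate.
print(f"{(expected < 5).mean():.0%} of expected cell counts are below 5")
```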

5.3.2 Cluster evaluation tools: multivariate analysis of features

The previous section described ways that different types of traditional statistical analyses can be used to evaluate differences in individual features between clusters. This can help subject matter experts better understand the feature characteristics of different clusters. However, cluster evaluation can be enhanced by utilizing a multivariate analysis that includes all the features, rather than considering them separately. The details of these multivariate analysis methods are not covered here, but some examples of specific cluster evaluation questions that could be addressed with multivariate analysis are described briefly. For more information about multivariate statistical analyses see (Johnson & Wichern, 2008).


One question that may be of interest is to determine which features best discriminate the clusters. To answer this question, the relationship between a categorical dependent variable (cluster membership) with more than one quantitative independent variable (feature) should be investigated. A descriptive linear discriminant analysis (LDA) is one option (Fig. 5.1) for accomplishing this goal. LDA finds linear combinations of features that maximize group separation, called canonical discriminant functions. The importance of individual features in distinguishing clusters can be evaluated by investigating how strongly each feature correlates with the canonical discriminant functions. This can aid subject-matter researchers in better understanding which features are most important in cluster separation. Another question that could be addressed with multivariate statistical analyses is whether there is a significant difference in the feature means among the C clusters after accounting for a covariate feature not included in the clustering (e.g., age). An analysis of covariance (ANCOVA) could be used to address this question (feature is Q DV, Cluster is C IV, covariate is Q IV). This could be important if there were a particular variable (covariate) that is not of direct interest for clustering but could be related to the features used in clustering. This type of analysis would allow the researcher to test for differences in the feature means between clusters after adjusting for the covariate. There are many other cluster evaluation questions that could potentially be addressed with multivariate statistical analyses, but it is important for the subject-matter expert to be involved in framing these questions according to their research goals. It is also important to keep in mind that additional assumptions are often needed for multivariate analyses, and these should be verified for their appropriateness. Although this chapter does not focus on the details of multivariate analyses, this section illustrates some potential applications in cluster evaluation.
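As a brief, hedged sketch of the first idea (a descriptive LDA used for cluster evaluation), the snippet below uses scikit-learn on synthetic placeholder data; the matrix `X` and the vector `labels` stand in for the clustered features and cluster assignments, and are not from the examples in this chapter.

```python
# Hedged sketch: descriptive linear discriminant analysis for cluster evaluation.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data: 200 observations, 5 features, 3 "clusters".
X, labels = make_blobs(n_samples=200, centers=3, n_features=5, random_state=1)

lda = LinearDiscriminantAnalysis(n_components=2)
scores = lda.fit_transform(X, labels)   # canonical discriminant scores

# Correlate each feature with the first discriminant function to gauge which
# features contribute most to cluster separation (structure coefficients).
for j in range(X.shape[1]):
    r = np.corrcoef(X[:, j], scores[:, 0])[0, 1]
    print(f"feature {j}: correlation with LD1 = {r:+.2f}")
```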

5.4 Software tools and examples

5.4.1 Statistical software tools

There are many different statistical software packages available for implementing statistical methods. Commercial software includes SAS, JMP, SPSS, Stata, and Minitab, among others. R is a popular open-source, open-development language, and Bioconductor contains many statistical methods for bioinformatics applications that are implemented in R. The advantages and disadvantages of these different packages will not be discussed here. For the purposes of illustrating how the analyses described above can be used to evaluate differences in features between clusters, JMP software (JMP®, Version 13, SAS Institute Inc.) will be used in the example below. Some advantages of JMP are that it is a powerful data exploration tool and has many useful visualization features. JMP has a GUI interface, but scripting is also an option. Some limitations of JMP are that it is commercial and not open source; thus, there is a cost to use it, and it is not designed so researchers can easily implement new methods.


5.4.1.1 Example: clustering autism spectrum disorder phenotypes

Autism spectrum disorder (ASD) is a complex disease characterized by high variation in many phenotypes including behavioral, clinical, physiologic, and pathological factors (Georgiades, Szatmari, & Boyle, 2013). Cluster analysis is a useful approach for sorting out the phenotypic heterogeneity with the goal of identifying clinically relevant subgroups that may benefit from different diagnoses and treatments (Al-Jabery et al., 2016). In this example, data were obtained from 208 ASD patients through the Simons Simplex Collection (Fiscbach & Lord, 2010) site at the University of Missouri–Thompson Center for Autism and Neurodevelopmental Disorders. Simplex indicates that only one child (called a proband) is on the ASD spectrum, and neither the biological parents nor the siblings have the disease. A set of 27 phenotypic features (25 quantitative and 2 categorical) are available that provide information about different characteristics including ASD-specific symptoms, cognitive and adaptive functioning, language and communication skills, and behavioral problems. Table 5.6 provides a list of these variables with a brief definition.

Table 5.6 List of variable labels, definitions, and types for the ASD data example. Q = quantitative, C = categorical. Variables with a –D indicate the variable was discarded prior to clustering.

Label   Definition                                                                  Type
Var1    Overall verbal IQ                                                           Q
Var2    Overall nonverbal IQ                                                        Q
Var3    Full scale IQ                                                               Q
Var4    Module of ADOS administered                                                 C
Var5    ADI-R B nonverbal communication total                                       Q
Var6    ADOS communication social interaction total                                 Q
Var7    ADI-R A total abnormalities in reciprocal social interaction                Q
Var8    ADOS social affect total                                                    Q
Var9    ADI-R C total restricted repetitive & stereotypical patterns of behavior    Q
Var10   ADOS restricted and repetitive behavior (RBB) total                         Q
Var11   Repetitive behavior scale-revised (RBS-R) overall score                     Q
Var12   Aberrant behavior checklist (ABC) total score                               Q
Var13   Regression                                                                  C
Var14   Vineland II composite standard score                                        Q
Var15   Vineland II daily living skills standard score                              Q
Var16   Vineland II communication standard score                                    Q
Var17   Peabody picture vocabulary test (PPVT4A) standard score                     Q
Var18   Social responsiveness scale (SRS) parent-awareness raw score                Q
Var19   SRS parent – Cognition raw score                                            Q
Var20   SRS parent – Communication raw score                                        Q
Var21   SRS parent – Mannerisms raw score                                           Q
Var22   SRS parent – Motivation raw score                                           Q
Var23   SRS parent total raw score                                                  Q
Var24   Vineland II socialization standard score                                    Q
Var25   RBS-R subscale V sameness behavior                                          Q
Var26   Child behavior checklist (CBCL) internalizing problems total                Q
Var27   CBCL externalizing problems total                                           Q


Al-Jabery et al. (2016) applied a novel ensemble subspace clustering model to identify meaningful subgroups and aid in better understanding the phenotypic complexity present in ASD. In the following example, the statistical methods described previously are used to evaluate feature importance and understand the clinical relevance of clusters identified using the clustering method applied in (Al-Jabery et al., 2016).

5.4.1.2 Correlation analysis

As discussed previously [Chapter 2], correlation analysis can be utilized to check the strength of relationships between pairs of quantitative variables. This enables the detection of highly correlated features that can be selected for removal prior to clustering since they contain redundant information. For example, taking the first 5 of the quantitative phenotypic features in the ASD dataset for illustration purposes, the Pearson (Fig. 5.5B) and Spearman (Fig. 5.5C) correlation values can be obtained as well as a visualization of the pairwise relationships in a scatterplot matrix (Fig. 5.5A). Note that Var4 is categorical and is not included in this analysis. The following JMP menu commands are used to generate these results.

JMP Commands: Correlation Analysis
• Pearson pairwise correlations and scatterplot matrix: Analyze >> Multivariate Methods >> Multivariate >> Y-columns [enter Q variables]
The following optional analyses can be obtained by selecting further options within the result output of the previous command.
• Spearman pairwise correlations: Multivariate >> Nonparametric correlations >> Spearman's ρ
• Adding histograms to the scatterplot matrix: Scatterplot Matrix >> Show histogram >> Horizontal

The scatterplot matrix provides a visualization for each pair of quantitative variables. The plots above and below the diagonal are identical, with the axes switched. Thus one only needs to look at half of the plots (e.g., upper diagonal). In Fig. 5.5A, it is apparent that Vars1-3 have strong linear pairwise relationships. All three of these variables are related to IQ (see Table 5.6). The histogram for each of the quantitative variables is also provided as an additional option to visualize the shape of the variable distributions and reveal obvious outliers (none are seen here).


FIG. 5.5 JMP correlation results for five variables in the ASD dataset. (A) Scatterplot matrix with histograms of individual variables on the diagonal. (B) Pearson correlation results. (C) Spearman correlation results.


The Pearson correlation results (Fig. 5.5B) are given in the same format as the scatterplot matrix. The diagonal values are all 1 since they represent the correlation of a variable with itself, and only half of the values (e.g., the upper diagonal) are needed. For example, the Pearson correlation between the verbal (Var1) and nonverbal IQ (Var2) is 0.8566, and the visualization of that relationship can be seen in the same position of the scatterplot matrix. Note that the values are colored according to their strength, with dark blue being high positive correlations and dark red being high negative correlations. The Spearman correlation results (Fig. 5.5C) are presented in a different way. Each pairwise correlation is listed in a set of two columns that include all possible pairwise combinations. The correlation is given along with a p-value for testing whether the Spearman correlation is zero or not and a bar graph that represents the correlation value. The Spearman correlation for the verbal (Var1) and nonverbal IQ (Var2) is 0.8485, which is significantly different from zero. It can be seen from the bar chart that the IQ variables have the highest Spearman correlations, which aligns with the Pearson results and the scatterplot matrix. Note that all $\binom{25}{2} = 300$ pairwise correlations can be calculated for the 25 quantitative variables and utilized as one method for removing redundant variables prior to cluster analysis.
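A minimal, hedged sketch of this redundancy screen in Python follows; the data frame below contains simulated placeholder values (not the ASD data), with the first three columns constructed to be strongly correlated in the spirit of the IQ variables.

```python
# Hedged sketch: pairwise Pearson/Spearman correlations and flagging of highly
# correlated feature pairs as removal candidates before clustering.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=208)
df = pd.DataFrame({
    "Var1": base + rng.normal(scale=0.3, size=208),
    "Var2": base + rng.normal(scale=0.3, size=208),
    "Var3": base + rng.normal(scale=0.3, size=208),
    "Var5": rng.normal(size=208),
    "Var6": rng.normal(size=208),
})

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Keep only the upper triangle and flag pairs above a chosen threshold.
threshold = 0.8
upper = np.triu(np.ones(pearson.shape, dtype=bool), k=1)
high = pearson.where(upper).stack()
print(high[high.abs() > threshold])
```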

5.4.1.3 Cluster evaluation of individual features

The subspace clustering approach described in (Al-Jabery et al., 2016) is applied to the ASD data, and the top three clustering configurations selected by majority voting of three validation indices (Davies-Bouldin, Silhouette, and Calinski-Harabasz) are further evaluated to better understand how variables differ between clusters. The subspace clustering method has a uni-dimensional clustering step that offers an alternative way of removing non-discriminating features rather than checking redundancy through the pairwise correlations. A total of 9 variables were removed as part of this phase of the method, and these are noted in Table 5.6. Note that JMP also has the ability to perform some types of clustering (e.g., k-means and hierarchical) through the menu options Analyze > Clustering, but those are not explored here since the focus in this chapter is on using statistical methods for cluster evaluation rather than cluster methodology.

Two cluster results. One of the top three clustering results identified two clusters. To analyze the individual features from these results, consider choosing the appropriate analysis based on Fig. 5.1. The independent variable will be the cluster identifier, which will be categorical with C = 2 categories. The dependent variable will be one of the variables listed in Table 5.6. For the 16 quantitative dependent variables, the appropriate analysis is the 2-sample t-test. For the 2 categorical dependent variables, a Chi-square or Fisher's exact test should be performed.


2-sample t-test. As an illustration, consider quantitative Var6, the ADOS Communication Social Interaction Total. The following JMP menu commands are used to generate the t-test results.

JMP Commands: 2-sample t-test
• Two-sample t-test: Analyze >> Fit Y by X >> Y-Response [enter Q variable], X-Factor [enter Cluster ID variable]
Select further options within the result output of the previous command.
• Means and Std Dev
• t-Test (Welch's t-test, which assumes unequal variances)
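The same Welch test can also be scripted; the sketch below uses scipy with simulated placeholder data (the real Var6 values are not reproduced here).

```python
# Hedged sketch: Welch's two-sample t-test for a feature between two clusters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
var6 = np.concatenate([rng.normal(0.558, 0.197, 189),
                       rng.normal(0.853, 0.147, 19)])
cluster = np.array([1] * 189 + [2] * 19)

g1, g2 = var6[cluster == 1], var6[cluster == 2]
t_stat, p_value = stats.ttest_ind(g2, g1, equal_var=False)  # Welch's test
print(f"difference = {g2.mean() - g1.mean():.3f}, t = {t_stat:.2f}, p = {p_value:.2g}")
```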

First, it is helpful to look at the means and standard deviations for each cluster (Fig. 5.6A). The results give the sample sizes for the two clusters (n1 = 189, n2 = 19) as well as the means (M1 = 0.5583, M2 = 0.8535), standard deviations (SD1 = 0.1971, SD2 = 0.1469), standard errors of the mean (SEM1 = 0.0143, SEM2 = 0.0337), and 95% confidence intervals for the true cluster means (CI for cluster 1 mean [0.5300, 0.5866], CI for cluster 2 mean [0.7827, 0.9243]). The Welch's t-test results are given in Fig. 5.6B. In the left column of Fig. 5.6B, notice that the "Difference" indicates the difference in sample means between cluster 2 and cluster 1. The standard error and a 95% confidence interval for the true difference are also given. Notice that zero is not in this interval indicating there is a significant

[JMP output shown in Fig. 5.6: (A) means and standard deviations by cluster; (B) Welch's t-test (assuming unequal variances) for the difference between clusters 2 and 1.]

8.059465 25.02574 > Y-Response [enter C variable]. X-Factor [enter Cluster ID variable]. Select further options within the result output of the previous command.  Exact Test > Fisher’s Exact Test

n n n

The contingency table and mosaic plot of the data are given in Fig. 5.7A. It can be seen that within each cluster, there are similar proportions of individuals that are administered the ADOS modules labeled “0” and “0.5”. However, cluster 1 has a high proportion of individuals administered the ADOS module labeled “1”; whereas that proportion is small in cluster 2. The test results are given in Fig. 5.7B. Summary information is provided in the first row, such as the overall sample size (N ¼ 208), the degrees of freedom (2e1)x (3e1) ¼ 2, negative log likelihood, and R2. Below the summary information, two different methods (Likelihood Ratio, Pearson) for calculating the c2 test statistic and p-value (Prob > ChiSq) are given. The method described previously corresponds to the Pearson method, which has a test statistic of 21.425 and a p-value of ChiSq > Y-Response [enter Q variable]. X-Factor [enter Cluster ID variable]. Select further options within the result output of the previous command.  Means/Anova  Compare Means » All Pairs, Tukey HSD

n n n

First, the sample sizes and means for each cluster are given in Fig. 5.8A, along with 95% confidence intervals for the true mean. Observe that cluster 1 has the largest sample

(A) Means for Oneway Anova

Level   Number   Mean       Std Error   Lower 95%   Upper 95%
1       181      0.554202   0.01453     0.52556     0.5828
2       22       0.781508   0.04168     0.69934     0.8637
3       5        0.846212   0.08742     0.67386     1.0186

Std Error uses a pooled estimate of error variance

(B) Analysis of Variance

Source     DF    Sum of Squares   Mean Square   F Ratio
Cluster    2     1.3623642        0.681182      17.8269
Error      205   7.8332278        0.038211
C. Total   207   9.1955920

Prob > F irði; jÞ jIj $ ðjIj  1Þ

(6.4)


$$\rho_{condition} = \frac{\sum_{i \in J}\; \sum_{j \in J,\, j > i} \rho(i, j)}{|J| \cdot (|J| - 1)} \qquad (6.5)$$

ASR is defined as

$$ASR(B) = 2 \cdot \max\left\{\rho_{gene},\ \rho_{condition}\right\} \qquad (6.6)$$
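As a hedged illustration of Eqs. (6.4)–(6.6), the sketch below computes ASR for a bicluster stored as a NumPy array; it is not the implementation referenced in this chapter.

```python
# Hedged sketch: Average Spearman's Rho (ASR) of a bicluster B
# (rows = genes, columns = conditions).
import numpy as np
from scipy.stats import spearmanr

def asr(B):
    def avg_rho(vectors):
        n = len(vectors)
        if n < 2:
            return 0.0
        total = sum(spearmanr(vectors[i], vectors[j])[0]
                    for i in range(n) for j in range(i + 1, n))
        return total / (n * (n - 1))    # matches the |I|(|I|-1) style denominator
    rho_gene = avg_rho(list(B))         # correlations between pairs of rows
    rho_cond = avg_rho(list(B.T))       # correlations between pairs of columns
    return 2.0 * max(rho_gene, rho_cond)

# A perfect shifting pattern (each row is an offset of the first) gives ASR = 1.
B = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [0.5, 1.5, 2.5, 3.5]])
print(asr(B))  # expected: 1.0
```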

The ASR value lies in the range [−1, 1], where both −1 and 1 represent a perfect trend-preserving bicluster. ASR is one of the few bicluster quality measures that can detect both shifting and scaling patterns of biclusters as well as shift-scale (combined pattern) biclusters (Pontes, Girldez, et al., 2015).

Submatrix Correlation Score (SCS). The Submatrix Correlation Score (SCS) (Yang, Dai, & Yan, 2011) is based on the Pearson correlation measure. SCS is defined from the row or column of the bicluster exhibiting the largest (in magnitude) average correlation with the other rows or columns, respectively. Let

$$S_{row} = \min_{i_1 \in I}\left(1 - \frac{\sum_{i_2 \in I,\, i_2 \neq i_1} \left| r(r_{i_1}, r_{i_2}) \right|}{|I| - 1}\right) \qquad (6.7)$$

$$S_{col} = \min_{j_1 \in J}\left(1 - \frac{\sum_{j_2 \in J,\, j_2 \neq j_1} \left| r(c_{j_1}, c_{j_2}) \right|}{|J| - 1}\right) \qquad (6.8)$$

where r denotes the Pearson correlation, r_i denotes row vector i of the bicluster with I rows, and c_j denotes column vector j of the bicluster with J columns. SCS is then given by the minimum of these two values: SCS = min{S_row, S_col}. SCS can detect shift, scale, and shift-scale biclusters; however, it exhibits some difficulty with trend-preserving biclusters given that it focuses on linearly correlated biclusters.

Transposed Virtual Error (VE^T). Transposed Virtual Error (VE^T) (Pontes et al., 2010) is another bicluster quality measure that correctly identifies shift, scale, and shift-scale biclusters. Transposed Virtual Error is an improvement on Virtual Error (VE) (Pontes et al., 2007), which does not identify shift-scale biclusters. Both VE and VE^T require standardized biclusters. A bicluster is standardized by subtracting the row mean from each element of the bicluster and dividing by the row standard deviation, i.e.,

$$\hat{B}_{ij} = \frac{B_{ij} - \mu_{iJ}}{\sigma_{iJ}}, \qquad i = 1, 2, \ldots, |I|, \quad j = 1, 2, \ldots, |J| \qquad (6.9)$$

where μ_{iJ} is the mean of row i in B, and σ_{iJ} is the standard deviation of row i in B. VE computes a virtual gene ρ, which is a vector imitating a gene whose entries are column means across all genes in the bicluster. Explicitly, the standardized virtual gene ρ̂ is calculated for a standardized bicluster B̂ as

$$\hat{\rho}_{j} = \frac{1}{|I|} \sum_{i=1}^{|I|} \hat{B}_{ij}, \qquad j = 1, 2, \ldots, |J| \qquad (6.10)$$


Finally, VE is defined as

$$VE(B) = \frac{1}{|I| \cdot |J|} \sum_{i=1}^{|I|} \sum_{j=1}^{|J|} \left| \hat{B}_{ij} - \hat{\rho}_{j} \right| \qquad (6.11)$$

To compute VE^T, the bicluster should be transposed prior to calculating VE; that is, the standardization of Eq. (6.9) is applied to the transposed bicluster. VE^T computes a virtual condition ρ and measures the deviation of the conditions in the bicluster from ρ. With B̂ denoting the bicluster standardized in this transposed sense, the virtual condition is calculated as

$$\hat{\rho}_{i} = \frac{1}{|J|} \sum_{j=1}^{|J|} \hat{B}_{ij}, \qquad i = 1, 2, \ldots, |I| \qquad (6.12)$$

and VE^T is calculated as

$$VE^{T}(B) = \frac{1}{|I| \cdot |J|} \sum_{i=1}^{|I|} \sum_{j=1}^{|J|} \left| \hat{B}_{ij} - \hat{\rho}_{i} \right| \qquad (6.13)$$
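A hedged sketch of VE and VE^T (Eqs. 6.9–6.13) follows, with VE^T implemented, as described above, by applying VE to the transposed bicluster; the example array is illustrative only.

```python
# Hedged sketch: Virtual Error (VE) and Transposed Virtual Error (VE^T).
import numpy as np

def ve(B):
    """VE of a bicluster: standardize rows and compare with the virtual gene."""
    mu = B.mean(axis=1, keepdims=True)
    sigma = B.std(axis=1, keepdims=True)
    if np.any(sigma == 0):           # constant row: value set to 1, as in the text
        return 1.0
    B_hat = (B - mu) / sigma                  # Eq. (6.9)
    rho_hat = B_hat.mean(axis=0)              # Eq. (6.10), virtual gene
    return np.mean(np.abs(B_hat - rho_hat))   # Eq. (6.11)

def vet(B):
    """VE^T: VE applied to the transposed bicluster."""
    return ve(np.asarray(B).T)

# A shift-scale pattern (each column is an affine transform of a base profile)
# yields VE^T = 0.
p = np.array([1.0, 2.0, 4.0, 7.0])                  # base gene profile
alpha, beta = np.array([1.0, 2.0, 0.5]), np.array([0.0, 3.0, -1.0])
B = np.outer(p, alpha) + beta                       # B[i, j] = p[i]*alpha[j] + beta[j]
print(vet(B))   # expected: 0.0 (up to floating-point error)
```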

VE^T is equal to zero for perfect shifting, scaling, or shift-scale patterns.

Special Cases for VE^T. Constant rows in expression data pose an issue when computing VE^T. When one or more rows are constant, the standard deviation of at least one row is zero, and thus, the result of Eq. (6.13) is undefined. A constant row is highly unlikely in real data applications, so a standard deviation of zero should be a non-issue. For the context of this work with synthetic data, VE^T is set to one if any zero-division errors occur. This does produce false negatives in the case where a constant row is part of a constant bicluster.

ii. External validation measures

This section describes commonly used external validation metrics, the relevance and recovery scores. For a given dataset D, let S(A_i) denote the set of biclusters returned by applying a specific biclustering algorithm A_i on D, while G denotes the corresponding set of known ground truth biclusters for D. The relevance score, MS(S, G), is a measure of the extent to which the generated biclusters S(A_i) are similar to the ground truth biclusters in the gene (row) dimension. The recovery score, given by MS(G, S), quantifies the proportion of the subset of G that was retrieved by A_i. A high relevance score implies that a large percentage of the biclusters discovered by the algorithm are significant, while a high recovery score indicates that a large percentage of the actual ground truth biclusters are very similar to the ones returned by the algorithm. The relevance and recovery scores are derived from the match score (Prelic et al., 2006). The match score (MS) between two sets of biclusters S_1 and S_2 is defined as:

$$MS(S_1, S_2) = \frac{1}{|S_1|} \sum_{B_1 \in S_1} \max_{B_2 \in S_2} \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|} \qquad (6.14)$$

which reflects the average of the maximum similarity for all biclusters B_1 in S_1 with respect to the biclusters B_2 in S_2. The intersection of two biclusters B_1 ∈ S_1 and B_2 ∈ S_2


denotes the set of rows common to both B_1 and B_2. Similarly, the union of two biclusters is the set of rows that exist in either B_1 or B_2 or both. The match score takes on values between 0 and 1, inclusive. In the case where no rows of any bicluster in S_1 are found in any bicluster in S_2, |B_1 ∩ B_2| = 0 for all possible B_1 ∈ S_1, B_2 ∈ S_2, and consequently MS = 0 (Eq. 6.14). Similarly, if the sets of biclusters S_1 and S_2 are identical, then each bicluster B_1 has a match B_2 with |B_1 ∩ B_2| = |B_1 ∪ B_2|, so every term equals one, yielding a match score of one. The match score is also referred to as a similarity score (Wang et al., 2016).
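A hedged sketch of Eq. (6.14) is given below, with each bicluster represented only by its set of row indices (consistent with the row-based definition above); the example sets are made up.

```python
# Hedged sketch: match score, relevance and recovery over sets of row indices.
def match_score(S1, S2):
    """MS(S1, S2): average best Jaccard overlap of each bicluster in S1 with S2."""
    if not S1:
        return 0.0
    total = 0.0
    for b1 in S1:
        total += max(((len(b1 & b2) / len(b1 | b2)) if (b1 | b2) else 0.0
                      for b2 in S2), default=0.0)
    return total / len(S1)

found = [{0, 1, 2, 3}, {10, 11, 12}]           # biclusters returned by an algorithm
truth = [{0, 1, 2, 3, 4}, {10, 11, 12, 13}]    # ground-truth biclusters
print("relevance:", match_score(found, truth))  # MS(S, G)
print("recovery: ", match_score(truth, found))  # MS(G, S)
```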

6.5 Summary

This chapter has reviewed current trends in the computational analysis of genomic data. As discussed, there are different types of genomic data characterized by high dimensionality and high volume. Current methodologies for the analysis of DNA methylation data have been summarized. For genotype data, specifically SNPs, a framework was presented for merging clustering with family-based association testing. Biclustering is a readily applicable unsupervised learning technique for analyzing gene expression data. This chapter also reviewed the current state-of-the-art methods and discussed relevant internal and external validation metrics. Given that this field is rapidly evolving due to advancements in technology, it is anticipated that the types and formats of available data will also change. However, the fundamental questions related to personalized medicine that still need to be answered will remain relevant. The methodologies presented in this chapter attempt to address some of these fundamental questions.

References 23andMe. (2016). What is the difference between genotyping and sequencing?. Retrieved June 1, 2019, from 23andMe website: https://customercare.23andme.com/hc/en-us/articles/202904600-What-isthe-difference-between-genotyping-and-sequencing-. Aryee, M. J., Jaffe, A. E., Corrada-Bravo, H., Ladd-Acosta, C., Feinberg, A. P., Hansen, K. D., et al. (2014). Minfi: A flexible and comprehensive bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics (Oxford, England), 30(10), 1363e1369. https://doi.org/10. 1093/bioinformatics/btu049. Ayadi, W., Elloumi, M., & Hao, J.-K. (2009). A biclustering algorithm based on a bicluster enumeration tree: Application to DNA microarray data. BioData Mining, 2(1), 9. https://doi.org/10.1186/17560381-2-9. Barkow, S., Bleuler, S., Prelic, A., Zimmermann, P., & Zitzler, E. (2006). BicAT: A biclustering analysis toolbox. Bioinformatics, 22(10), 1282e1283. https://doi.org/10.1093/bioinformatics/btl099. Ben-Dor, A., Chor, B., Karp, R., & Yakhini, Z. (2003). Discovering local structure in gene expression data: The order-preserving submatrix problem. Journal of Computational Biology : A Journal of Computational Molecular Cell Biology, 10(3e4), 373e384. https://doi.org/10.1089/ 10665270360688075. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1), 289e300. https://doi.org/ 10.1017/CBO9781107415324.004.


Bergmann, S., Ihmels, J., & Barkai, N. (2003). Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67(3). https://doi.org/10.1103/PhysRevE.67.031902. Bibikova, M., Le, J., Barnes, B., Saedinia-Melnyk, S., Zhou, L., Shen, R., et al. (2009). Genome-wide DNA methylation profiling using Infinium  assay. Epigenomics, 1(1), 177e200. https://doi.org/10.2217/ epi.09.14. Bland, j. M., & Altman, D. G. (1995). Multiple significance tests: The Bonferroni method. BMJ, 310(6973), 170. https://doi.org/10.1136/bmj.310.6973.170. Bleuler, S., Prelic, A., & Zitzler, E. (2004). An EA framework for biclustering of gene expression data. In Proceedings of the 2004 congress on evolutionary computation (IEEE cat. No.04TH8753) (pp. 166e173). https://doi.org/10.1109/CEC.2004.1330853. Brito da Silva, L. E., Elnabarawy, I., & Wunsch, D. C. (2019). A survey of adaptive resonance theory neural network models for engineering applications. Neural Networks, 1e43. Butcher, L. M., & Beck, S. (2015). Probe lasso: A novel method to rope in differentially methylated regions with 450K DNA methylation data. Methods (San Diego, Calif.), 72, 21e28. https://doi.org/10.1016/j. ymeth.2014.10.036. Caldas, J., & Kaski, S. (2008). Bayesian biclustering with the plaid model. In 2008 IEEE workshop on machine learning for signal processing (pp. 291e296). https://doi.org/10.1109/MLSP.2008.4685495. Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3(5), 698e713. https://doi.org/10.1109/72.159059. Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4(6), 759e771. https://doi.org/ 10.1016/0893-6080(91)90056-B. Chekouo, T., & Murua, A. (2015). The penalized biclustering model and related algorithms. Journal of Applied Statistics, 42(6), 1255e1277. https://doi.org/10.1080/02664763.2014.999647. Cheng, Y., & Church, G. M. (2000). Biclustering of expression data. Proceedings/. In International conference on intelligent systems for molecular Biology ; ISMB. International conference on intelligent systems for molecular biology. https://doi.org/10.1007/11564126. Cheng and Church. (2013). Retrieved June 1, 2019, from Kemal Eren website: http://www.kemaleren. com/post/cheng-and-church/. Clifford, H., Wessely, F., Pendurthi, S., & Emes, R. D. (2011). Comparison of clustering methods for investigation of genome-wide methylation array data. Frontiers in Genetics, 2, 88. https://doi.org/10. 3389/fgene.2011.00088. CLSLabMSU. (2017). Biclustering algorithm comparison. Retrieved June 1, 2019, from GitHub website: https://github.com/clslabMSU/Biclustering-Algorithm-Comparison. Collins, F. S., & Barker, A. D. (March 2007). Mapping the cancer genome. Scientific America. Csardi, G., Kutalik, Z., & Bergmann, S. (2010). Modular analysis of gene expression data with R. Bioinformatics, 26(10), 1376e1377. https://doi.org/10.1093/bioinformatics/btq130. Dale, J., Zhao, J., & Obafemi-Ajayi, T. (2019). Multi-objective optimization approach to find biclusters in gene expression data. In IEEE conference on computational intelligence in bioinformatics and computational biology (CIBCB). Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. 
IEEE Transactions on Evolutionary Computation, 6(2), 182e197. https://doi.org/10.1109/ 4235.996017. Doerge, R. W. (2002). Mapping and analysis of quantitative trait loci in experimental populations. Nature Reviews Genetics, 3(1), 43e52. https://doi.org/10.1038/nrg703.


Du, P., Zhang, X., Huang, C.-C., Jafari, N., Kibbe, W. A., Hou, L., et al. (2010). Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11(1), 587. https://doi.org/10.1186/1471-2105-11-587. EpistasisLabAtUPenn. (2015). Ebic. Retrieved June 1, 2019, from GitHub website: https://github.com/ EpistasisLab/ebic. ¨ . V. (2012). A comparative analysis of biclustering Eren, K., Deveci, M., Ku¨c¸u¨ktunc¸, O., & C ¸ atalyu¨rek, U algorithms for gene expression data. Briefings in Bioinformatics, 14(3), 279e292. https://doi.org/10. 1093/bib/bbs032. Farkas, L. G. (1994). Anthropometry of the head and face in clinical practice (2nd ed.). New York, NY: Raven Press. Feinberg, A. P., Koldobskiy, M. A., & Go¨ndo¨r, A. (2016). Epigenetic modulators, modifiers and mediators in cancer aetiology and progression. Nature Reviews Genetics, 17(5), 284. https://doi.org/10.1038/nrg. 2016.13. Figueroa, M. E., Lugthart, S., Li, Y., Erpelinck-Verschueren, C., Deng, X., Christos, P. J., et al. (2010). DNA methylation signatures identify biologically distinct subtypes in acute myeloid leukemia. Cancer Cell, 17(1), 13e27. https://doi.org/10.1016/j.ccr.2009.11.020. Fogel, D. B. (1998). Evolutionary computation: The fossil record (1st ed.). Wiley-IEEE Press. Fortin, J.-P., Triche, T. J., & Hansen, K. D. (2017). Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi. Bioinformatics (Oxford, England), 33(4), 558e560. https://doi.org/10.1093/bioinformatics/btw691. Gallo, C. A., Carballido, J. A., & Ponzoni, I. (2009). BiHEA: A hybrid evolutionary approach for microarray biclustering. https://doi.org/10.1007/978-3-642-03223-3_4. Georgiades, S., Szatmari, P., & Boyle, M. (2013). Importance of studying heterogeneity in autism. Neuropsychiatry, 3(2), 123e125. https://doi.org/10.2217/npy.13.8. Gu, J., & Liu, J. S. (2008). Bayesian biclustering of gene expression data. BMC Genomics, 9(Suppl. 1), S4. https://doi.org/10.1186/1471-2164-9-S1-S4. Henriques, R. (2015). BicPAMS. Retrieved June 1, 2019, from https://web.ist.utl.pt./rmch/bicpams/. Henriques, R., Ferreira, F. L., & Madeira, S. C. (2017). BicPAMS: Software for biological data analysis with pattern-based biclustering. BMC Bioinformatics, 18(1), 82. https://doi.org/10.1186/s12859-017-1493-3. Henriques, R., & Madeira, S. C. (2014a). BicPAM: Pattern-based biclustering for biomedical data analysis. Algorithms for Molecular Biology, 9(1), 27. https://doi.org/10.1186/s13015-014-0027-z. Henriques, R., & Madeira, S. C. (2014b). BicSPAM: Flexible biclustering using sequential patterns. BMC Bioinformatics, 15(1), 130. https://doi.org/10.1186/1471-2105-15-130. Henriques, R., & Madeira, S. C. (2015). Biclustering with flexible plaid models to unravel interactions between biological processes. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(4), 738e752. https://doi.org/10.1109/TCBB.2014.2388206. Henriques, R., & Madeira, S. C. (2016). BicNET: Flexible module discovery in large-scale biological networks using biclustering. Algorithms for Molecular Biology, 11(1), 14. https://doi.org/10.1186/ s13015-016-0074-8. Hochreiter, S., Bodenhofer, U., Heusel, M., Mayr, A., Mitterecker, A., Kasim, A., et al. (2010a). Fabia: Factor analysis for bicluster acquisition. Bioinformatics, 26(12), 1520e1527. https://doi.org/10.1093/ bioinformatics/btq227. Hochreiter, S., Bodenhofer, U., Heusel, M., Mayr, A., Mitterecker, A., Kasim, A., et al. (2010b). 
Fabia: Factor analysis for bicluster acquisition. https://doi.org/10.18129/B9.bioc.fabia.


Houseman, E. A., Accomando, W. P., Koestler, D. C., Christensen, B. C., Marsit, C. J., Nelson, H. H., et al. (2012). DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics, 13(1), 86. https://doi.org/10.1186/1471-2105-13-86. Huang, Q., Tao, D., Li, X., & Liew, A. (2012). Parallelized evolutionary learning for detection of biclusters in gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(2), 560e570. https://doi.org/10.1109/TCBB.2011.53. Illumina. (2015). Illumina methylation BeadChips achieve breadth of coverage using 2 Infinium chemistries. Retrieved June 1, 2019, from Illumina, Inc. website: https://www.illumina.com/content/dam/illuminamarketing/documents/products/technotes/technote_hm450_data_analysis_optimization.pdf. Illumina. (2019). Comprehensive coverage for epigenome-wide association studies. Retrieved June 1, 2019, from Illumina, Inc. website: https://www.illumina.com/techniques/microarrays/methylationarrays.html. Jaffe, A. E., Murakami, P., Lee, H., Leek, J. T., Fallin, M. D., Feinberg, A. P., et al. (2012). Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. International Journal of Epidemiology, 41(1), 200e209. https://doi.org/10.1093/ije/dyr238. Jin, Z., & Liu, Y. (2018). DNA methylation in human diseases. Genes & Diseases, 5(1), 1e8. https://doi.org/ 10.1016/j.gendis.2018.01.002. Jones, P. A. (2012). Functions of DNA methylation: Islands, start sites, gene bodies and beyond. Nature Reviews Genetics, 13(7), 484e492. https://doi.org/10.1038/nrg3230. Kaminskas, E., Farrell, A., Abraham, S., Baird, A., Hsieh, L.-S., Lee, S.-L., et al. (2005). Approval summary: Azacitidine for treatment of myelodysplastic syndrome subtypes. Clinical Cancer Research, 11(10), 3604e3608. https://doi.org/10.1158/1078-0432.CCR-04-2135. Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene expression data. Computers in Biology and Medicine, 38(3), 283e293. https://doi.org/10.1016/j. compbiomed.2007.11.001. Kiselev, V. Y., Andrews, T. S., & Hemberg, M. (2019). Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics, 20(5), 273e282. https://doi.org/10.1038/s41576-018-0088-9. Kriegel, H.-P., Kro¨ger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering and correlation clustering. ACM Transactions on Knowledge Discovery from Data, 3(1), 1e58. https://doi.org/10.1145/1497577.1497578. Kurdyukov, S., & Bullock, M. (2016). DNA methylation analysis: Choosing the right method. Biology, 5(1), 3. https://doi.org/10.3390/biology5010003. Laird, N. M., & Lange, C. (2006). Family-based designs in the age of large-scale gene-association studies. Nature Reviews. Genetics, 7(5), 385e394. https://doi.org/10.1038/nrg1839. Laird, N. M., & Lange, C. (2011). The fundamentals of modern statistical genetics (1st ed.) https://doi.org/ 10.1007/978-1-4419-7338-2. Lakizadeh, A., & Jalili, S. (2016). BiCAMWI: A genetic-based biclustering algorithm for detecting dynamic protein Complexes. PLOS ONE, 11(7), e0159923. https://doi.org/10.1371/journal.pone.0159923. Langdon, W. B., & Poli, R. (2002). Foundations of genetic programming (1st ed.) https://doi.org/10.1007/ 978-3-662-04726-2. Leek, J. T., & Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9), 1724e1735. https://doi.org/10.1371/journal.pgen.0030161. 
Lehmann, E. L. (1975). Nonparametrics: Statistical methods based on ranks (1st ed.). New York: SpringerVerlag. Lewis, C. M., & Knight, J. (2012). Introduction to genetic association studies. Cold Spring Harbor Protocols, 2012(3), 297e306. https://doi.org/10.1101/pdb.top068163.


Li, G., Ma, Q., Tang, H., Paterson, A. H., & Xu, Y. (2009). Qubic: A qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Research, 37(15), e101. https://doi.org/10.1093/nar/ gkp491. Liu, Y., Aryee, M. J., Padyukov, L., Fallin, M. D., Hesselberg, E., Runarsson, A., et al. (2013). Epigenomewide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nature Biotechnology, 31, 142e147. https://doi.org/10.1038/nbt.2487. Liu, Y., Li, X., Aryee, M. J., Ekstro¨m, T. J., Padyukov, L., Klareskog, L., et al. (2014). GeMes, clusters of DNA methylation under genetic control, can inform genetic and epigenetic analysis of disease. American Journal of Human Genetics, 94(4), 485e495. https://doi.org/10.1016/j.ajhg.2014.02.011. Li, D., Xie, Z., Le Pape, M., & Dye, T. (2015). An evaluation of statistical methods for DNA methylation microarray data analysis. BMC Bioinformatics, 16(1), 217. https://doi.org/10.1186/s12859-015-0641-x. Li, E., & Zhang, Y. (2014). DNA methylation in mammals. Cold Spring Harbor Perspectives in Biology, 6(5). https://doi.org/10.1101/cshperspect.a019133. Madeira, S. C., & Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE/ ACM Trans. Comput. Biol. Bioinforma., 1(1), 24e45. https://doi.org/10.1109/TCBB.2004.2. Mallik, S., Odom, G. J., Gao, Z., Gomez, L., Chen, X., & Wang, L. (2018). An evaluation of supervised methods for identifying differentially methylated regions in Illumina methylation arrays. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby085. Minicat157. (2016). Unibic. Retrieved from Source Forge website: https://sourceforge.net/projects/unibic/. Mitchell, M. (1998). An introduction to genetic algorithms (1st ed.). MIT Press. Morris, T. J., Butcher, L. M., Feber, A., Teschendorff, A. E., Chakravarthy, A. R., Wojdacz, T. K., et al. (2014). ChAMP: 450k Chip analysis methylation pipeline. Bioinformatics (Oxford, England), 30(3), 428e430. https://doi.org/10.1093/bioinformatics/btt684. Mukhopadhyay, A., Maulik, U., & Bandyopadhyay, S. (2009). A novel coherence measure for discovering scaling biclusters from gene expression data. Journal of Bioinformatics and Computational Biology, 07(05), 853e868. https://doi.org/10.1142/S0219720009004370. Murphy, A., Weiss, S. T., & Lange, C. (2008). Screening and replication using the same data set: Testing strategies for family-based studies in which all probands are affected. PLoS Genetics, 4(9), e1000197. https://doi.org/10.1371/journal.pgen.1000197. Nardone, S., Sams, D. S., Zito, A., Reuveni, E., & Elliott, E. (2017). Dysregulation of cortical neuron DNA methylation profile in autism spectrum disorder. Cerebral Cortex, 27(12), 5739e5754. https://doi. org/10.1093/cercor/bhx250. Obafemi-Ajayi, T., Miles, J. H., Takahashi, T. N., Qi, W., Aldridge, K., Zhang, M., et al. (2014). Facial structure analysis separates autism spectrum disorders into meaningful clinical subgroups. Journal of Autism and Developmental Disorders, 45(5), 1302e1317. https://doi.org/10.1007/s10803-014-2290-8. Obafemi-Ajayi, T., Settles, L., Su, Y., Germeroth, C., Olbricht, G. R., Wunsch, D. C., et al. (2017). Genetic variant analysis of boys with autism: A pilot study on linking facial phenotype to genotype. IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 253e257. https://doi.org/10. 1109/BIBM.2017.8217658. Oghabian, A., Kilpinen, S., Hautaniemi, S., & Czeizler, E. (2014). 
Biclustering methods: Biological relevance and application in gene expression analysis. PLoS ONE, 9(3), e90801. https://doi.org/10.1371/ journal.pone.0090801. Orzechowski, P., Sipper, M., Huang, X., & Moore, J. H. (2018). Ebic: An evolutionary-based parallel biclustering algorithm for pattern discovery. Bioinformatics, 34(21), 3719e3726. https://doi.org/10. 1093/bioinformatics/bty401. Pedersen, B. S., Schwartz, D. A., Yang, I. V., & Kechris, K. J. (2012). Comb-p: Software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics (Oxford, England), 28(22), 2986e2988. https://doi.org/10.1093/bioinformatics/bts545.


Peters, T. J., Buckley, M. J., Statham, A. L., Pidsley, R., Samaras, K., V Lord, R., et al. (2015). De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin, 8(1), 6. https://doi.org/10.1186/1756-8935-8-6. Pidsley, R., Zotenko, E., Peters, T. J., Lawrence, M. G., Risbridger, G. P., Molloy, P., et al. (2016). Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome Biology, 17(1), 208. https://doi.org/10.1186/s13059-016-1066-1. Pontes, B., Divina, F., Gira´ldez, R., & AguilareRuiz, J. S. (2007). Virtual error: A new measure for evolutionary biclustering. Evolutionary Computation,Machine Learning and Data Mining in Bioinformatics, 217e226. https://doi.org/10.1007/978-3-540-71783-6_21. Pontes, B., Gira´ldez, R., & Aguilar-Ruiz, J. S. (2010). Measuring the quality of shifting and scaling patterns in biclusters. Pattern Recognition in Bioinformatics, 242e252. https://doi.org/10.1007/978-3-64216001-1_21. Pontes, B., Gira´ldez, R., & Aguilar-Ruiz, J. S. (2013). Configurable pattern-based evolutionary biclustering of gene expression data. Algorithms for Molecular Biology, 8(1), 4. https://doi.org/10.1186/17487188-8-4. Pontes, B., Gira´ldez, R., & Aguilar-Ruiz, J. S. (2015). Biclustering on expression data: A review. Journal of Biomedical Informatics, 57, 163e180. https://doi.org/10.1016/j.jbi.2015.06.028. Pontes, B., Girldez, R., & Aguilar-Ruiz, J. S. (2015). Quality measures for gene expression biclusters. PLOS ONE, 10(3), e0115497. https://doi.org/10.1371/journal.pone.0115497. Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Bu¨hlmann, P., Gruissem, W., et al. (2006). A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics (Oxford, England), 22(9), 1122e1129. https://doi.org/10.1093/bioinformatics/btl060. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., A R Ferreira, M., Bender, D., et al. (2007). Plink: A toolset for whole genome association and population-based linkage analyses. American Journal of Human Genetics, 81. Rappoport, N., & Shamir, R. (2018). Multi-omic and multi-view clustering algorithms: Review and cancer benchmark. Nucleic Acids Research, 46(20), 10546e10562. https://doi.org/10.1093/nar/gky889. Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., et al. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47. https://doi.org/10.1093/nar/gkv007. Robinson, M. D., Kahraman, A., Law, C. W., Lindsay, H., Nowicka, M., Weber, L. M., et al. (2014). Statistical methods for detecting differentially methylated loci and regions. Frontiers in Genetics, 5(324), 1e7. https://doi.org/10.3389/fgene.2014.00324. Roy, S., Bhattacharyya, D. K., & Kalita, J. K. (2016). Analysis of gene expression patterns using biclustering. In Methods in molecular biology (Vol. 1375, pp. 91e103). https://doi.org/10.1007/7651_2015_280. Saria, S., & Goldenberg, A. (2015). Subtyping: What it is and its role in precision medicine. IEEE Intelligent Systems, 30(4), 70e75. https://doi.org/10.1109/MIS.2015.60. Schwartzman, O., & Tanay, A. (2015). Single-cell epigenomics: Techniques and emerging applications. Nature Reviews. Genetics, 16(12), 716e726. https://doi.org/10.1038/nrg3980. Sofer, T., Schifano, E. D., Hoppin, J. A., Hou, L., & Baccarelli, A. A. (2013). 
A-clustering: A novel method for the detection of co-regulated methylation regions, and regions associated with exposure. Bioinformatics, 29(22), 2884e2891. https://doi.org/10.1093/bioinformatics/btt498. Spielman, R. S., McGinnis, R. E., & Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics, 52(3), 506e516. Stegle, O., Teichmann, S. A., & Marioni, J. C. (2015). Computational and analytical challenges in singlecell transcriptomics. Nature Reviews Genetics, 16(3), 133e145. https://doi.org/10.1038/nrg3833.


Teschendorff, A. E., Zhuang, J., & Widschwendter, M. (2011). Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics (Oxford, England), 27(11), 1496e1505. https://doi.org/10.1093/bioinformatics/btr171. The ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57e74. https://doi.org/10.1038/nature11247. Tian, Y., Morris, T. J., Webster, A. P., Yang, Z., Beck, S., Feber, A., et al. (2017). ChAMP: Updated methylation analysis pipeline for Illumina BeadChips. Bioinformatics (Oxford, England), 33(24), 3982e3984. https://doi.org/10.1093/bioinformatics/btx513. Van Steen, K., & Lange, C. (2005). Pbat: A comprehensive software package for genome-wide association analysis of complex family-based studies. Human Genomics, 2(1), 67e69. Wade, N. (September 4, 2007). In The genome race, the sequel is personal (New York Times). Wang, Z., Li, G., Robinson, R. W., & Huang, X. (2016). UniBic: Sequential row-based biclustering algorithm for analysis of gene expression data. Scientific Reports, 6(1), 23466. https://doi.org/10.1038/ srep23466. Wang, D., Yan, L., Hu, Q., Sucheston, L. E., Higgins, M. J., Ambrosone, C. B., et al. (2012). Ima: an R package for high-throughput analysis of Illumina’s 450K Infinium methylation data. Bioinformatics (Oxford, England), 28(5), 729e730. https://doi.org/10.1093/bioinformatics/bts013. Wilhelm-Benartzi, C. S., Koestler, D. C., Karagas, M. R., Flanagan, J. M., Christensen, B. C., Kelsey, K. T., et al. (2013). Review of processing and analysis methods for DNA methylation array data. British Journal of Cancer, 109(6), 1394e1402. https://doi.org/10.1038/bjc.2013.496. Wunsch, D. C. (2009). ART properties of interest in engineering applications. In IEEE/INNS international joint conference on neural networks (pp. 3556e3559). Atlanta, GA, USA: IEEE Press. Wunsch II, D. C., Xu, R., & Kim, S. (2015). Patent No. US 9043326 B2. U.S.: U.S. Patent Office. Xu, R., & Wunsch, D. C. (2011). Bartmap: A viable structure for biclustering. Neural Networks, 24(7), 709e716. https://doi.org/10.1016/j.neunet.2011.03.020. Yang, W.-H., Dai, D.-Q., & Yan, H. (2011). Finding correlated biclusters from gene expression data. IEEE Transactions on Knowledge and Data Engineering, 23(4), 568e584. https://doi.org/10.1109/TKDE. 2010.150. Yang, J., Wang, H., Wang, W., & Yu, P. S. (2005). AN IMPROVED BICLUSTERING METHOD FOR ANALYZING GENE EXPRESSION PROFILES an improved biclustering method for analyzing gene expression profiles. International Journal on Artificial Intelligence Tools, 14(5), 771e789. https://doi. org/10.1142/S0218213005002387. Yeoh, E.-J., Ross, M. E., Shurtleff, S. A., Williams, W. K., Patel, D., Mahfouz, R., et al. (2002). Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2), 133e143. Zablotsky, B., Black, L. I., Maenner, M. J., Schieve, L. A., & Blumberg, S. J. (2015). Estimated prevalence of autism and other developmental disabilities following questionnaire changes in the 2014 national health Interview survey. National Health Statistics Reports, (87), 1e20. Zhang, J. (2010). A Bayesian model for biclustering with applications. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(4), 635e656. https://doi.org/10.1111/j.1467-9876.2010.00716.x. Zhang, Q., Zhao, Y., Zhang, R., Wei, Y., Yi, H., Shao, F., et al. (2016). 
A comparative study of five association tests based on CpG set for epigenome-wide association studies. PloS One, 11(6), e0156895. https://doi.org/10.1371/journal.pone.0156895. Zhao, S., Fung-Leung, W.-P., Bittner, A., Ngo, K., & Liu, X. (2014). Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS ONE, 9(1). https://doi.org/10.1371/journal. pone.0078644.

7 Evaluation of cluster validation metrics

7.1 Introduction

Unsupervised learning, also known as clustering (see Chapter 3), is increasingly popular because it efficiently organizes an overwhelming amount of data (Jain, 2010; Xu & Wunsch, 2009). Cluster analysis has been applied to a wide range of applications as an exploratory tool to enhance knowledge discovery. For example, in biomedical applications, cluster analysis aids in disease subtyping, i.e., the task of identifying homogeneous patient subgroups that can guide prognosis, treatment decisions, and possibly predict outcomes or recurrence risks (Saria & Goldenberg, 2015). This naturally translates to clustering: finding meaningful subgroups in a given dataset. Here, the definition of "meaningful" is problem-dependent. For a given clustering algorithm, multiple results can be obtained from the same dataset by varying parameters. Ultimately, the validity of any subgrouping depends on whether the computationally discovered subgroups actually uncover a domain-specified variation that is meaningful and significant in the domain application. To guide a successful cluster analysis, one uses a quality measure to quantitatively evaluate how well the resulting set of partitions fits the input data. Cluster validity refers to the formal procedures used to evaluate clustering results in a quantitative and objective fashion (Jain, 2010). It determines which set of clusters is optimal for approximating the underlying subgroups in the dataset as well as how many clusters exist in the data. Broadly, two main types of cluster validation indices (CVI) have been proposed and studied extensively in the literature: external validation indices and internal indices (Kovács, Legány, & Babos, 2005; Liu, Li, Xiong, Gao, & Wu, 2010; Xu & Wunsch, 2009). Relative CVIs (Brun et al., 2007) have also been discussed as a third category, but they are not considered in the context of this work, as their applicability is limited and they have proven to be approximately as effective as internal CVIs (Brun et al., 2007). Internal CVIs are very important in the context of clustering since there is no ground truth (or labeled data) available. They are used to determine the best partitioning for the data based on the inherent structural properties (compactness and separability) of that data. They are also used to assess the general effectiveness of a clustering method. Internal CVIs have been proposed and evaluated in (Arbelaitz, Gurrutxaga, Muguerza, Pérez, & Perona, 2013; Brun et al., 2007; Dubes, 1987; Halkidi, Batistakis, & Vazirgiannis, 2001; Kovács et al., 2005; Liu et al., 2010, 2013; Maulik & Bandyopadhyay, 2002; Milligan & Cooper, 1985; Vendramin, Campello, & Hruschka, 2010).


The majority of prior CVI evaluation analyses/surveys are focused on which set of indices can determine the optimal number of clusters in the data. However, that is not enough to guarantee that it is the optimal set of clusters for the data. When considering applications in which clustering is highly beneficial, it is important to discover meaningful subgroups, not just the optimal number of clusters. Domain experts are very interested in understanding the features that define the resulting subgroups generated by a clustering algorithm, not just the number of clusters found. The question then remains: which set of indices is most reliable? No one-size-fits-all solution is available. In light of this, is there a way to combine the results generated from multiple indices to leverage their individual benefits jointly? It is imperative to develop quality measures capable of identifying optimal partitions for a given dataset. The benefits of processing and inferring information from unlabeled data for a variety of domain problems are undeniable. Thus, improving the quality measures for unsupervised learning remains a vital task. In this work, the authors investigate a statistics-based evaluation framework to empirically assess the performance of five highly cited internal CVIs on real datasets using six common clustering algorithms. The objective is threefold: i) to assess the consistency/reliability of the internal CVIs in accurately determining the optimal partitioning of the data (not just the optimal number of clusters) using rigorous statistical analysis; ii) to assess the performance of CVIs in relation to diverse clustering algorithms as well as diverse distributions/complexities of the datasets; iii) to provide a guide for combining the results from multiple CVIs, using an ensemble validation paradigm, to accurately determine the optimal clustering scheme/configuration for a given dataset.

7.2 Related works

Multiple papers have investigated the performance of various internal CVIs, usually based on their success in identifying the optimal number of partitions in the dataset, using one or two clustering algorithms. The empirical evaluations conducted in (Kovács et al., 2005; Liu et al., 2010; Milligan & Cooper, 1985) were specifically performed on a wide range of synthetic datasets. The analysis carried out by Maulik and Bandyopadhyay (Maulik & Bandyopadhyay, 2002) employed three artificial datasets and two real datasets. A model-based evaluation of internal CVIs in (Brun et al., 2007), again utilizing synthetic datasets, demonstrated that the performance of validity indices is highly variable. They observed that for the complex models for which the clustering algorithms performed poorly, internal indices failed to predict the error of the algorithm. They concluded that not much faith should be placed in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well correlated to the error rate of the clustering algorithm. It is well known that biological/biomedical datasets are usually complex. Hence the question remains: how can the clustering results of those datasets be evaluated to obtain reliable/meaningful results? A detailed study on the validation properties (monotonicity, noise, density, subclusters, and skewed distribution) of 11 widely used internal CVIs is presented in


(Liu et al., 2010) and extended to 12 in (Liu et al., 2013) to include a new metric based on the notion of nearest neighbors, CVNN. In (Arbelaitz et al., 2013), a general context-independent cluster evaluation process is presented for 30 different CVIs using 3 clustering algorithms on both synthetic and real datasets. Their results showed that the "best-performing" indices on the synthetic data were not the same for the real data. Their results ranked the indices in three groups by overall performance. The first group included Silhouette, Davies-Bouldin, Calinski-Harabasz, generalized Dunn, COP, and SDbw. This chapter evaluates Silhouette, Davies-Bouldin, Calinski-Harabasz, Dunn, and Xie-Beni, as described in detail in Section 7.3. Another approach that has been developed for validating the number of clusters present in a dataset is to view clustering as a supervised classification problem, in which the 'true' class labels must also be estimated (Tibshirani & Walther, 2005). The output labels from the clustering algorithm are used to train and build classification models to assess the quality of the clustering result. The basic idea is that 'true' class labels will improve the prediction strength of the classification models. Hence, the resulting "prediction strength" measure assesses the quality of the clustering results. In a review of recent work on alternative clusterings, Bailey (2013) notes that multiple clusterings can be reasonable for a given dataset and uses a set of constraints to determine the multiple partitions suited to the data. Vendramin et al. (2010) conduct their evaluation of the CVIs on the basis that even though an algorithm might correctly estimate the correct number of clusters contained in a given dataset, it still does not guarantee the quality of the clustering. They propose assessing the value of CVIs to correctly determine the quality of the clustering by applying external CVIs. Thus, good cluster validation indices would correlate well with external indices. It is expected that a good relative clustering validity measure will rank the partitions according to an ordering that is similar to those established by an external criterion, since external criteria rely on supervised information about the underlying structure in the data. They compute the Pearson correlation coefficient between the various clustering indices and the Jaccard index (external CVI) on the K-means clustering output from 972 synthetic datasets with varying numbers of clusters, sizes, and distributions. A statistical test is employed to determine statistically significant differences among the average correlation values for the different CVIs. This chapter builds on Vendramin et al.'s approach (Vendramin et al., 2010) of finding the correlation between internal and external CVIs to assess the performance of the internal CVIs specifically for biomedical data analysis. The approach is significantly extended by using 6 varied clustering algorithms (to ensure robustness of the results beyond K-means) on 14 real biological datasets. Additionally, three different external CVIs are considered. The Spearman (rather than the Pearson) correlation coefficient is employed to measure the correlation of the ranks of the partitions between internal and external indices. It is less sensitive to outliers and can quantify the strength of any monotonic relationship (not just linear ones).
This current work provides a more extensive evaluation of internal validation indices across real biological data on a variety of widely-used algorithms.


7.3 Background

Internal CVIs usually employ varied measures of the degree of separateness and compactness of the clusters in order to evaluate and determine an optimal clustering scheme (Kovács et al., 2005). "Compactness" measures how close the members of each cluster are to each other, i.e. cluster homogeneity/similarity, while "separateness" measures how well separated the clusters are from each other. The assumption is that a good clustering result should yield clusters that are compact and well separated. However, a key consideration is that the level of separateness and compactness desired may vary by domain application, since clustering is data-driven.

As an illustrative example, consider the commonly used iris dataset (Clifford, Wessely, Pendurthi, & Emes, 2011). It comprises 3 classes of 50 instances, each represented by a set of 4 features, where each class refers to a type of iris plant. It is known that one class is linearly separable from the other two, as illustrated in Fig. 7.1. The iris dataset is often employed to demonstrate how effective a clustering algorithm is at uncovering these three subgroups, even though two of them are not linearly separable from each other. The spectral clustering algorithm (Von Luxburg, 2007) was applied to cluster the dataset to reveal meaningful underlying subgroups. Since it is assumed that the number of clusters is not known a priori, the number of clusters k was varied from 2 to 7, and 5 different CVIs were applied to determine the optimal clustering configuration. According to Table 7.1, 4 of the CVIs selected the k = 2 result as the optimal scheme for the data. A visual comparison of the results (Fig. 7.1) demonstrates that at k = 2, the greatest separation is observed between the two clusters; at k = 3, the separation between clusters 2 and 3 is weaker, though the compactness is greater.

FIG. 7.1 Visualization of the iris dataset using Principal Component Analysis (PCA). The red, yellow and green overlays are the partitions obtained for the spectral clustering 3-cluster result, while the rectangular overlay denotes its 2-cluster result. The 2-cluster result was selected as optimal by four of the indices.


Table 7.1  Evaluation of the Spectral algorithm on iris data.

Internal CVIs                    Number of clusters (k)
                                 2        3        4        5        6        7
Silhouette (a)                   0.687    0.555    0.496    0.371    0.334    0.347
Davies-Bouldin (b)               0.766    1.962    3.155    4.612    5.983    6.900
Xie-Beni (b)                     0.065    0.160    0.278    0.558    0.475    0.447
Dunn (a)                         0.339    0.133    0.137    0.062    0.074    0.083
Calinski-Harabasz (CH) (a)       502.8    556.1    526.0    454.8    429.4    438.4

(a) The higher the value, the better. (b) The lower the value, the better the result.

Thus, the k = 3 result selected by the Calinski-Harabasz (CH) index is actually more meaningful and representative of the underlying complex iris structure in this context. This illustration exposes the potential flaw in using a CVI alone to determine the optimal clustering scheme, and/or in making a final decision based on the majority vote of a combination of CVIs. The issue with the traditional approach of simply applying one or more internal CVIs and deciding based only on a numeric value is a lack of understanding of the underlying structure of the real labeled dataset being used to assess the algorithm, and of which set of CVI(s) would be most appropriate to follow. In this case, CH was rightly aligned with the optimal scheme; however, that is not always the case, as is illustrated further below.

As previously mentioned, in clustering applications for biomedical data analysis, discovering the optimal partitions in the data is as important as, if not more important than, discovering the optimal number of subgroups. This is because the subgroups (clusters) identified are subsequently analyzed further to determine discriminant features and potential biomarkers. In this setting, internal CVIs are also used to compare the best results obtained across multiple algorithms to determine the optimal solution. For the iris dataset, consider the k = 3 clustering results of two other clustering algorithms (affinity propagation and K-means (Bailey, 2013)) in addition to spectral clustering. Their outcomes were evaluated using the CH index, given that it appeared to select the most reliable result in the case of spectral clustering applied to the iris data. Since the ground truth information is known for the iris data, the various k = 3 results can also be evaluated using the percentage accuracy (an external validation index). Accuracy specifies the percentage of data points correctly assigned to their proper partition and is commonly used in the context of supervised learning. Table 7.2 reveals a conflict between the CH index and Accuracy regarding the optimal clustering scheme. The K-means k = 3 result would be the most desirable going by the internal CVI (CH). However, according to the Accuracy index, the affinity propagation clustering result is optimal, as it attains an accuracy of 0.95.


Table 7.2  Evaluation of the Calinski-Harabasz index across multiple algorithms on iris data (results of the varied algorithms at the k = 3 scheme).

Validation indices          Affinity propagation    K-means    Spectral
Calinski-Harabasz (CH)      555.03                  561.63     556.12
Accuracy                    0.95                    0.92       0.90

This example raises a key question that this chapter attempts to address: how can the reliability of internal validation measures be improved, given that clustering is supposed to be an autonomous way to discover meaningful subgroups?
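As a concrete illustration of the workflow behind Tables 7.1 and 7.2, the following minimal Python sketch clusters the iris data with spectral clustering for k = 2 to 7 and scores each partition with three of the internal CVIs shipped with scikit-learn (silhouette, Calinski-Harabasz, Davies-Bouldin). This is not the exact pipeline used for the tables: the feature scaling and the spectral clustering settings are illustrative assumptions, so the numbers will not reproduce Table 7.1 exactly, and Dunn's and Xie-Beni are not available in scikit-learn (see the sketch in Section 7.3.1).

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import SpectralClustering
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Load and standardize the iris measurements (scaling is an assumption here).
X = StandardScaler().fit_transform(load_iris().data)

for k in range(2, 8):
    # Spectral clustering with a nearest-neighbor affinity (illustrative choice).
    labels = SpectralClustering(n_clusters=k, affinity='nearest_neighbors',
                                n_neighbors=10, random_state=0).fit_predict(X)
    print(f'k={k}  SI={silhouette_score(X, labels):.3f}  '       # higher is better
          f'CH={calinski_harabasz_score(X, labels):.1f}  '       # higher is better
          f'DB={davies_bouldin_score(X, labels):.3f}')           # lower is better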

7.3.1 Commonly used internal validation indices

A brief description of the commonly used internal validation metrics assessed in this work is provided here for context. The following describes 5 out of the 11 widely used indices in the literature, as rated in (Bailey, 2013). The notations and definitions employed are similar to those presented in (Liu et al., 2010). Let D denote the dataset; N: the number of objects in D; c: the center of D; k: the number of clusters; C_i: the i-th cluster; n_i: the number of objects in C_i; c_i: the center of C_i; d(x, y): the distance between x and y. The validation indices can be defined mathematically as follows:

(1) Silhouette index (SI) (Rousseeuw, 1987). This is a composite index that measures both the compactness (using the distance between all the points in the same cluster) and the separation of the clusters (based on the nearest-neighbor distance). It computes the pairwise difference of the between- and within-cluster distances. A larger average SI value indicates a better overall quality of the clustering result. Some variations have been proposed in the literature (Vendramin et al., 2010). The standard definition, given by (7.1), is:

SI = \frac{1}{k} \sum_{i} \left\{ \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max[b(x), a(x)]} \right\}    (7.1)

where a(x) and b(x) are defined as:

a(x) = \frac{1}{n_i - 1} \sum_{y \in C_i,\, y \neq x} d(x, y), \qquad b(x) = \min_{j,\, j \neq i} \left[ \frac{1}{n_j} \sum_{y \in C_j} d(x, y) \right].

(2) Calinski-Harabasz index (CH) (Calinski & Harabasz, 1974). This measures between-cluster isolation and within-cluster coherence, based on the average between- and within-cluster sums of squares. The maximum value determines the optimal clustering configuration.

CH = \frac{\sum_i n_i \, d^2(c_i, c) / (k - 1)}{\sum_i \sum_{x \in C_i} d^2(x, c_i) / (N - k)}    (7.2)

(3) Dunn's index (Liu et al., 2010). This defines inter-cluster separation as the minimum pairwise distance between objects in different clusters and intra-cluster compactness as the maximum over clusters of the largest distance between a pair of objects in the same cluster. Multiple variations have been proposed in the literature. A maximum value is optimal. The standard definition was utilized (a minimal implementation of this index and of the Xie-Beni index is sketched after this list):

DI = \min_i \left\{ \min_j \left( \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_k \left( \max_{x, y \in C_k} d(x, y) \right)} \right) \right\}    (7.3)

(4) Xie-Beni index (XB) (Liu et al., 2010). This defines inter-cluster separation as the minimum square distance between cluster centers and intra-cluster compactness as the mean square distance between each data object and its cluster center. A minimum value indicates an optimal result.

XB = \frac{\sum_i \sum_{x \in C_i} d^2(x, c_i)}{N \cdot \min_{i,\, j \neq i} d^2(c_i, c_j)}    (7.4)

(5) Davies-Bouldin index (DB) (Bolshakova & Azuaje, 2003). This measures the average similarity between each cluster and its most similar cluster. The index computes the dispersion of a cluster and a dissimilarity measure between pairs of clusters. A lower DB index implies a better cluster configuration.

DB = \frac{1}{k} \sum_i \max_{j,\, j \neq i} \left\{ \left[ \frac{1}{n_i} \sum_{x \in C_i} d(x, c_i) + \frac{1}{n_j} \sum_{x \in C_j} d(x, c_j) \right] \bigg/ \; d(c_i, c_j) \right\}    (7.5)
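Since Dunn's and Xie-Beni indices are not distributed with common Python libraries, the following minimal NumPy/SciPy sketch implements Eqs. (7.3) and (7.4) directly. Euclidean distance is assumed for d(x, y), and labels is an array of integer cluster assignments; this is an illustrative implementation, not the exact code used for the results in this chapter.

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    ids = np.unique(labels)
    clusters = [X[labels == i] for i in ids]
    # denominator of Eq. (7.3): largest within-cluster distance over all clusters
    max_diam = max(cdist(c, c).max() for c in clusters)
    # numerator of Eq. (7.3): smallest distance between points of different clusters
    min_sep = min(cdist(clusters[i], clusters[j]).min()
                  for i in range(len(clusters))
                  for j in range(i + 1, len(clusters)))
    return min_sep / max_diam

def xie_beni_index(X, labels):
    ids = np.unique(labels)
    centers = np.array([X[labels == i].mean(axis=0) for i in ids])
    # compactness: squared distance of each point to its own cluster center
    compact = sum(((X[labels == i] - centers[n]) ** 2).sum()
                  for n, i in enumerate(ids))
    # separation: minimum squared distance between any two cluster centers
    sep = min(((centers[i] - centers[j]) ** 2).sum()
              for i in range(len(ids)) for j in range(len(ids)) if i != j)
    return compact / (len(X) * sep)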


7.3.2 External validation indices

As mentioned in Section 7.2, the external validation indices (Jain, 2010) are used for evaluating the quality of the clustering results as ranked by the internal CVIs. A brief description of the three external CVIs is presented as follows. Note that all three indices range between 0 and 1; a higher value denotes a better result.

(1) Clustering Accuracy: This metric is also known as classification accuracy or the inverse of the clustering error. Since it is an external metric, it is assumed that the ground truth label of each point is known. Clustering accuracy can be defined as the percentage of correctly assigned data points in each partition over the entire sample size N. According to (Brun et al., 2007), the error of a clustering algorithm is the expected difference between its labels and the labels generated by the labeled point process (in this context, the known ground truth). Using this approach, clustering accuracy can be formally defined as follows. Let D^{A_i} denote the labeling of a dataset D given by a clustering algorithm A_i, while D^P is the ground truth labeling. Let L^{A_i}(D, x) and L^P(D, x) denote the label of x \in D for D^{A_i} and D^P, respectively. The label accuracy (7.6) between the clustering label and the ground truth label is the proportion of points that have the same label:

\varepsilon(D^P, D^{A_i}) = \frac{\left| \{ x : L^P(D, x) = L^{A_i}(D, x) \} \right|}{N}    (7.6)

Since the agreement or disagreement between two partitions is independent of the indices used to label their clusters, the partition accuracy (7.7) is defined as:

\varepsilon^*(D^P, D^{A_i}) = \max_{\pi} \, \varepsilon(D^P, \pi D^{A_i})    (7.7)

where the maximum is taken over all possible permutations \pi of the k clusters in D^{A_i}. The clustering accuracy is the inverse of the partition error defined in (Brun et al., 2007). A greedy approach is utilized in this work to maximize the label accuracy over all the partitions.

(2) Adjusted Rand Index (ARI): This is an extension of the Rand index that corrects for chance. The Rand index measures the agreement between the true clustering and the predicted clustering; the ARI normalizes it so that the expected value is 0 when the clusters are selected by chance and 1 when a perfect match is achieved. Let a denote the number of pairs belonging to the same class in the true clustering (PT) and to the same cluster in the predicted clustering (PP); b: the number of pairs belonging to the same class in PT and to different clusters in PP; c: the number of pairs belonging to different classes in PT and to the same cluster in PP; d: the number of pairs belonging to different classes in PT and to different clusters in PP. Then, the ARI is given by (7.8):

ARI = \frac{a - \frac{(a + c)(a + b)}{M}}{\frac{(a + c) + (a + b)}{2} - \frac{(a + c)(a + b)}{M}}    (7.8)

where M = a + b + c + d = N(N - 1)/2. Analogous adjustments can be made for other clustering indices, but due to its simplicity, the Adjusted Rand Index is the most popular among them as of this writing.

(3) Jaccard Index: This was introduced as an improvement over the original Rand index. It eliminates the term d in order to distinguish between good and bad partitions. It is defined by:

Jaccard Index = \frac{a}{a + b + c}    (7.9)
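The three external CVIs above can be computed as in the following sketch, which assumes ground-truth labels are available. The ARI comes from scikit-learn, the Jaccard index is derived from the pair counts a, b, c defined for Eqs. (7.8) and (7.9), and clustering accuracy uses an optimal one-to-one label matching via the Hungarian algorithm, a substitute for the greedy matching described above.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score   # Eq. (7.8), chance-adjusted

def contingency(y_true, y_pred):
    # Counts of points for each (class, cluster) pair.
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    return np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                     for c in classes])

def clustering_accuracy(y_true, y_pred):
    cont = contingency(y_true, y_pred)
    rows, cols = linear_sum_assignment(-cont)   # best class-to-cluster matching (Eq. 7.7)
    return cont[rows, cols].sum() / len(y_true)

def jaccard_index(y_true, y_pred):
    cont = contingency(y_true, y_pred)
    comb2 = lambda n: n * (n - 1) // 2
    a = comb2(cont).sum()                        # pairs together in both partitions
    b = comb2(cont.sum(axis=1)).sum() - a        # together in the ground truth only
    c = comb2(cont.sum(axis=0)).sum() - a        # together in the prediction only
    return a / (a + b + c)                       # Eq. (7.9)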

7.3.3 Statistical methods

An overview of the statistical methods employed in the assessment is presented here. Section 7.4 demonstrates how they are utilized to provide a meaningful evaluation.

- Spearman's rank correlation coefficient (Conover, 1999). This is a measure of association between two variables. It is calculated by ranking the data within the two variables and computing the Pearson correlation coefficient on the ranks. Spearman's correlation ranges in value from -1 to 1, with values near 1 indicating that the ranks for the two variables are similar and values near -1 indicating that the ranks are dissimilar. Spearman's correlation is used here to assess the agreement in ranks between internal and external validation indices. Since it is based on ranks, it is less sensitive to outliers than the Pearson correlation coefficient and can also measure the strength of any monotonic relationship, whereas Pearson captures only linear relationships. This robustness to outliers, the ability to capture more general types of relationships, and the natural interpretation of a rank correlation make it a more desirable metric than the Pearson correlation for evaluating clustering validation indices. A usage sketch is given after this list.

- Three-Factor Analysis of Variance (ANOVA). ANOVA modeling is employed to test for significant differences in average Spearman correlation values among the different datasets, algorithms and internal CVIs. For overall significant effects, pairwise comparisons are made to identify significant differences between groups within each of the factors, using Tukey's method to control the type I error for the multiple comparisons made.
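A small usage sketch of the rank-agreement computation follows, assuming internal holds one internal CVI's scores for the m clustering configurations of a dataset/algorithm pair and external holds the corresponding external CVI values (both oriented so that larger is better). The numbers below are hypothetical placeholders.

from scipy.stats import spearmanr

internal = [0.69, 0.56, 0.50, 0.37, 0.33, 0.35]   # hypothetical internal CVI values (k = 2..7)
external = [0.67, 0.89, 0.78, 0.61, 0.55, 0.52]   # hypothetical accuracy values for the same runs

rho, p_value = spearmanr(internal, external)
print(rho)   # values near +1 mean the two indices rank the partitions similarly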


7.4 Evaluation framework

The statistics-based evaluation framework applied for assessing the performance of internal CVIs is described as follows:

- Perform clustering on t real datasets (D = {D_1, D_2, ..., D_t}) using a varied set of r clustering algorithms (A = {A_1, A_2, ..., A_r}) with varying parameters to generate a set of m clustering configurations for each dataset D_i per algorithm A_j.
- For each dataset D_i, compute the values of the p internal and q external CVIs on the r x m set of clustering results.
- Obtain the Spearman correlation for each set of m clustering configurations per dataset D_i per algorithm A_j by comparing each internal CVI's results over the m configurations with the corresponding external CVI results (per index). This yields a q x p set of correlation values per algorithm A_j per dataset D_i.
- Conduct the 3-factor ANOVA to test for significant differences in the average Spearman correlation between different internal CVIs, algorithms and datasets. Follow up significant effects with Tukey's multiple comparison procedure to determine which groups are significantly different. Additionally, test the 2-way interactions to determine whether the effect of one factor depends on another factor. Follow up significant interactions with an interaction plot and an overall summary of Tukey pairwise comparisons.

For the results presented in this chapter, r = 6 algorithms were applied to cluster t = 14 datasets drawn from the UCI machine learning repository (Dua & Graff, 2017) and Kaggle datasets (Kaggle Inc, 2016), as described in Table 7.3. The datasets are listed in increasing order of assumed complexity, based on the increasing number of clusters followed by the number of features.

Table 7.3  Overview of real biological datasets.

Dataset                                   Tag     Clusters (k)   Features (Dim)   Sample size (N)
Haberman                                  D-01    2              3                306
Vertebral_2                               D-02    2              6                310
Pima indians diabetes*                    D-03    2              8                768
Indian liver patient database (ILPD)      D-04    2              10               597
Physical spine data*                      D-05    2              12               310
Parkinson disease                         D-06    2              22               195
Breast cancer (Wisconsin)                 D-07    2              30               569
Iris                                      D-08    3              4                150
Vertebral_3                               D-09    3              6                310
Seeds                                     D-10    3              7                210
Wine                                      D-11    3              13               178
Breast tissue                             D-12    6              9                106
Ecoli                                     D-13    8              7                336
Yeast                                     D-14    10             8                1484


The number of clusters, k, was selected as the parameter to vary. By varying k from 2 to 10, a set of m = 9 clustering configurations was generated per algorithm per dataset: thus, a total of 756 results. For the internal CVIs where a maximum value indicates an optimal clustering result (SI, CH, Dunn's), positive values of Spearman's correlation are expected if the internal and external CVIs are in agreement, and the closer that value is to 1, the stronger the agreement. However, for internal indices where a minimum value implies optimal clustering (XB, DB), values closer to -1 would indicate stronger agreement. To ensure the Spearman correlation values were comparable across all internal indices, the XB and DB index values are negated so that the maximum is best.

To assess whether there were statistically significant differences in the average Spearman correlations based on the different factors explored (datasets, algorithms and internal validation indices), a 3-factor ANOVA model (7.10) with all of the main effects and second-order interactions was conducted. The three-way interaction was not considered to be of direct interest for testing and was assumed to be insignificant. The model is given by:

Y_{ijk} = \mu + I_i + A_j + D_k + (IA)_{ij} + (ID)_{ik} + (AD)_{jk} + \varepsilon_{ijk}    (7.10)

where Y_{ijk} is the Spearman correlation between one of the external validation indices and internal validation index i, for algorithm j in dataset k. Here \mu denotes the overall average Spearman correlation, I_i is the effect for the internal CVI (i = 1, ..., 5), A_j is the effect for the algorithm (j = 1, ..., 6), D_k is the effect for the dataset (k = 1, ..., 14), and (IA)_{ij}, (ID)_{ik}, (AD)_{jk} are the interaction terms. The \varepsilon_{ijk} ~ N(0, \sigma^2) are independent and identically distributed error terms. F-tests are conducted for the overall mean differences between indices, algorithms and datasets. There is also testing to determine whether the effect of one factor depends on another factor (interactions). Pairwise comparisons are made to identify overall significant differences between groups within each of the factors, using Tukey's method to control the type I error for the multiple comparisons made. The effectiveness of applying majority voting across the values of the internal CVIs for each dataset, across all the clustering configurations obtained for the six algorithms, was also assessed.
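A minimal sketch of fitting the model in Eq. (7.10) with statsmodels is shown below. The long-format table df (columns cvi, algorithm, dataset, spearman) is filled with random placeholder values purely so the snippet runs; in the actual analysis each row would hold one of the Spearman correlations described above. Tukey pairwise comparisons for a single factor are illustrated with pairwise_tukeyhsd.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder data frame: one Spearman correlation per (CVI, algorithm, dataset).
rng = np.random.default_rng(0)
cvis = ['SI', 'DB', 'CH', 'XB', 'Dunn']
algs = ['AF', 'H-Ward', 'Kmeans', 'Spectral']
dsets = [f'D{i:02d}' for i in range(1, 8)]
df = pd.DataFrame([(c, a, d, rng.uniform(-1, 1))
                   for c in cvis for a in algs for d in dsets],
                  columns=['cvi', 'algorithm', 'dataset', 'spearman'])

# Main effects plus all second-order interactions, as in Eq. (7.10).
model = smf.ols('spearman ~ C(cvi) + C(algorithm) + C(dataset)'
                ' + C(cvi):C(algorithm) + C(cvi):C(dataset)'
                ' + C(algorithm):C(dataset)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))               # F-tests for each effect

# Tukey HSD comparisons among the internal CVIs (one factor at a time).
print(pairwise_tukeyhsd(df['spearman'], df['cvi']))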

7.5 Experimental results and analysis

This section presents the results of the evaluation of the 5 internal CVIs on the 14 real datasets. Six different algorithms (Jain, 2010; Xu & Wunsch, 2009), namely affinity propagation (AF), K-means, spectral clustering, and agglomerative hierarchical clustering (HC) with 3 different linkage methods (complete, average and Ward), were applied to the real datasets to conduct a detailed and robust comparison of the varied internal CVIs. The K-means and hierarchical algorithms were implemented using the Python scikit-learn library (Pedregosa et al., 2012), while spectral clustering was carried out in Matlab using the Shi & Malik algorithm (Von Luxburg, 2007). The AF algorithm was implemented using the Matlab CVAP toolbox (Wang, Wang, & Peng, 2009).


Fig. 7.2 provides the average Spearman correlation values by algorithm for each of the internal and external CVIs. Notice that CH provides positive correlations on average with Accuracy and ARI, but the results are varied for Jaccard. DB, SI and XB perform reasonably well with Accuracy and ARI for all but two of the hierarchical methods (average and complete linkage). Dunn's provides the worst performance, with many negative correlations, especially with Accuracy and ARI. Note that while Jaccard has been used in previous studies, it is not as effective in capturing the variation that appears to exist in the internal CVIs' results, as illustrated by Accuracy and ARI. Hence, the Jaccard index was excluded from the subsequent analyses.

Fig. 7.3 illustrates the average Spearman correlation values by dataset for each of the internal CVIs with Accuracy and ARI. Both Accuracy and ARI indicate that there is considerable variation in the average correlations by internal CVI and dataset. Dunn's again performs the worst, with varying results for the other internal CVIs. All indices have low average correlations on dataset D14, which has the highest perceived degree of complexity among all the datasets explored, as indicated by its large number of clusters and sample size (Table 7.3).

The 3-factor ANOVA using the Spearman correlation between the internal CVIs and Accuracy was fit to further explore the differences in means for the factors of interest (internal CVIs, algorithms, datasets and their second-order interactions). All factor effects are significant at the α = 0.05 significance level, with the exception of the Internal CVI*Algorithm interaction (Fig. 7.5A), which is not significant. From the main effect plot for the internal CVI (Fig. 7.4A), DB, SI and XB have the highest average correlations. CH has the second to lowest average (r_s = 0.238); it is significantly different from DB (p = 0.0002), but not from SI or XB (p > 0.05). Dunn's has the lowest average (r_s = -0.169), and it is significantly different from all the other internal CVIs (p < 0.0001 for all comparisons). It is the only CVI with a negative average.

It can be observed from the main effect plot for the algorithm (Fig. 7.4B) that, when averaged over the datasets and internal CVIs, the spectral algorithm has the highest average Spearman correlation (r_s = 0.442), followed by AF (r_s = 0.415) and K-means (r_s = 0.336). There is no significant difference between these algorithms (p > 0.05). Among the hierarchical methods, Ward's had the next highest average (r_s = 0.224), followed by complete linkage (r_s = 0.120), with no significant difference between them (p > 0.05).

FIG. 7.2 Results by algorithm. Bar plots of the average Spearman correlation between each of the 5 internal CVIs (CH, DB, Dunn's, SI, XB) and 3 external CVIs (Accuracy, Adjusted Rand Index, Jaccard) for each of the 6 algorithms (AF, H-Average, H-Complete, H-Ward, K-means, Spectral). Averages were taken over the 14 datasets tested. Error bars represent one standard error above or below the mean. The Jaccard index conveyed the least information about the variations of the indices; thus, those results were excluded from the remaining analyses.


FIG. 7.3 Results by dataset. Bar plots of the average Spearman correlation between each of the 5 internal CVIs (CH, DB, Dunn's, SI, XB) and 2 external CVIs (Accuracy, Adjusted Rand Index) for each of the 14 datasets (see Table 7.3 for tag descriptions). Averages were taken over the 6 algorithms tested. Error bars represent one standard error above or below the mean. As can be observed, all the indices performed badly toward the tail end of dataset difficulty (to the right).


FIG. 7.4 Main effect plots. Plots of the means for the main effects (A) Internal CVI, (B) Algorithm, and (C) Dataset of the 3-factor ANOVA. The average Spearman correlation, along with standard error bars, is given for each level within each factor.

The average-linkage method had the lowest average Spearman correlation (r_s = -0.043). It is significantly different (p < 0.0001) from all of the other methods except complete linkage (p = 0.0553).

For the dataset main effect (Fig. 7.4C), averaging is conducted over internal CVIs and algorithms. The dataset with the highest average Spearman correlation is D02 (r_s = 0.572). Several other datasets have a similar average Spearman correlation and were not statistically different from D02. The most complex dataset (D14) had a significantly lower average (r_s = -0.406) than all of the others (p < 0.05 for all comparisons).

The interaction plot for the Internal CVI*Algorithm interaction is given in Fig. 7.5A. Since this interaction is not significant, no pairwise comparisons were conducted. It is evident from the plot that the lines are roughly parallel (with some small, but insignificant, exceptions). Additionally, Dunn's has the lowest average across all algorithms, and the other four internal CVIs are all similar and not statistically different for each algorithm.

Fig. 7.5B illustrates the interaction plot for the Internal CVI*Dataset interaction. Note that the lines intersect in many places and are not parallel, which is indicative of a significant interaction. This implies that comparisons between the internal CVIs depend on the dataset. Although pairwise comparisons were performed, there are too many to discuss efficiently, so this work focuses on a few noteworthy observations.


FIG. 7.5 Interaction plots. (A) Plot of means for the Internal CVI*Algorithm interaction and (B) the Internal CVI*Dataset interaction of the 3-factor ANOVA. The average Spearman correlation for each algorithm and dataset is given, with separate lines for the 5 internal CVIs (CH, DB, Dunn's, SI, XB).

Dunn's has the lowest average Spearman correlation across most datasets, but there are some exceptions, most notably in the datasets with more clusters (D12, D13, D14). For most of the datasets, the other 4 internal CVIs are similar and not significantly different, with a few exceptions. In D04 and D13, there is no significant difference among any of the internal CVIs. Datasets D06 and D07 exhibit a different pattern than the others; it is interesting that these two datasets have the largest numbers of features. For the datasets with the largest numbers of features (D06, D07, D11) and the largest numbers of clusters (D12, D13, D14), the internal CVIs vary more in the ordering of the average Spearman correlation.

An evaluation of the majority voting across all the indices is provided in Table 7.4. Usually, three of the five indices (excluding CH and Dunn's) agreed on the same configuration. The voting fell short of the optimal result across the 6 algorithms, as indicated by the highest accuracy value, in all cases except D-14 (yeast), the most complex dataset with 10 clusters and 8 features. Plots of the ground truth and the selected clustering configuration are shown in Fig. 7.6. As can be observed from the visualization, the H-complete 2-cluster result (Fig. 7.6A) is clearly a suboptimal scheme when compared to the ground truth clusters, but this is not detected by the CVIs. A more viable path for designing a robust cluster validation paradigm should include visualization of the resulting clusters for visual validation by the human user. See Chapters 8 and 9 for further discussion of visualization methods and tools.


Table 7.4  Assessment of the majority voting scheme per dataset.

Data   Algorithm (majority internal CVIs choice)   Choice vs. actual k   Votes   Accuracy @ choice   Highest accuracy across all algorithms
D-01   H-average                                   2/2                   3/5     0.843               0.935
D-02   H-complete/H-average                        2/2                   3/5     0.887               0.968
D-03   H-complete/H-average                        2/2                   3/5     0.846               0.940
D-04   H-complete                                  2/2                   3/5     0.801               0.955
D-05   Affinity                                    2/2                   4/5     0.952               0.977
D-06   H-ward                                      2/2                   3/5     0.810               0.892
D-07   H-complete/H-average                        2/2                   2/5     0.822               0.902
D-08   Spectral/H-ward/H-average                   2/3                   3/5     0.900               0.953
D-09   H-complete/H-average                        2/3                   3/5     0.877               0.965
D-10   Affinity                                    2/3                   2/5     0.943               0.957
D-11   H-ward/H-average                            2/3                   2/5     0.669               0.994
D-12   Affinity/k-means/all HC methods             2/6                   4/5     0.689               0.981
D-13   No consensus                                -/8                   n/a     n/a                 0.929
D-14   H-complete/H-average                        2/10                  3/5     0.877               0.877
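The majority-voting baseline assessed in Table 7.4 reduces to a few lines of code: each internal CVI votes for the configuration it ranks best, and the configuration with the most votes is selected. The scores dictionary below holds hypothetical values for three candidate configurations (with DB and XB already negated so that larger is better, as described in Section 7.4).

from collections import Counter
import numpy as np

scores = {                         # hypothetical values for 3 candidate configurations
    'SI':   [0.62, 0.48, 0.37],
    'CH':   [410.2, 455.7, 380.1],
    'DB':   [-0.9, -1.4, -2.1],    # negated Davies-Bouldin
    'XB':   [-0.1, -0.3, -0.6],    # negated Xie-Beni
    'Dunn': [0.21, 0.34, 0.18],
}

votes = Counter(int(np.argmax(v)) for v in scores.values())   # each CVI votes for its best
choice, n_votes = votes.most_common(1)[0]
print(f'configuration {choice} selected with {n_votes}/{len(scores)} votes')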

7.6 Ensemble validation paradigm

In varied cluster analysis applications, one metric is usually selected for identifying the optimal set of partitions. Sometimes more than one metric is used, and the results are combined by simple majority voting (i.e. the clustering output is selected based on the most popular choice among the metrics). Applying basic majority voting across multiple internal CVIs assumes that these metrics can be evaluated in the same domain value space, yet each metric views the task of determining the optimal clustering configuration from a different perspective (see Section 7.3.1). To effectively combine the strengths of each metric, an ensemble cluster validation paradigm, as discussed in (Nguyen, Nowell, Bodner, & Obafemi-Ajayi, 2018), is proposed. The objective of the ensemble cluster validation method is to leverage the strengths of the diverse metrics by utilizing aggregated ranks to determine the optimal clustering for a given dataset. The ensemble method selects the top result with the highest aggregated rank for further domain-specific analysis. The framework can be applied to agglomerate any number m of internal CVIs deemed useful for the specific cluster analysis. By applying one or multiple clustering algorithms to a given dataset and varying the set of parameters, a set of n possible clustering configuration outputs (D_i, i = 1, ..., n) is obtained. Each CVI ranks each obtained clustering output D_i based on its own performance criterion. For each CVI, the top r most optimal clustering outputs from the set of D_i are selected and assigned an adjusted score based on their rank. For a given CVI, C_j, its best (i.e. highest-ranked) output is assigned a score of r.


FIG. 7.6 Visualization of the yeast data using PCA to illustrate the poor clustering configuration obtained from the highly ranked method. (A) Clustering result of the H-complete 2-cluster configuration of the yeast data. (B) Actual 10-cluster configuration of the yeast data.


Likewise, the second highest ranked output is assigned a score of r - 1, and the third a score of r - 2. The last of the r best-performing outputs, i.e. the r-th best, is assigned a score of 1. Any D_i that is not part of the top r is assigned a score of 0. The final weighted score W_i of each clustering output D_i is the sum of its scores from each CVI. The final rank of the weighted ensemble validation scores is subsequently applied to determine the optimal scheme, with the maximum value being the most optimal. For further discussion of this ensemble validation method, including its application in a biomedical data analysis of phenotype data, see (Nguyen et al., 2018).
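The scoring scheme just described can be sketched in a few lines of Python. The function below is an illustrative reimplementation, not the authors' code from Nguyen et al. (2018), and it assumes each CVI's values have already been oriented so that larger is better.

import numpy as np

def ensemble_scores(cvi_values, r):
    """cvi_values: dict mapping CVI name -> array of n values (larger = better);
    returns the weighted ensemble score W_i for each of the n clustering outputs."""
    n = len(next(iter(cvi_values.values())))
    weighted = np.zeros(n)
    for values in cvi_values.values():
        order = np.argsort(values)[::-1]            # outputs ranked best to worst
        for rank, idx in enumerate(order[:r]):      # top r get scores r, r-1, ..., 1
            weighted[idx] += r - rank
    return weighted

# Hypothetical values for 4 candidate clustering outputs and 3 CVIs.
cvi_values = {'SI': np.array([0.41, 0.55, 0.38, 0.47]),
              'CH': np.array([320.0, 305.5, 290.1, 333.3]),
              'DB': -np.array([1.3, 0.9, 1.8, 1.1])}    # negated so larger is better

W = ensemble_scores(cvi_values, r=3)
print(W, 'optimal output index:', int(np.argmax(W)))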

7.7 Summary

This chapter presents the preliminary results of a detailed evaluation of 5 commonly used internal CVIs, assessing their performance on real biological datasets with available ground truth using rigorous statistical methods. An evaluation framework is presented for evaluating the quality of the partitions selected by the CVIs, not just the optimal number of clusters. Studying the performance of CVIs in determining the underlying structure of known benchmark datasets currently used to evaluate clustering methods will advance knowledge of the performance of CVIs and aid in understanding their selection and use. As demonstrated in this work, there is no single universal internal CVI that works best across real datasets, and a majority voting approach was also demonstrated to be not entirely effective. Thus, an open area of research is to identify effective metrics to assess clustering results, including investigating how different metrics deal with clusters that are outliers.

References

Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243-256. https://doi.org/10.1016/j.patcog.2012.07.021.

Bailey, J. (2013). Alternative clustering analysis: A review. In C. C. Aggarwal, & C. K. Reddy (Eds.), Data clustering: Algorithms and applications (1st ed., pp. 533-548). Taylor & Francis.

Bolshakova, N., & Azuaje, F. (2003). Cluster validation techniques for genome expression data. Signal Processing, 83(4), 825-833. https://doi.org/10.1016/S0165-1684(02)00475-9.

Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., et al. (2007). Model-based evaluation of clustering validation measures. Pattern Recognition, 40(3), 807-824. https://doi.org/10.1016/j.patcog.2006.06.026.

Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27. https://doi.org/10.1080/03610927408827101.

Clifford, H., Wessely, F., Pendurthi, S., & Emes, R. D. (2011). Comparison of clustering methods for investigation of genome-wide methylation array data. Frontiers in Genetics, 2, 88. https://doi.org/10.3389/fgene.2011.00088.

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York, NY: Wiley.

Dua, D., & Graff, C. (2017). UCI machine learning repository. Retrieved from University of California, School of Information and Computer Science website: http://archive.ics.uci.edu/ml.


Dubes, R. C. (1987). How many clusters are best? An experiment. Pattern Recognition, 20(6), 645-663. https://doi.org/10.1016/0031-3203(87)90034-3.

Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3), 107-145. https://doi.org/10.1023/A:1012801612483.

Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666. https://doi.org/10.1016/j.patrec.2009.09.011.

Kaggle Inc. (2016). Kaggle - your home for data science. Retrieved March 8, 2018, from https://www.kaggle.com/.

Kovács, F., Legány, C., & Babos, A. (2005). Cluster validity measurement techniques. Proceedings of the 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2006, 1-11. https://doi.org/10.7547/87507315-91-9-465.

Liu, Y., Li, Z., Xiong, H., Gao, X., & Wu, J. (2010). Understanding of internal clustering validation measures. In IEEE International Conference on Data Mining (pp. 911-916). https://doi.org/10.1109/ICDM.2010.35.

Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J., & Wu, S. (2013). Understanding and enhancement of internal clustering validation measures. IEEE Transactions on Cybernetics, 43(3), 982-994. https://doi.org/10.1109/TSMCB.2012.2220543.

Maulik, U., & Bandyopadhyay, S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12), 1650-1654. https://doi.org/10.1109/TPAMI.2002.1114856.

Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a dataset. Psychometrika, 50, 159-179. https://doi.org/10.1007/BF02294245.

Nguyen, T., Nowell, K., Bodner, K. E., & Obafemi-Ajayi, T. (2018). Ensemble validation paradigm for intelligent data analysis in autism spectrum disorders. In 2018 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) (pp. 1-8). https://doi.org/10.1109/CIBCB.2018.8404960.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2012). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(C), 53-65. https://doi.org/10.1016/0377-0427(87)90125-7.

Saria, S., & Goldenberg, A. (2015). Subtyping: What it is and its role in precision medicine. IEEE Intelligent Systems, 30(4), 70-75. https://doi.org/10.1109/MIS.2015.60.

Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528. https://doi.org/10.1198/106186005X59243.

Vendramin, L., Campello, R. J. G. B., & Hruschka, E. R. (2010). Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4), 209-235. https://doi.org/10.1002/sam.10080.

Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416. https://doi.org/10.1007/s11222-007-9033-z.

Wang, K., Wang, B., & Peng, L. (2009). CVAP: Validation for cluster analyses. Data Science Journal, 8, 88-93. https://doi.org/10.2481/dsj.007-020.

Xu, R., & Wunsch, D. C., II. (2009). Clustering. IEEE Press/Wiley. https://doi.org/10.1002/9780470382776.

8 Data visualization

8.1 Introduction

Data visualization is a key component of data analytics, despite the rapidly increasing amount of data that needs to be analyzed and mined for knowledge inference. According to Keim (2002), for data mining to be effective, it is important to include a human in the data exploration process. Hence, the challenge is to present the data in some visual form that allows the human to seamlessly gain insight, draw conclusions, and directly interact with the data. This is especially useful when little is known about the data and the exploration goals are vague, as in unsupervised learning. Data visualization offers an efficient means of representing the distributions and structures of datasets in a way that allows the data to be easily understood and creates a strong visual impact. It can also reveal hidden patterns in the data (Chen, Guo, & Wang, 2015). The strength of visualization techniques for data mining lies in the successful dynamic and interactive integration of human capabilities into an intuitive visual interface. In biomedical applications, data visualization methods make it easier for domain experts to be involved in the biological validation of the model, thus ensuring a more effective data-driven approach. This is especially challenging in big data analysis, where characteristics such as high dimensionality and multiple data sources and/or formats further limit the application of known standard techniques. Nevertheless, there has been considerable effort in the data mining and graph mining research communities to provide solutions for effective data exploration, such as (Chen et al., 2015; Holzinger & I., 2014; Myatt & Johnson, 2011; Ward, Grinstein, & Keim, 2015; Zhu, Heng, & Teow, 2017).

This chapter provides a brief overview of standard visualization techniques based on data transformation. According to Liu et al. (2017), data transformation methods can be grouped into four categories: dimensionality reduction, topological data analysis, regression analysis, and subspace clustering. This chapter primarily focuses on the dimensionality reduction and topological data analytic methods used to visualize data. Additionally, neural network-based visualization methods suited for deep learning techniques are briefly described. Lastly, this chapter introduces recent state-of-the-art methods designed to promote interactive and integrated machine learning.

8.2 Dimensionality reduction methods

Dimensionality reduction is one of the fundamental techniques for analyzing and visualizing high-dimensional datasets (Liu et al., 2017). Projecting high-dimensional data into fewer dimensions is a core problem of machine learning and data mining (Tang, Liu, Ming, & Mei, 2016). According to Tang et al. (2016), the main idea is to preserve the intrinsic structure of the high-dimensional data in the low-dimensional space. An overview of both linear and nonlinear techniques commonly used in biomedical applications is presented here.

8.2.1 Linear projection algorithms

8.2.1.1 Principal component analysis

Principal component analysis (PCA), or the Karhunen-Loève transformation, is one of the best-known dimensionality reduction approaches. It constructs a linear combination of a set of vectors that can best describe the variance of the data (Duda, Hart, & Stork, 2000; Jolliffe, 1986). Given a set of N d-dimensional input patterns {x_1, ..., x_i, ..., x_N}, each pattern can be written as a linear combination of a set of d orthonormal vectors,

x_i = \sum_{j=1}^{d} c_{ij} v_j.

PCA approximates the data by a linear l-dimensional subspace based on the squared error criterion,

J = \sum_{i=1}^{N} \| x_i - x_i' \|^2,

where x_i' is an approximation of x_i in which only a subset of l vectors v_j is kept, represented as

x_i' = \sum_{j=1}^{l} c_{ij} v_j + \sum_{j=l+1}^{d} e_j v_j,

where the e_j are constants used to replace the original coefficients c_{ij} (l+1 <= j <= d). Using the Lagrange optimization method, the minimization of J is obtained when the vectors v_j are eigenvectors of the scatter matrix, S v_j = \lambda_j v_j, where the \lambda_j are the corresponding eigenvalues. The scatter matrix S is defined as

S = \sum_{i=1}^{N} (x_i - m)(x_i - m)^T,

given the mean vector m = \frac{1}{N} \sum_{i=1}^{N} x_i. The resulting minimum error criterion function is

J_{min} = \sum_{j=l+1}^{d} \lambda_j.

This indicates that the minimum error can be obtained by retaining the l largest eigenvalues and the corresponding eigenvectors of S. The retained eigenvectors are called the principal components. PCA thus calculates the eigenvectors that minimize the sum of squared errors in approximating the input patterns. Multiple variations of PCA suggested in the literature are reviewed in (Xu & Wunsch, 2009). The interactive PCA (iPCA) system (Jeong, Zeimeiwicz, Fisher, Ribarskey, & Chang, 2009) provides a framework to visualize the results of PCA using multiple coordinated views. The goal is to aid the user in understanding both the PCA process and the dataset (Liu, Maljovec, Wang, Bremer, & Pascucci, 2017).
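The PCA projection described above can be written directly from the scatter-matrix eigendecomposition, as in the short NumPy sketch below (sklearn.decomposition.PCA provides an equivalent, more robust implementation). The random data matrix is only a placeholder.

import numpy as np

def pca_project(X, l):
    """Project the N x d data matrix X onto its first l principal components."""
    m = X.mean(axis=0)                       # mean vector m
    Xc = X - m
    S = Xc.T @ Xc                            # scatter matrix S
    eigval, eigvec = np.linalg.eigh(S)       # eigenvalues in ascending order
    V = eigvec[:, ::-1][:, :l]               # l eigenvectors with the largest eigenvalues
    return Xc @ V                            # N x l projected coordinates

X = np.random.default_rng(0).normal(size=(100, 5))   # placeholder data
Y = pca_project(X, 2)                                 # 2-D coordinates for plotting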

8.2.1.2 Independent component analysis

PCA is appropriate for Gaussian distributions because it relies on second-order relationships stored in the scatter (covariance) matrix. Other linear transformations, such as independent component analysis (ICA) and projection pursuit, consider higher-order statistical information and therefore are used for non-Gaussian distributions (Cherkassky & Mulier, 1998; Hyvärinen, 1999; Jain, Duin, & Mao, 2000). The basic goal of ICA is to find the components that are most statistically independent from each other (Hyvärinen, Karhunen, & Oja, 2001; Hyvärinen & Oja, 2000; Jutten & Herault, 1991). In the context of blind source separation, ICA aims to separate the independent source signals from the mixed observation signals. Given a d-dimensional random vector x, Hyvärinen (Hyvärinen & Oja, 2000) summarized three different formulations of ICA:

(1) General model. ICA seeks a linear transformation s = Wx, so that the s_i in the vector s = (s_1, ..., s_d)^T are as independent as possible, maximizing some independence measure function f(s_1, ..., s_d).
(2) Noisy ICA model. ICA estimates the model x = As + ε, where the components s_i in the vector s = (s_1, ..., s_l)^T are statistically independent from each other, A is a nonsingular d x l mixing matrix, and ε is a d-dimensional random noise vector.
(3) Noise-free ICA model. ICA estimates the model without considering noise, x = As.

The noise-free ICA model takes the simplest form among the three definitions, and most ICA research is based on this formulation (Hyvärinen & Oja, 2000). To ensure that the noise-free ICA model is identifiable, all the independent components s_i, with the possible exception of one component, must be non-Gaussian. Additionally, the constant matrix A must be of full column rank, and d must be no less than l, i.e., d >= l (Comon, 1994). The non-Gaussian condition is needed because, for normal random variables, statistical independence is equivalent to uncorrelatedness, and any decorrelating representation would generate independent components, which causes ICA to be ill-posed from a mathematical point of view. A situation in which the components s_i are nonnegative was discussed in (Plumbley, 2003).

The estimation of the ICA model consists of two major steps: the construction of an objective function, or contrast function, and the development and selection of an optimization algorithm for maximizing or minimizing that objective function. Hyvärinen (1999) reviewed many such objective functions and their corresponding optimization algorithms. For example, mutual information between the components is regarded as the "most satisfying" objective function in the estimation of the ICA model (Hyvärinen et al., 2001); the minimization of such an objective function can be achieved through a gradient descent method.
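As a usage sketch, the noise-free model x = As can be estimated with the FastICA implementation in scikit-learn; the two synthetic sources and the mixing matrix below are arbitrary illustrations, not data from this book.

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent source signals
A = np.array([[1.0, 0.5],                          # illustrative mixing matrix
              [0.5, 2.0]])
X = S @ A.T                                        # observed mixtures, x = As

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)    # estimated independent components
W = ica.components_             # estimated unmixing matrix, s = Wx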

8.2.2 Nonlinear projection algorithms

Nonlinear projection algorithms are also referred to as manifold learning. They allow the capture of more complex structures, but the resulting embedding can be difficult to interpret. Multidimensional scaling is a nonlinear projection method that fits the original multivariate data into a low- (usually two- or three-) dimensional structure while tending to maintain the proximity information as much as possible (Borg & Groenen, 1997). Thus, data points that are close in the original feature space should also be near each other in the projected space. Nonlinear dimension reduction can occur in either a metric or a nonmetric setting (Liu et al., 2017). Multiple graph-based techniques have been proposed that handle metric inputs, such as Isomap (Tenenbaum, De Silva, & Langford, 2000), Locally Linear Embedding (LLE) (Roweis & Saul, 2000) and Laplacian Eigenmap (LE) (Belkin & Niyogi, 2003); this work reviews Isomap. Nonmetric MDS refers to the class of nonlinear projection methods that minimize the mapping error directly through iterative optimization (Liu et al., 2017) to capture nonmetric dissimilarities. Nonmetric MDS includes t-Distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten & Hinton, 2008) as well as LargeVis (Tang et al., 2016), which are also summarized briefly.

8.2.2.1 Isomap

The isometric feature mapping (Isomap) algorithm is a local version of multidimensional scaling (MDS) that tends to explore more complicated nonlinear structures in the data (Tenenbaum et al., 2000). Isomap is interested in estimating the geodesic distances between all pairs of data points, which are the shortest paths between the points on a manifold and provide the best representation of the intrinsic geometry of the data. In order to calculate the geodesic distances, Isomap first constructs a symmetric neighborhood graph by connecting a pair of points x_i and x_j if x_i is one of the K nearest neighbors of x_j, or if x_i is in the ε-neighborhood of x_j. The graph is also weighted using the Euclidean distance between neighboring points. The geodesic distances are then calculated as the shortest paths along the edges using an algorithm like Floyd's algorithm (Tenenbaum et al., 2000). Let D(x_i, x_j) represent the Euclidean distance between points x_i and x_j. The geodesic distance D_G(x_i, x_j) is initially set as

D_G(x_i, x_j) = \begin{cases} D(x_i, x_j) & \text{if } x_i \text{ and } x_j \text{ are connected} \\ \infty & \text{otherwise} \end{cases}    (8.1)

and further updated, for each data point x_k, k = 1, ..., N in turn, as

D_G(x_i, x_j) = \min\big( D_G(x_i, x_j),\; D_G(x_i, x_k) + D_G(x_k, x_j) \big).    (8.2)

With the obtained geodesic distance matrix G = {D_G(x_i, x_j)}, MDS can be applied to embed the data in a lower-dimensional space. Let F be the new space with E = {D(x_i', x_j')} as the Euclidean distance matrix, where x_i' and x_j' are the corresponding points. The criterion function for seeking an optimal configuration of these points is defined as

J = \| \tau(G) - \tau(E) \|_{L^2},

where \| G \|_{L^2} = \sqrt{ \sum_{i,j} g_{ij}^2 } and \tau is an operator converting distances to inner products. The minimization of the criterion function is achieved when the l-th component of the coordinates x_i' is set using the l-th eigenvalue \lambda_l (in decreasing order) and the i-th component of the l-th eigenvector v_l^i of the matrix \tau(G):

x_{il}' = \sqrt{\lambda_l} \, v_l^i.
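In practice the construction above (neighborhood graph, shortest-path geodesic distances, then MDS) is available as sklearn.manifold.Isomap. A brief usage sketch on the scikit-learn digits data is shown below; the dataset and parameter values are illustrative.

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X = load_digits().data                             # 64-dimensional digit images
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)                             # (n_samples, 2) coordinates for plotting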


8.2.2.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE (van der Maaten & Hinton, 2008) is based on the SNE (Hinton & Roweis, 2003) method. Given a set of high-dimensional data points x_1, ..., x_n, the similarity of data point x_j to x_i is the conditional probability p_{j|i} that x_i would pick x_j as its neighbor. For nearby points, the conditional probability is high, and for far away points, it is close to zero. The conditional probability is given by:

p_{j|i} = \frac{\exp\left( - \| x_i - x_j \|^2 / 2\sigma_i^2 \right)}{\sum_{k \neq l} \exp\left( - \| x_l - x_k \|^2 / 2\sigma_i^2 \right)}    (8.3)

Note that p_{i|i} = 0 and \sigma_i^2 is the variance of the Gaussian. t-SNE performs a binary search for the \sigma_i^2 that produces a perplexity P(i) specified by the user, such that P(i) = 2^{H(i)}, where H(i) = -\sum_j p_{j|i} \log_2 p_{j|i} is the Shannon entropy. The goal is to discover a low-dimensional representation y_i of each data point x_i. The similarity of this representation can be defined analogously, using a heavy-tailed t-distribution for the low-dimensional points. The conditional probability is given by:

q_{j|i} = \frac{\left( 1 + \| y_i - y_j \|^2 \right)^{-1}}{\sum_{k \neq l} \left( 1 + \| y_l - y_k \|^2 \right)^{-1}}    (8.4)

A key observation is that if y correctly models x, then the difference between p_{j|i} and q_{j|i} will be minimal. Motivated by this observation, t-SNE aims to find a low-dimensional data representation that minimizes the Kullback-Leibler divergence between the two probability distributions:

C = KL(P \| Q) = \sum_i \sum_j p_{i,j} \log\left( \frac{p_{i,j}}{q_{j|i}} \right)    (8.5)

The definition of the similarity of x_j to x_i can be problematic when x_i is an outlier. For such an outlier, the values of p_{j|i} are extremely small for all j, so the location of its map point y_i has little effect on the loss function. The problem is addressed by defining p_{i,j} to be the symmetrized conditional probability:

p_{i,j} = \frac{p_{j|i} + p_{i|j}}{2n}    (8.6)

This ensures that \sum_j p_{i,j} > \frac{1}{2n} for all data points x_i.

Given the definitions of p_{i,j} and q_{i,j}, the gradient of C with respect to y can be utilized for optimization:

\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{i,j} - q_{i,j})(y_i - y_j)\left( 1 + \| y_i - y_j \|^2 \right)^{-1}    (8.7)


Mathematically, the gradient update with momentum α(t) and learning rate η is given by:

Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\left( Y^{(t-1)} - Y^{(t-2)} \right)    (8.8)

The current gradient is added to an exponentially decaying sum of previous gradients to determine the changes in the coordinates of the map points at each iteration of the gradient search. The overall t-SNE algorithm (van der Maaten & Hinton, 2008) is:

Data: dataset X = {x_1, ..., x_n}; cost function parameter: perplexity P(i); optimization parameters: number of iterations T, learning rate η, momentum α(t).
Result: low-dimensional data representation Y^{(T)} = {y_1, ..., y_n}.
Begin
  compute pairwise affinities p_{j|i} with perplexity P(i)
  set p_{i,j} = (p_{j|i} + p_{i|j}) / (2n)
  sample initial solution Y^{(0)} = {y_1, ..., y_n} from N(0, 10^{-4} I)
  for t = 1 to T do
    compute low-dimensional affinities q_{j|i}
    compute the gradient ∂C/∂Y (Eq. 8.7)
    set Y^{(t)} = Y^{(t-1)} + η ∂C/∂Y + α(t)(Y^{(t-1)} - Y^{(t-2)})
  end
End

When applying any of the visualization techniques, it is helpful to use more than one method, as they draw out different aspects of the data. For example, Fig. 8.1 illustrates three visualization techniques (PCA, Isomap, and t-SNE) on a subset of data drawn from the Citicoline Brain Injury Treatment Trial (COBRIT) on the effects of citicoline on traumatic brain injury (TBI) (Zafonte et al., 2012). This dataset is publicly available by permission from the Federal Interagency Traumatic Brain Injury Research (FITBIR) Informatics System website (NIH, 2019). The data matrix utilized a set of 40 phenotype features (including 6 categorical features and 5 ordinal features) for 802 patients. The class labels are based on the Glasgow Coma Scale (GCS) scores: complicated mild (13-15), moderate (9-12), and severe (3-8). As can be observed, the PCA visualization did not seem to display any variation, in contrast to the t-SNE and Isomap plots.
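A short usage sketch of producing a 2-D embedding with the scikit-learn t-SNE implementation (a Barnes-Hut approximation of the procedure summarized above) is given below. The iris data and the parameter values are illustrative; the COBRIT data shown in Fig. 8.1 are not used here.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

data = load_iris()
X = StandardScaler().fit_transform(data.data)

# Perplexity controls the effective neighborhood size; init='pca' stabilizes runs.
Y = TSNE(n_components=2, perplexity=30, learning_rate=200.0,
         init='pca', random_state=0).fit_transform(X)

plt.scatter(Y[:, 0], Y[:, 1], c=data.target, s=10)
plt.title('t-SNE embedding of the iris data')
plt.show()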

FIG. 8.1 Visualization of COBRIT TBI data based on varying levels of severity as quantified by the Glasgow Coma Scale (GCS) scores.

8.2.2.3 LargeVis

LargeVis is a graph visualization technique for reducing and visualizing high-dimensional data in two or three dimensions (Tang et al., 2016). The results of this algorithm are comparable in accuracy to those of the current state-of-the-art algorithm for graph visualization, t-SNE, while it produces the graph and dimension reduction much more quickly. Like t-SNE, LargeVis utilizes K-nearest neighbors (KNN) for graph construction and for validating accuracy. The limitation of t-SNE lies in the inefficiency of its KNN graph construction, in which the algorithm builds a large number of trees to obtain high accuracy. LargeVis remedies this by instead using neighbor exploration techniques to improve its graph accuracy, based on the idea that "a neighbor of my neighbor is also likely to be my neighbor" (Tang et al., 2016). The process is as follows: build a few random projection trees; then, for each node, search the neighbors of its nearest neighbors; repeat this for multiple iterations until the accuracy of the graph is improved. Regarding


the computation of the weights of the edges in the KNN graph, an approach similar to that of t-SNE is utilized, where the conditional probability from x_i to x_j is calculated by:

p_{j|i} = \frac{\exp\left( - \| x_i - x_j \|^2 / 2\sigma_i^2 \right)}{\sum_{(i,k) \in E} \exp\left( - \| x_i - x_k \|^2 / 2\sigma_i^2 \right)},    (8.9)

such that p_{i|i} = 0 and \sigma_i is chosen by setting the perplexity of the conditional distribution p_{\cdot|i} equal to a perplexity u. Then, the graph is made symmetric by setting the weight between x_i and x_j as:

w_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}    (8.10)

Once the graph has been constructed, the algorithm visualizes it by projecting the nodes into the desired 2D or 3D space. To do so, a probabilistic model is used. Given a pair of vertices (v_i, v_j), the probability of observing a binary edge e_ij = 1 between v_i and v_j is:

\[ P(e_{ij} = 1) = f\left(\|y_i - y_j\|\right) \qquad (8.11) \]

where y_i is the embedding of vertex v_i in the low-dimensional space and f(·) is a probabilistic function of the distance between y_i and y_j. When y_i is close to y_j in the low-dimensional space, there is a high probability of observing a binary edge between the two vertices. Most probabilistic functions can be used for f(·). However, Eq. (8.11) only defines the probability of observing a binary edge between a pair of vertices. To extend it to general weighted edges, the model is defined as:

\[ P(e_{ij} = w_{ij}) = P(e_{ij} = 1)^{w_{ij}} \qquad (8.12) \]

With this definition, given a weighted graph G = (V, E), the likelihood of the graph can be calculated as:

\[ O = \prod_{(i,j)\in E} p(e_{ij} = 1)^{w_{ij}} \prod_{(i,j)\in E'} \left(1 - p(e_{ij} = 1)\right)^{\gamma} \;\propto\; \sum_{(i,j)\in E} w_{ij} \log p(e_{ij} = 1) + \sum_{(i,j)\in E'} \gamma \log\left(1 - p(e_{ij} = 1)\right) \qquad (8.13) \]

in which E′ is the set of vertex pairs that are not observed (the negative edges), and γ is a unified weight assigned to the negative edges. The first part of the equation models the likelihood of the observed edges; if maximized, similar data points stay close together in low dimensions. The second part models the likelihood of all vertex pairs without edges; if maximized, dissimilar data points are pushed further away from each other. Maximizing the overall equation achieves both goals; however, direct maximization is computationally expensive because the number of negative edges is very large. To solve this problem, negative sampling is used to approximate the objective.


To optimize the problem, for every vertex i, some vertices j are randomly sampled according to a noisy distribution P_n(j) and the pairs (i, j) are treated as negative edges. The noisy distribution is chosen as P_n(j) ∝ d_j^{0.75}, in which d_j is the degree of vertex j. Letting M be the number of negative samples drawn for each positive edge, the objective function can be redefined as:

\[ O = \sum_{(i,j)\in E} \left( w_{ij} \log p(e_{ij} = 1) + \sum_{k=1}^{M} \mathbb{E}_{j_k \sim P_n(j)} \left[ \gamma \log\left(1 - p(e_{i j_k} = 1)\right) \right] \right) \qquad (8.14) \]

The algorithm optimizes this objective with asynchronous stochastic gradient descent, which is very efficient and effective on sparse graphs. With negative sampling, each gradient step takes O(sM) time, where M is the number of negative samples and s is the number of output dimensions (2 or 3), giving an overall time complexity of O(sMN), where N is the number of nodes (Tang et al., 2016).
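The NumPy sketch below illustrates the negative-sampling objective of Eqs. (8.11)–(8.14) for a single positive edge. It is an illustration only, not the LargeVis implementation: the probabilistic function f(d) = 1/(1 + d^2), the unified negative weight gamma, and the uniform toy degrees are all assumed choices.

import numpy as np

def edge_prob(yi, yj):
    # Assumed probabilistic function f of the embedding distance (Eq. 8.11): f(d) = 1 / (1 + d^2)
    d2 = np.sum((yi - yj) ** 2)
    return 1.0 / (1.0 + d2)

def sampled_objective(Y, i, j, w_ij, degrees, gamma=7.0, M=5, rng=np.random.default_rng(0)):
    # Contribution of one positive edge (i, j) to Eq. (8.14), with M negative samples
    # drawn from the noise distribution P_n(j) proportional to degree^0.75.
    obj = w_ij * np.log(edge_prob(Y[i], Y[j]))
    p_noise = degrees ** 0.75
    p_noise = p_noise / p_noise.sum()
    negatives = rng.choice(len(Y), size=M, p=p_noise)
    for k in negatives:
        if k != i and k != j:
            obj += gamma * np.log(1.0 - edge_prob(Y[i], Y[k]) + 1e-12)
    return obj

# Toy usage: 10 random 2-D embeddings, uniform degrees, edge weight 1.0
Y = np.random.default_rng(1).normal(size=(10, 2))
print(sampled_objective(Y, 0, 1, 1.0, degrees=np.ones(10)))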

8.2.2.4 Self-organizing maps
A widely used visualization method is the self-organizing map (SOM) (Kohonen, 1982). Each position in the SOM lattice is associated with a weight in the data space; therefore, a nonlinear dimensionality reduction is achieved when mapping from the input space (data space) to the output space (SOM lattice). Its neurons are arranged in a lattice (output space), in an organized manner, according to a given topology (e.g., hexagonal or rectangular). During the training process, the closest neuron w to each data sample x in the input space is determined and updated according to

\[ w_j(t+1) = \frac{\sum_{i=1}^{N} h_{j,\mathrm{bmu}}(t)\, x_i}{\sum_{i=1}^{N} h_{j,\mathrm{bmu}}(t)} \qquad (8.15) \]

where w_j is the weight of neuron j, x_i is the ith sample presented, N is the cardinality of the dataset, and h_{j,bmu}(t) is the monotonically decreasing neighborhood function of the best matching unit (BMU). The neighborhood kernel is usually defined as a Gaussian function of the distance between the neurons in the output space:

\[ h_{j,\mathrm{bmu}}(t) = \exp\left(-\frac{\|r_{\mathrm{bmu}} - r_j\|^2}{2\sigma^2(t)}\right) \qquad (8.16) \]

where r_j and r_bmu are the locations of neuron j and the BMU in the lattice, and σ(t) is a monotonically decreasing neighborhood radius function. This property can be used for visualization purposes by inferring data characteristics from the SOM neurons. Several advancements have been proposed for SOM-based visualization, which can be categorized as image-based, graph-based, and projection-based methods. (i) Image-based methods: In image-based methods, the U-matrix (Ultsch, 1993) is utilized as a visualization technique to represent the SOM. The Euclidean distance


between each neuron in the SOM is calculated and depicted in the U-matrix. After the SOM is trained on the data, the distance between each pair of neurons in the U-matrix is assumed to represent the distance between the corresponding data samples. The boundary matrix (Manukyan, Eppstein, & Rizzo, 2012) is a visualization scheme that computes interneuron distances, similar to the U-matrix, for sparsely matched SOMs (low sample-to-neuron ratio). The boundary matrix includes a post-processing step called the cluster reinforcement phase to display sharpened cluster boundaries in sparsely matched SOMs. Smoothed data histograms (Pampalk, Rauber, & Merkl, 2002) are a visualization method that aims to estimate the probability density function. This is achieved by allowing more than one BMU for each data sample; the BMU is redefined by the number of samples inside the Voronoi cells pertaining to each SOM neuron. The number of BMUs considered for each data sample is controlled by a user-defined smoothing parameter. (ii) Graph-based methods: In the graph-based methods, CONNvis (Tasdemir & Merenyi, 2009) is a visualization technique for SOMs that utilizes the Delaunay triangulation graph to encode the weight of the edges using the local data distribution between adjacent neurons (nodes). The weights of the graph are stored in a connectivity/similarity matrix (CONN), where each element (i,j) counts the number of samples x in a dataset X for which neurons i and j are the first and the second BMUs, and vice versa. Cluster connections (Merkl & Rauber, 1997) and DISTvis (Tasdemir, 2010) are graph-based SOM visualizations that depict local distances. Cluster connections display the connections of neighboring neurons in the output grid proportional to the distance between their weights, whereas DISTvis is a rendering of the graph DIST, whose edge weights encode Euclidean distances on the SOM grid, allowing connections between any neurons. Thresholds and gray-level scales depict intensity and enhance the visual representation of clusters in both methods. (iii) Projection-based methods: In the standard SOM, the data distribution is not faithfully represented because it does not preserve interneuron distance. The visualization-induced SOM (ViSOM) (Yin, 2002) was introduced to address this challenge and to provide a low-computational-cost alternative. ViSOM introduced a mapping function to allow the projection of new data samples onto the trained manifold without having to recalculate using the whole dataset. ViSOM has a distance-preserving property in addition to the topology preservation present in a standard SOM. ViSOM is a uniform quantizer, whereas the SOM is a density-based quantizer: ViSOM's neurons are uniformly distributed over the data manifold. The AC-ViSOM (Tapan & Siong, 2008) is a hybridization of ViSOM and the modified adaptive coordinates (modified-AC), which aims to automate the selection of the regularization parameter and improve ViSOM's resource utilization (quantity of dead neurons). The modified-AC is a variant of the adaptive coordinates that observes the displacement of the SOM weights in the input space and mimics them in the


output space. In this manner, information obtained during the training process is used for visualization. In the PolSOM (Xu, Xu, & Chow, 2010), the output space is defined in a polar coordinate system. The neuron and data positions in the output space are encoded using radii and angles to express the importance of each feature. The neurons are distributed in the polar plane at the intersections of rings and radial axes. The data samples have their associated positions in the output space adjusted throughout the learning process in order to be close to their respective BMU. In the projection step, this representation emphasizes the differences among the clusters by displaying a correlation between features (angles) and feature values (radii).
SOM-IT (Brito da Silva & Wunsch, 2018) represents an improvement over the previously mentioned image-based visualization techniques by incorporating information-theoretic measures into the heatmap. Let X = {x_1, x_2, ..., x_n} represent the dataset with n samples. Each neuron i is associated with a subset H_i of X with at least MinPts data points. First, these subsets H_i are generated based on Voronoi cells, i.e., the BMU for each data point is determined. If the number of points in H_i is less than MinPts, then the subset H_i is reset to include the MinPts points closest to this neuron. This step is accomplished using standard k-nearest neighbors (k-NN). SOM-IT utilizes the Cross Information Potential (CIP) as the similarity measure for the U-matrix. The Cross Information Potential between two subsets H_i and H_j is defined as:

\[ V(H_i, H_j) = \frac{1}{|H_i||H_j|} \sum_{i}^{|H_i|} \sum_{j}^{|H_j|} G_d(x_i, x_j; \Sigma_{i,j}) \qquad (8.17) \]

where G_d is the Gaussian kernel:

\[ G_d(x_i, x_j; \Sigma) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\Sigma}\right) \qquad (8.18) \]

To compute the CIP, it is necessary to compute Σ_{i,j} = Σ_i + Σ_j with Σ_i = (1/(N_i − 1)) X_i^T X_i. SOM-IT can also incorporate the representative center of each cluster and use the representative Cross Information Potential (rCIP) as the similarity measure for the U-matrix. The rCIP between two subsets H_i and H_j is defined as:

\[ V(H_i, H_j) = \frac{1}{|H_i||H_j|} \sum_{i}^{|H_i|} \sum_{j}^{|H_j|} G_k(c_i - c_j; \Sigma_{i,j}) \qquad (8.19) \]

where G_k is a multivariate Gaussian distribution parameterized by mean c_i − c_j and covariance Σ_{i,j}, and c_i and c_j are, respectively, the arithmetic means of clusters i and j. The overall SOM-IT algorithm is:
Data: dataset X = {x_1, ..., x_n}. Result: U-matrix.
Begin
• Train the SOM.


• Generate the subsets H_i: if the number of points falling into the Voronoi region of neuron i is less than MinPts, then recreate H_i using standard k-NN or modified k-NN.
• Compute the covariance matrix and mean of each subset.
• Compute CIP or rCIP.
• Generate the IT-vis following the arrangement of the U-matrix with similarity measure CIP or rCIP.
End
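The following NumPy sketch computes the CIP of Eqs. (8.17) and (8.18) between two subsets of samples. It is a simplified illustration, not the SOM-IT implementation: the covariance term is reduced to a scalar (the mean variance of each subset) so that the Gaussian kernel stays one-dimensional, whereas the full formulation uses the complete covariance matrices.

import numpy as np

def cip(Hi, Hj):
    # Simplified Cross Information Potential (Eqs. 8.17-8.18).
    # sigma plays the role of Sigma_{i,j} = Sigma_i + Sigma_j, collapsed to a scalar here.
    sigma = Hi.var(axis=0).mean() + Hj.var(axis=0).mean()
    total = 0.0
    for xi in Hi:
        for xj in Hj:
            total += np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma + 1e-12))
    return total / (len(Hi) * len(Hj))

# Toy usage: two subsets of samples assigned to two different SOM neurons
rng = np.random.default_rng(0)
Hi = rng.normal(0.0, 1.0, size=(20, 5))
Hj = rng.normal(3.0, 1.0, size=(15, 5))
print(cip(Hi, Hi), cip(Hi, Hj))   # within-subset CIP is larger than between-subset CIP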

8.2.2.5 Visualization of commonly used biomedical data sets from the UCI machine learning repository (Dua & Taniskidou, 2017)
In this section, we apply some of the techniques discussed above (PCA, ISOMAP, LargeVis, SOM-IT) to visualize commonly used biomedical datasets from the UCI machine learning repository. Fig. 8.2 demonstrates the techniques for the E. coli data, which has 8 clusters, 7 attributes, and 336 samples. Fig. 8.3 displays the results for the Parkinson's disease data, which has 2 clusters, 22 attributes, and 195 samples. Likewise, the visualization techniques are applied to the breast tissue data (6 clusters, 9 attributes, and 106

FIG. 8.2 Visualization of E.coli data (8 clusters/7 attributes/336 samples).


FIG. 8.3 Visualization of Parkinson’s disease data (2 clusters/22 attributes/195 samples).

samples) in Fig. 8.4. As can be observed from the visualization results, the patterns revealed vary by technique and dataset.

8.3 Topological data analysis
Topological data analysis is a relatively new field for the visualization and analysis of data using geometric mathematical principles (Liu et al., 2017). Many of its ideas revolve around concepts such as contours, elevation, and the overall densities of clusters or groups of points. There are multiple algorithms that can create a topological graph; the examples illustrated in this text are based on Mapper algorithms. A Mapper algorithm receives an input dataset of n dimensions and applies a lens (a projection or function) to transform the data. The lens typically performs a dimensionality reduction on the dataset to some k dimensions, where k < n. The lens can be anything from the elevation of the data along one dimension, to a median transformation, to a full algorithm such as t-SNE or a KNN dimension reduction. Once the lens is created, the algorithm creates a selection of covers (buckets, hypercubes, collections, etc.) in each dimension of the lens, with each cover having a degree of overlap, allowing for possible shared data


FIG. 8.4 Visualization of breast tissue data (6 clusters/9 attributes/106 samples).

points. The Python library KeplerMapper, which was used to create the network graphs in this section, uses hypercubes for the cover (Saul & van Veen, 2017). The number of hypercubes is

\[ N_{\mathrm{hypercubes}} = n^k \qquad (8.20) \]

where n is the number of hypercubes in one dimension, and k is the number of dimensions of the lens. The dimension-reduced data are then placed in those buckets, and a graph is created by mapping the points in each bucket. The color and distribution of each bucket are based on clustering the original high-dimensional data, restricted to the data within that bucket. With the KeplerMapper library, the color gradient of a node comes from a defined palette and is based on the distribution of members in the bucket, but it follows the same procedure of clustering within the bucket. In summary, a Mapper generates clusters from a clustering algorithm; places all the points into buckets based on the lens projection; creates vertices from the points within each bucket, with the color of a node determined by the largest cluster of data within that bucket; and then develops the edges, or "nerves," when two or more covers intersect and share common points. If the covers overlap without sharing any points, there will not be a connecting nerve between the vertices. For more details and visual illustrations, see (Saul, 2017). A minimal usage sketch of this workflow is shown below.
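The following sketch outlines the Mapper workflow just described using KeplerMapper. It is only a sketch: the lens (PCA), the clusterer (DBSCAN), the cover parameters, the stand-in dataset, and the output file name are illustrative choices, and the exact cover API may differ slightly between KeplerMapper versions.

import kmapper as km
import sklearn.cluster
import sklearn.decomposition
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)          # stand-in biomedical dataset

mapper = km.KeplerMapper(verbose=1)

# Lens: project the data to k = 2 dimensions (here with PCA).
lens = mapper.fit_transform(X, projection=sklearn.decomposition.PCA(n_components=2))

# Cover the lens with overlapping hypercubes and cluster within each bucket.
graph = mapper.map(lens,
                   X,
                   cover=km.Cover(n_cubes=10, perc_overlap=0.2),
                   clusterer=sklearn.cluster.DBSCAN(eps=0.5, min_samples=3))

# Write an interactive HTML visualization of the resulting topological graph.
mapper.visualize(graph, path_html="mapper_output.html", title="Mapper example")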


The resulting graphs can be informative about the overall shape of the data. For example, with the Iris dataset, certain clustering algorithms and lens parameters make it possible to create a connected graph that resembles the image of an iris. The goal is to produce a visualization that can be mapped to a realization of what the dataset represents (such as brain neurons in connection with Parkinson's disease datasets). Though they are not perfect representations of the physical features of the data's subject, one should be able to see that the shapes of the graphs are very similar to the original reference object. Because the graphs are created using different clustering algorithms, the type of clustering algorithm applied, as well as the type of lens, influences the overall topological graph representation of the data and the connections of the vertices. We display the results of the visualization techniques (Figs. 8.5–8.7) on the sample datasets used in the previous section: E. coli, Parkinson's disease, and breast tissue. In Figs. 8.5–8.7 the parameters used for each clustering algorithm were the defaults, except for DBSCAN in the Parkinson's and breast tissue figures. For Parkinson's, in order for a graph to be created, the epsilon value was increased to 0.7 and the minimum samples were decreased to 2. For breast tissue, the minimum samples were simply decreased to 2. Spectral clustering had issues with the Parkinson's data when projected through the KNN lens, but, by rare chance, the seed for

FIG. 8.5 Visualization of E.coli data (8 clusters/7 attributes/336 samples) using different settings of k and clustering algorithms for the Mapper topological data analysis tool.


FIG. 8.6 Visualization of Parkinson’s disease data (2 clusters/22 attributes/195 samples) using different settings of k and clustering algorithms for the Mapper topological data analysis tool.

FIG. 8.7 Visualization of the breast tissue data (6 clusters/9 attributes/106 samples) using different settings of k and clustering algorithms for the Mapper topological data analysis tool.


randomness produced a single connected graph. The cover used for each of the results had 100 hypercubes per dimension with an overlap percentage of five. A higher overlap percentage and hypercube count increase the number of nodes and connected nodes, which can help display more robust figures if configured correctly.

8.4 Visualization for neural network architectures
For a gradient-free approach, Caudell et al. (Caudell, Xiao, & Healy, 2003) introduced a framework that enables users to construct a hierarchical network of structures in eLOOM and perform a real-time simulation of the model in Flatland. The eLOOM graph consists of modules (which can be considered the output neurons), nodes (neurons), node arrays (collections of neurons, or layers), connections (weights or model parameters), and connection arrays (collections of weights). A module includes all the connections to its nodes. A module can further connect with other modules to form a hierarchical structure. A node has two components: input and output. During simulation, the input to the node is processed to produce outputs, which are later used as the input for its parent module. Each step that transfers the input to the output within the node and between the node and module is parameterized by the graph's connections. The high-level structure of Flatland is the universe, which includes the transformation graph, a database of the objects in the graph, and a reference to the vertex that is currently acting as the root. This root is used for a graphical rendering of the camera's viewpoint. A graph is generated by reading a text file. Each node in the graph is associated with an object. Each object contains a draw function and a sound function. The draw function contains the code to draw the object in a local coordinate system, which handles the placement and orientation of the object. Sound objects contain the code to generate a sound trigger for the object. The local coordinate system further enables global position-tracking technology. This lets the user "fly" freely in the simulation while observing each component of the process.
Some approaches for understanding and visualizing Convolutional Neural Networks (CNNs) have been developed in the literature. Visualization for deep networks enables a better interpretation of their training process, which is commonly criticized as a black box. This section briefly surveys some of these approaches and their related work. Consider an image x ∈ ℝ^(C×H×W), where C = 3 is the number of channels and H and W are the height and width, viewed as a set of pixels x_{i,j} at positions (i, j), together with a classification function f: ℝ^(C×H×W) → ℝ⁺. The function f indicates the certainty of the types of objects in the image. This function can be learned and represented by a deep neural network. Throughout this section, it is assumed that the neural network consists of multiple layers of neurons with activation values computed as:

\[ a_j^{l+1} = \sigma\left(\sum_i z_{ij} + b_j^{l+1}\right), \quad \text{with } z_{ij} = a_i^l\, w_{ij}^{l,l+1} \qquad (8.21) \]


The activation value a_j^{l+1} of neuron j at layer l+1 is the value of a predefined activation function σ (commonly ReLU) applied to the summation of all the connections z_{ij} from the previous layer plus the bias term b_j^{l+1}. The connection value is defined as the product of the activation value and the connection weight w_{ij}^{l,l+1}. Approaches for the visualization of a deep neural network focus on learning an additional transformation function h_p = H(p, f, x) that calculates the influence of pixel p on the final prediction f. In (Simonyan, Vedaldi, & Zisserman, 2014), this function is defined as the norm, over the channels c of pixel p, of the partial derivative of the output with respect to the pixel value:

\[ h_p = \left\| \left( \frac{\partial f(x)}{\partial a_{p,c}} \right)_{c \in C} \right\| \qquad (8.22) \]

This quantity measures how much small local changes in the pixel value affect the network output. Large values denote pixels that strongly affect the classification function f. The partial derivatives are calculated by running the backpropagation algorithm through the layers of the network. The backpropagation formula between two consecutive layers l and l+1 is:

\[ \frac{\partial f}{\partial a^{(l)}} = \frac{\partial a^{(l+1)}}{\partial z^{(l)}} \, \frac{\partial f}{\partial a^{(l+1)}} \qquad (8.23) \]
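A gradient-based saliency map of this kind can be obtained in a few lines with an automatic-differentiation framework. The sketch below uses TensorFlow/Keras with a tiny untrained convolutional model and a random input purely for illustration; in practice f would be a trained classifier, and the model architecture, input size, and target class chosen here are all assumptions.

import numpy as np
import tensorflow as tf

# Tiny stand-in classifier f(x); in practice this would be a trained CNN.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

x = tf.convert_to_tensor(np.random.rand(1, 32, 32, 3).astype('float32'))
target_class = 0   # assumed class of interest

with tf.GradientTape() as tape:
    tape.watch(x)                        # treat the image as a variable
    score = model(x)[0, target_class]    # f(x) for the chosen class

grads = tape.gradient(score, x)          # partial derivatives used in Eq. (8.22)
saliency = tf.norm(grads, axis=-1)[0]    # norm over the channel axis gives h_p per pixel
print(saliency.shape)                    # (32, 32) saliency map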

In (Zeiler & Fergus, 2014), h_p is defined via a deconvolution network that maps the network output back to the pixel input using the backpropagation rule:

\[ R^{(l)} = m_{\mathrm{dec}}\left(R^{(l+1)}; \theta^{(l,l+1)}\right) \qquad (8.24) \]

Here, R^{(l)} and R^{(l+1)} are the backward signals as they are backpropagated from one layer to the previous layer, m_dec is a predefined function that may differ for each layer, and θ is the set of parameters connecting two layers of neurons. Optimization is over the following objective function using gradient descent:

\[ \sum_{i=1}^{N} \sum_{l=1}^{L} \left\| x^{(i)} - R^{(l)} \right\| \qquad (8.25) \]

Yosinski, Clune, Nguyen, Fuchs, & Lipson (2015) designed a similar approach that aims to reconstruct the input image given its output, with a different loss function:

\[ \sum_{i=1}^{N} \sum_{l=1}^{L} \left\| x^{(i)} - R_s^{(l)} \right\| \qquad (8.26) \]

where R_s^{(l)} is defined through a parameterized regularization function that penalizes the activation output in various ways. The authors introduced four different regularization schemes:
• L2 decay: penalizes large values and is calculated as r(x) = (1 − d)x. L2 decay tends to prevent a small number of extreme pixel values from dominating the prediction.
• Gaussian blur: penalizes high-frequency information and is calculated as
\[ G(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}} \]


• Clipping pixels with small norm: calculates the norm of each pixel and then sets any pixels with a norm less than a certain threshold to zero.
• Clipping pixels with small contribution: sets the value of each pixel to zero if its calculated contribution is less than a certain threshold. The contribution is computed as \( \sum_{l=1}^{L} x^{(i)} \cdot \frac{\partial R_s^{(l)}}{\partial x} \), the summation of the element-wise product of x and the gradient over all three channels.

8.5 Summary
Improved data visualization is a significant tool for enhancing cluster analysis. This chapter has primarily discussed dimensionality reduction and topological data analysis methods for visualizing data. Graph-based methods were also discussed to promote interactive and integrated machine learning, and the neural network-based visualization methods centered on those suited for deep learning techniques. After discussing the different methods, this chapter also provided examples on several datasets to help the user choose the appropriate technique for specific situations.

References Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373e1396. https://doi.org/10.1162/089976603321780317. Borg, I., & Groenen, P. J. F. (1997). Modern multidimensional scaling (1st ed.) https://doi.org/10.1007/ 978-1-4757-2711-1. Brito da Silva, L. E., & Wunsch, D. C. (2018). An information-theoretic-cluster visualization for selforganizing maps. IEEE Transactions on Neural Networks and Learning Systems, 29(6), 2595e2613. https://doi.org/10.1109/TNNLS.2017.2699674. Caudell, T. P., Xiao, Y., & Healy, M. J. (2003). eLoom and Flatland: specification, simulation and visualization engines for the study of arbitrary hierarchical neural architectures. Neural Networks, 16(5e6), 617e624. https://doi.org/10.1016/S0893-6080(03)00105-9. Chen, W., Guo, F., & Wang, F. (2015). A survey of traffic data visualization. IEEE Transactions on Intelligent Transportation Systems, 16(6), 2970e2984. https://doi.org/10.1109/TITS.2015.2436897. Cherkassky, V., & Mulier, F. M. (1998). Learning from data: Concepts, theory, and methods (1st ed.). Wiley. Comon, P. (1994). Independent component analysis, A new concept? Signal Processing, 36(3), 287e314. https://doi.org/10.1016/0165-1684(94)90029-9. Dua, D., & Taniskidou, K. (2017). Visualization of commonly used biomedical data sets from the UCI machine learning repository. Retrieved June 1, 2019, from UCI Machine Learning Repository website: http://archive.ics.uci.edu/ml. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Wiley. Hinton, G. E., & Roweis, S. T. (2003). Stochastic neighbor embedding. In Advances in neural information processing systems (pp. 857e864).


Holzinger, A., & I., J. (2014). Knowledge discovery and data mining in biomedical Informatics: The future is in integrative, interactive machine learning solutions. In , Lecture notes in computer science: Vol. 8401. Interactive knowledge discovery and data mining in biomedical Informatics (pp. 1e18). Springer-Verlag Berlin Heidelberg. Hyva¨rinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94e128. Hyva¨rinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis (1st ed.). Wiley. Hyva¨rinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13(4e5), 411e430. https://doi.org/10.1016/S0893-6080(00)00026-5. Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4e37. https://doi.org/10.1109/34.824819. Jeong, D. H., Zeimeiwicz, C., Fisher, B., Ribarskey, W., & Chang, R. (2009). iPCA: An interactive system for PCA-based visual analytics. Computer Graphics Forum, 28(3), 767e774. https://doi.org/10.1111/j. 1467-8659.2009.01475.x. Jolliffe, I. T. (1986). Principal component analysi (1st ed.) https://doi.org/10.1007/978-1-4757-1904-8. Jouppi, N. P., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., et al. (2017). In-datacenter performance analysis of a tensor processing unit. ACM SIGARCH Computer Architecture News, 45(2), 1e12. https://doi.org/10.1145/3140659.3080246. Jutten, C., & Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1e10. https://doi.org/10.1016/0165-1684(91) 90079-X. Keim, D. A. (2002). Information visualization and visual data mining. In IEEE transactions on visualization and computer graphics. https://doi.org/10.1109/2945.981847. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59e69. https://doi.org/10.1007/BF00337288. Liu, S., Maljovec, D., Wang, B., Bremer, P., & Pascucci, V. (2017). Visualizing high-dimensional data: Advances in the past decade. IEEE Transactions on Visualization and Computer Graphics, 23(3), 1249e1268. https://doi.org/10.1109/TVCG.2016.2640960. van der Maaten, L., & Kinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579e2605. Manukyan, N., Eppstein, M. J., & Rizzo, D. M. (2012). Data-driven cluster reinforcement and visualization in sparsely-matched self-organizing maps. IEEE Transactions on Neural Networks and Learning Systems, 23(5), 846e852. https://doi.org/10.1109/TNNLS.2012.2190768. Merkl, D., & Rauber, A. (1997). Alternative ways for cluster visualization in self-organizing maps. Workshop on Self-Organizing Maps, 106e111. Myatt, G. J., & Johnson, W. P. (2011). Making sense of data III: A practical guide to designing interactive data visualizations. https://doi.org/10.1002/9781118121610. NIH. (2019). Federal interagency traumatic brain injury research (FITBIR) Informatics system. Retrieved June 1, 2019, from National Institute of Health website: https://fitbir.nih.gov/. Pampalk, E., Rauber, A., & Merkl, D. (2002). Using smoothed data histograms for cluster visualization in self-organizing maps. In Internation conference on artificial neural networks (ICANN) (pp. 871e876). https://doi.org/10.1007/3-540-46084-5_141. Plumbley, M. D. (2003). Algorithms for nonnegative independent component analysis. IEEE Transactions on Neural Networks, 14(3), 534e543. 
https://doi.org/10.1109/TNN.2003.810616. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Computer Vision, 290(5500), 2323e2326. https://doi.org/10.1126/science.290.5500.2323.


Saul, N. (2017). What is a mapper? Understanding of mapper from the ground up. Retrieved from https:// sauln.github.io/blog/mapper-intro/. Saul, N., & van Veen, H. J. (2017). MLWave/kepler-mapper, 186f https://doi.org/10.5281/zenodo.1054444. Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In International conference on learning representations. Tang, J., Liu, J., Ming, Z., & Mei, Q. (2016). Visualizing large-scale and high-dimensional data. In International conference on world wide web (pp. 287e297). https://doi.org/10.1145/2872427. 2883041. Tapan, M. S. Z., & Siong, T. C. (2008). AC-ViSOM: Hybridising the modified adaptive coordinate (AC) and ViSOM for data visualization. International Symposium on Information Technology, 3, 1e8. https:// doi.org/10.1109/ITSIM.2008.4632006. Tasdemir, K. (2010). Graph based representations of density distribution and distances for selforganizing maps. IEEE Transactions on Neural Networks, 21(3), 520e526. https://doi.org/10.1109/ TNN.2010.2040200. Tasdemir, K., & Merenyi, E. (2009). Exploiting data topology in visualization and clustering of selforganizing maps. IEEE Transactions on Neural Networks, 20(4), 549e562. https://doi.org/10.1109/ TNN.2008.2005409. Tenenbaum, J. B., De Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319e2323. https://doi.org/10.1126/science.290.5500. 2319. Ultsch, A. (1993). Self-organizing neural networks for visualisation and classification. Information and Classficiation, 307e313. https://doi.org/10.1007/978-3-642-50974-2_31. Ward, M. O., Grinstein, G., & Keim, D. (2015). Interactive data visualization: Foundations, techniques, and applications. Boca Raton, FL: CRC Press. Xu, R., & Wunsch, D. C. (2009). Clustering. IEEE Press/Wiley. https://doi.org/10.1002/9780470382776. Xu, L., Xu, Y., & Chow, T. W. S. (2010). PolSOM: A new method for multidimensional data visualization. Pattern Recognition, 43(4), 1668e1675. https://doi.org/10.1016/j.patcog.2009.09.025. Yin, H. (2002). ViSOM - a novel method for multivariate data projection and structure visualization. IEEE Transactions on Neural Networks, 13(1), 237e243. https://doi.org/10.1109/72.977314. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., & Lipson, H. (2015). Understanding neural networks through deep visualization. ICML Deep Learning Workshop. Zafonte, R. D., Bagiella, E., Ansel, B. M., Novack, T. A., Friedewald, W. T., Hesdorffer, D. C., et al. (2012). Effect of citicoline on functional and cognitive status among patients with traumatic brain injury. JAMA, 308(19), 1993e2000. https://doi.org/10.1001/jama.2012.13256. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. Computer Vision, 8689, 818e833. https://doi.org/10.1007/978-3-319-10590-1_53. Zhu, Z., Heng, B. H., & Teow, K.-L. (2017). Interactive data visualization to understand data better: Case studies in healthcare system. International Journal of Knowledge Discovery in Bioinformatics, 4(2), 1e10. https://doi.org/10.4018/978-1-5225-1837-2.ch002.

9 Data analysis and machine learning tools in MATLAB and Python
9.1 Introduction
In this chapter, we discuss machine learning (ML) and data processing tools and functions in MATLAB and Python. We review available functions and methods (particularly in MATLAB) to demonstrate biomedical data analysis techniques. We traverse the entire path of data analysis, from the initial loading of raw data to the implementation of ML algorithms and the final phase of knowledge inference. We describe the various phases of data processing and analysis. This includes loading, cleansing and imputation, preprocessing, dimensionality reduction, variable selection, clustering, classification, data visualization, and finally model evaluation and validation. This chapter is designed for beginner or intermediate programmers of MATLAB and/or Python. We recommend that readers who have never used MATLAB or Python gain basic skills before reading this chapter. For MATLAB, the authors recommend the MATLAB documentation (MathWorks, 2012) or (Hanselman & Littlefield, 2011); for Python, the language documentation (Foundation, 2019) should suffice.

9.2 Importing data
Data need to be loaded in convenient formats. The process of data loading and importing depends on the tools that are available in each programming platform and the data file types. We discuss the tools for each platform separately.

9.2.1 Reading data in MATLAB

MATLAB supports various industry-standard formats and other custom formats, in addition to the native MAT-files. Some of the supported file formats are as follows:
1. Text formats, including any whitespace-delimited numbers, any text format, or a mixture of strings and numbers, and XML (Extensible Markup Language), using dlmread, textscan, and xmlread, respectively.
2. Spreadsheet formats, which are all Excel worksheets including xlsx, xlsb, and xlsm; xlsread and xlswrite are used to read and write such files, respectively.
3. Scientific data format: Common Data Format (CDF), using cdfread and cdfwrite for reading and writing such files.


4. Image formats such as Windows bitmap (BMP), flexible image transport system (FITS and FTS), graphics interchange format (GIF), hierarchical data format (HDF), icon image (ICO), joint photographic experts group (JPEG, including JPG, JP2, JPF, JPX, J2C, and J2K), portable bitmap (PBM), portable network graphics (PNG), paintbrush (PCX), portable graymap (PGM), tagged image file format (TIFF, including TIF), portable any map (PNM), portable pixmap (PPM), Sun Raster (RAS), and X window dump (XWD). All of these image formats are read and written using imread and imwrite, respectively.
5. Audio and video formats, which are also supported by MATLAB. These formats will not be covered in this chapter; however, interested readers can use VideoReader, VideoWriter, auread, and auwrite.
The following sections discuss the commonly used data importing tools in MATLAB and their properties. The functions are grouped by the data format utilized in storing the imported data in the MATLAB workspace.

9.2.1.1 Interactive import function In an interactive setting, the simplest way to import data into the MATLAB workspace is by using the uiimport function from the GUI. This function is located on the Home tab of the main MATLAB menu (this is in MATLAB2012 and higher only) as shown in Fig. 9.1. The uiimport function provides users with the ability to save data as cell array, numerical matrix, vector variables, or other data formats. Furthermore, uiimport supports missing data handling in different approaches. This interactive import function reads all data types supported in MATLAB. The function can also be activated from the command line by simply typing uiimport.

9.2.1.2 Reading data as formatted tables The readtable function is used to read data from text files, data stores (see Section 9.2.3), and spreadsheets as formatted tables. This function reads column-oriented data from a file and creates a table from it. It reads delimited text files, such as those with extensions txt, dat, and csv, as well as spreadsheet files xls, xlsm, xlsx, xlsb, xltm, xltx, and ods. The readtable function creates a table object in the MATLAB environment as a structure that has variables corresponding to each column in the input data file. The user can access the variables’ names and the table’s properties by using the Tab key after the name of the table variable to activate the MATLAB autocomplete feature, as illustrated in Fig. 9.2.

FIG. 9.1 MATLAB GUI showing the interactive import location.



FIG. 9.2 Executing the readtable Function.

The table object in MATLAB offers many benefits such as the ability to apply a single function simultaneously on its variables (i.e., columns), changing the names of variables, easily adding new variables, and providing statistics about the variables in the table. The name value arguments can be used to format the table entries. Furthermore, tables are used frequently with clustering and classification tools, as shown in Sections 9.3 and 9.4. The following command is used to create a table (named T) from a space separated file and import the first column as a character array (i.e., string), the second as an unsigned integer uint32, the next two columns as double-precision floating point numbers, and the last one as a character. >> T= readtable(filename, 'Delimiter',' ','format', '%s%u%f%f%s')

The default delimiter value for the readtable function is comma-separated, and the default format is double-precision floating point numbers. However, in MATLAB 2017a and later versions, a new function (detectImportOptions) has been added. It detects the format of the input file automatically so that format can be used to define the table entries. As illustrated in the example below, assuming a file named ‘patients.xlsx’ (Bache & Lichman, 2013) exists, its columns’ (i.e., variables) format can be detected first and then that can be used to import their contents properly. >> opt=detectImportOptions('patients.xlsx'); >> T=readtable('patients.xlsx',opt);

More examples are available in Al-Jabery (2019). To read a specific sheet from an Excel file as a table in MATLAB, the sheet's name should be specified in the readtable command, as illustrated below: >> Sheet_table=readtable(xlsfilename,'Sheet','Sheet_name');

where Sheet_name is a character variable. The readtable function, unlike the xlsread function (see Section 9.2.1.4), needs the sheet's name, not its numeric index. Accessing elements in tables is done using the table name, followed by a dot, and then the variable name (T.Var1, as in Fig. 9.2). To access more than one variable in the


table, curly braces are used. For example, to read the data from variables var1, var2, and var5 of table T (created in prior examples) and store them in an array var_array, either one of the following code lines can be used: >> var_array=T{:,{'var1', 'var2', 'var5'}}; % Note that these two instructions >> var_array=T{:,[1,2,5]}; % are equivalent and provide the same output

The output of these commands is a numeric or cell array, based on the data type of the selected variables. If the selected variables are numeric, the output will be a numeric array, but if they are character arrays the output will be a cell array (i.e., an array that can contain more than one type of variable). If the curly braces are replaced with parentheses, these two commands will return a table. In summary, loading data in a table format in the MATLAB workspace has the following advantages over storing them in other data formats:
1. Ability to read data from files and store them as one organized structure.
2. Easier access and manipulation of data variables in the table.
3. Ability to apply functions over multiple variables at the same time, for example, with bsxfun.
4. Flexibility to update the table at runtime.

9.2.1.3 Reading data as cellular arrays Cellular data are heterogeneous data that are indexed numerically. Cellular arrays are used to store different types of data in the same array. (There is another type of data structure in MATLAB known as “Structure” that is similar to cellular arrays. The main difference is that its contents are indexed by the name of the field not their indices.) The advantage of using cell arrays is that it handles the nonnumeric data entries in datasets, which is very common in biomedical datasets. The primary function for reading data as a cellular array is the textscan function. This function is used after opening the text file and specifying its format. The function returns the contents of the file as a cell array, as shown in the textscan demonstration example (Fig. 9.3). Using a variable to store the file’s id (i.e., fid) is recommended so it can be reused to read the file. In newer versions of MATLAB (i.e., 2017a and later), the options can be specified automatically using the detectImportOptions function as described in Section 9.2.1.2. The output of the textscan function is a (1  d) cell array, where d is the number of dimensions in the data extracted from the file. The textscan function can also be used to read other text-based files such as comma-separated, Excel, etc. Other functions, such as spreadsheet readers, are used to extract data into cell arrays. For detailed code examples, see Al-Jabery, (2019).


FIG. 9.3 Importing data into a cell array using textscan.

9.2.1.4 Reading data as numerical arrays and matrices The most common data type employed for reading data into MATLAB workspace is the numerical data array. This data type is obtained from using almost any data importing function. The following section reviews these functions, specifically xlsread, dlmread, csvread, and importdata. 9.2.1.4.1 xlsread function The xlsread function is used for reading data from Microsoft Excel spreadsheets (i.e., files with .xls, .xlsx) in MATLAB. This function reads data from the spreadsheet and returns a numeric array. It is also able to read non-numeric data if configured properly as shown in the example below. CodeEx.1: reading spreadsheets using xlsread. This example demonstrates different scenarios for reading spreadsheets using xlsread. Case 1: Reading the contents of the spreadsheet as a numerical array into MATLAB. >> num_array=xlsread(filename);

where “filename” is a character string variable e.g. ’C:\my documents\data.xlsx’. Case 2: Reading a specific worksheet from the spreadsheet file. >>num_array=xlsread(filename,sheetid);

where sheetid is either an integer representing the sheet number of the sheet name as a character variable. Case 3: Reading a specific range of data from a specific sheet. >> num_array=xlsread(filename,sheetid,Range);

where Range is a character array that specifies the range in the worksheet. For example, if the range variable is ‘B1:F8’ then the function reads the data in the rectangular range between B1 and F8 from the specified worksheet.


Case 4: Reading three types of contents from the spreadsheet: numeric, text, and mixed. >> [num_array,textcell,misc_cell]=xlsread(filename);

In this case, the xlsread function reads the numeric contents into the num_array. The textcell is a cell array that contains all the text fields, and misc_cell is a cell array that contains both the numeric and text data. Case 5: Opening spreadsheet files interactively. >> ___=xlsread(filename,-1);

This command opens the specified spreadsheet to select data interactively. This is only available in a Windows environment. Case 6: Applying functions on the data in a spreadsheet file. >>[num_array,txtcell,misc_cell,fn_op]=xlsread(filename,sheetid,Range,'',fn);

where fn_op is the output returned from applying the function fn on the data in the specified worksheet. This operation is only available in a Windows environment as well. Case 7: Reading spreadsheet files on machines that do not have the Microsoft Excel application. >>num_array=xlsread(filename, 'basic');

9.2.1.4.2 Functions for reading in data from text files as numerical arrays/matrices Datasets are often available in the form of text files. MATLAB has several functions designed for reading data from text files. These functions are uiimport, readtable, textscan, dlmread, importdata, and csvread. In previous sections, uiimport, readtable, and textscan were discussed. This section explores dlmread, csvread, and importdata. dlmread function The dlmread function reads the data from ASCII-delimited numeric files into the MATLAB workspace as a numeric array. The user can specify, if desired, the type of delimiter, how many lines and columns to read from the file, and the bounded range of the data as illustrated in the following examples. These examples were applied on the Parkinson’s disease dataset downloaded from the UCI ML repository (Bache & Lichman, 2013). CodeEx.2: Reading text files using dlmread. Case 1: Load all the data in the file. >> loaded_data=dlmread(filename);

The dlmread function automatically detects the type of delimiter and treats double spaces as a single delimiter (Hanselman & Littlefield, 2011). This command works only if


FIG. 9.4 Different uses of dlmread function.

the file contains only numerical characters. For text files that contain headings and nonnumeric content, the following examples are applicable. Case 2: Specifying start offset. >> loaded_data = dlmread(filename, ',', Ri,Ci);

where Ri and Ci are the offset row and column numbers, respectively, which specify the location that the dlmread function will start reading from. This feature allows users to skip headers in data files. Case 3: Reading a rectangular section of the file. >> loaded_data = dlmread(filename, ',',[Ri, Ci, Rend, Cend]);

In this command, the function only reads the data bounded by [Ri, Ci, Rend, Cend], as demonstrated in Fig. 9.4. dlmread also imports complex numbers as a complex number field (Hanselman & Littlefield, 2011). See (Al-Jabery, 2019).
csvread function This function is specifically used for text data files that contain comma-separated values (CSV). It is a special case of the dlmread function. The csvread function provides all the options that dlmread provides, except that there is no need to specify the delimiter. csvread combines both the offset and the rectangular section (i.e., Case 2 and Case 3 of dlmread) in one command. It also reads complex numbers the same way dlmread does. Fig. 9.5 illustrates the execution of the csvread function.
importdata function This is a powerful MATLAB function that generalizes beyond text files. It loads data from various types of files (MAT-files, text-based, images, and audio files) as a numeric array into the MATLAB workspace.

FIG. 9.5 The execution of different commands using csvread.


FIG. 9.6 Importing images and displaying them in MATLAB using imread and image respectively.

The various syntaxes of the importdata function are illustrated in CodeEx.3.
CodeEx.3: Using the importdata function in MATLAB.
>> M=importdata(filename); % loads data from file into M
>> M=importdata('-pastespecial'); % loads data from the clipboard into M
>> M=importdata('__',delimiter,headerlines); % delimiter: the column separator in the ASCII file; headerlines: number of header lines

The importdata function determines the delimiter and/or the number of header lines if they are written as output arguments (MathWorks, 2012), as shown below. >> [M,delimiter,No_headerlines]=importdata(input arguments);

9.2.1.4.3 Reading images in MATLAB The imread function reads in image data files as numerical array into MATLAB workspace. Images in MATLAB are processed as a data matrix associated with a colormap matrix. In general, there are three types of image matrices: indexed, intensity, and true color (RGB) images. The imread function reads the image specified by the filename and automatically detects the image format from its contents. The function returns the image data as a numerical array, the associated colormap and the image transparency (the image transparency is only applicable for PNG, CUR and ICO image formats). Fig. 9.6 illustrates an example for reading and displaying an arbitrary MRI picture. The image function is used to display the corresponding image to the screen, as depicted in Fig. 9.6.

9.2.2 Reading data in Python

Python (Somers, 2008) is currently the most popular programming language in the field of data analysis, probably attributed to the free license and the vast amount of available libraries. The advantage of Python lies in the powerful tools (e.g., modules, data


structures, libraries, and functions) it offers to programmers and data analysts. In this section, we provide an overview of the libraries that are integrated with or can easily be added to Python, such as Pandas, SciPy, etc., as well as basic functions and libraries used for data reading operations. Jupyter Notebook was utilized as the integrated development environment (IDE) for all the examples presented in this chapter. Readers can also apply the provided code examples in other IDEs, such as Enthought Canopy or NetBeans, or even in any text editor, such as Notepad++.

9.2.2.1 Overview of external libraries and modules for Python
This section reviews the most popular libraries that are integrated into the Python environment for data processing and ML in general. These libraries are:
a) Numpy: The numerical Python module Numpy is the foundational package for scientific computing in Python. It represents the base for other important libraries such as Pandas and Scikit-learn. Numpy provides a powerful N-dimensional array object, ndarray(), broadcasting functions, integration tools for C/C++ and FORTRAN, and miscellaneous capabilities such as linear algebra, Fourier transforms, and random number generation. Numpy provides convenient data structures to read and store arbitrary types of data, which accelerates the process of integration with databases. It also shares many concepts with MATLAB, which gives a level of comfort to MATLAB programmers who use Python. Though Numpy is usually available with most Python installation packages, it has to be imported before its functions can be utilized.
b) Matplotlib: The Matplotlib library allows Python programmers to generate 2D or 3D plots of arrays. Matplotlib generates plots with commands very similar to those available in MATLAB, yet it is independent of MATLAB and used in the Python environment. This library was built upon the Numpy extension of Python. It allows Python users to generate and present their results in nicely organized figures, similar to MATLAB figures.
c) SciPy: The scientific Python module SciPy consists of collections of powerful functions and mathematical algorithms that were built on the Numpy extension of Python. It provides more power to the interactive Python session by adding high-level commands and classes for visualizing and modifying data. SciPy allows Python to compete with data analysis systems such as MATLAB, R-Lab, Octave, and SciLab. It also provides Python with the ability to benefit from applications written in other programming languages and to use database subroutines. The SciPy library consists of many sub-packages, each dedicated to a specific branch of science. Some of these packages are: 'cluster' contains various clustering algorithms, 'io' deals with input and output operations, 'ndimage' is for processing multi-dimensional images, 'optimize' covers optimization and root-


finding procedures, 'stats' includes statistical distributions and functions, 'integrate' covers integration and ordinary differential equation solvers, and so on.
d) Pandas: The Pandas library (https://pandas.pydata.org/) provides Python with efficient and powerful tools for data analysis by combining the high-performance array computing power of Numpy with the flexibility of data manipulation from databases and spreadsheets. Pandas consists of labeled array data structures, indexing objects enabling simple and multi-level indexing, integrated tools for dataset transformation, data import and export tools, efficient data storage tools, and moving-window statistics. In some respects, Pandas data structures emulate those provided by MATLAB; this can be observed through the column-wise operations that can be applied to data frames in Pandas. Pandas provides spreadsheet and csv file readers.
e) Scikit-Learn: Scikit-learn is Python's ML module that contains valuable tools for implementing ML algorithms. These packages enable non-specialists to apply the different ML algorithms without needing to dive deep into the basic coding of the algorithms. The tools developed in this library were designed upon Numpy, Pandas, and Matplotlib. In general, Scikit-learn consists of six major sections: classification, regression, clustering, dimensionality reduction, model selection, and data processing (Pedregosa et al., 2012).

9.2.2.2 Opening files in Python
To open files in Python, the user utilizes the 'os' interface library to communicate with the operating system regarding the location of the file. The 'os' package is essential for file I/O operations. CodeEx.4 illustrates how to browse through folders in Python.
CodeEx.4: changing the working directory and I/O file operations in Python.
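A minimal sketch of what CodeEx.4 describes (the folder path is a hypothetical example):

import os

print(os.getcwd())                      # show the current working directory
os.chdir('C:\\Users\\analyst\\data')    # change directory; note the double backslashes on Windows
print(os.listdir('.'))                  # list the files in the new working directory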

Note that in the os.chdir() command the double backslash is mandatory in the folder path. Once the file location is identified, the file can then be opened using the built-in command open(). The default mode for opening files is reading mode, as indicated by the mode flag, which is set to 'r' as shown below.
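For example, assuming the file test.txt of Fig. 9.7 is in the working directory:

f = open('test.txt', 'r')   # default mode flag is 'r' (read)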


The opening mode of the file can be changed to write ('w'), read in binary ('b'), exclusive creation of new files ('x'), append at the end of the file ('a'), open in text mode ('t'), or open for updating (reading and writing) ('+'). The example below illustrates opening the file in append mode.
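A sketch of this operation, again using the test.txt file of Fig. 9.7 (the appended text is arbitrary):

f = open('test.txt', 'a')       # append mode: new writes are added at the end of the file
f.write('appended line\n')
f.close()                       # always close the file when done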

It is very important to close the file, once done with it, using the close() function.

To avoid errors during code execution, it is recommended to use try...except with file operations. Note: the 'sys' module must be imported before using this statement. CodeEx.5: The use of the try...except statement when opening files in Python.
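A minimal sketch consistent with this description (the file name is again the example test.txt):

import sys

try:
    f = open('test.txt', 'r')
except OSError:
    print('Could not open file:', sys.exc_info()[0])
    sys.exit(1)
else:
    print(f.read())
    f.close()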

Another useful approach is to use Python’s file handling capabilities via the with statement, so that the opened file will be closed automatically after finishing the list of commands.
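For example, iterating over the same test.txt file:

with open('test.txt', 'r') as f:   # the file is closed automatically when the block ends
    for line in f:
        print(line.strip())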


9.2.2.2.1 Reading text files in Python
This section explores the process of reading data from different types of files in Python. As shown in the previous section, it is recommended to open files using error handling (try...except) and with statements. For reading files, there are two functions: read() and readline(). The read() function reads the entire file if no numeric input is specified, while the readline() function reads the file line by line. To demonstrate the use of these functions, we employ the simple three-line text file 'test.txt' shown in Fig. 9.7. CodeEx.6 demonstrates the different types of reading operations.
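A sketch of CodeEx.6, consistent with the description above (the contents of test.txt are those shown in Fig. 9.7):

with open('test.txt', 'r') as f:
    print(f.read())          # read() returns the whole file as one string

with open('test.txt', 'r') as f:
    print(f.read(10))        # read(n) returns at most n characters

with open('test.txt', 'r') as f:
    print(f.readline())      # readline() returns the next line only
    print(f.readline())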

Other functions for modifying, writing, and editing text files in Python include write(), writelines(), and seek(). For more coding examples on the use of these functions, see (Al-Jabery, 2019). These functions are ideal for small data files, but, for large amounts of data, using them will be tedious and require a tremendous amount of parsing operations to

FIG. 9.7 The contents of a simple text file used in the Python example.


extract the data. Therefore, Python allows the use of the JavaScript Object Notation (JSON) module (Ong et al., 2013) to perform the different parsing operations. For more information about the JSON module, the reader can visit the Python documentation (Somers, 2008). Besides JSON, there is the regular expression module, which provides many facilities for extracting information from datasets. The regular expression module is imported using the following command in the Python environment: >>> import re

9.2.2.2.2 read_csv() function
This function utilizes the Pandas library to read data from csv files. It assumes comma-separated data by default, but it can also be configured to read data with other delimiters, such as semicolons, spaces, or any other delimiter. The code example below demonstrates reading the activity recognition dataset collected from healthy older people using a battery-less wearable sensor (Shinmoto Torres, Ranasinghe, Shi, & Sample, 2013). This dataset is in csv format. CodeEx.7: Reading csv files in Python.
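A minimal sketch consistent with the description (the local file name and the absence of a header row are assumptions about the downloaded file):

import pandas as pd

# Assumed local file name for one record of the activity recognition dataset.
df = pd.read_csv('activity_recognition.csv', header=None)
print(df.shape)
print(df.head())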

An interesting and useful feature of the imported data is that it is stored as a data frame object whose columns can be accessed individually, which allows the user to plot and manipulate the data in a column-wise manner. Fig. 9.8 illustrates the visualization feature of data frames.
9.2.2.2.3 Other read functions
Pandas has some additional functions for reading data. The read_table() function (McKinney & Team, 2015; Pandas, 2015) is an efficient method for parsing data from text files into a data frame. It has the same input and output parameters as read_csv(). Code examples are available in (Al-Jabery, 2019). The read_excel() function can be applied to read Microsoft Excel spreadsheets (as an alternative to using the xlrd API directly). This function reads spreadsheets as a data frame or a dictionary of data frames, similar to the read_csv() function. The read_excel() function has many options that can specify the sheet name, the number of header lines to be ignored, rows to


FIG. 9.8 Example of plotting data in Python.

be skipped, footer rows, and other parameters. The user can provide the function with a list of column names to use or a dictionary of functions to be applied to the columns. For more information, see the Pandas read_excel() documentation. CodeEx.8 illustrates a basic reading operation on a dataset (Bache & Lichman, 2013) using the read_excel() function. CodeEx.8: Reading Excel files in Python.
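A hedged sketch of these options follows; the workbook name ‘dataset.xlsx’, the sheet name, and the column names are assumptions, and passing column names to usecols requires a reasonably recent Pandas version.

import pandas as pd

# read the first sheet of a workbook into a DataFrame
df = pd.read_excel('dataset.xlsx')

# read a named sheet, skip two header rows, and keep only selected columns
df_sel = pd.read_excel('dataset.xlsx', sheet_name='Sheet1',
                       skiprows=2, usecols=['age', 'weight', 'label'])

print(df_sel.head())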

9.2.3 Handling big data in MATLAB

The MATLAB functions discussed thus far in this chapter were all designed to load data into the memory directly. These functions cannot handle the rapid increase in the datasets’ sizes. Big data is becoming ubiquitous for biomedical applications, as new


technologies in medical imaging, colonoscopy, CT scanning and even health records generate massive amounts of data (Margolis et al., 2014). A large dataset can be defined as one that exceeds the size of the available memory or takes a very long time to process (De Mauro, Greco, & Grimaldi, 2015). A data store is a MATLAB repository for storing, managing and distributing very large datasets (Inmon, Zachman, & Geiger, 1997). MATLAB provides the datastore and tall functions to generate data stores and tall arrays, data structures that make such data accessible without loading it entirely into memory. This section explores the functions that create and manipulate big data in MATLAB.

9.2.3.1 How to create data stores in MATLAB The datastore function creates a repository for data collections that are larger than the available memory. It allows MATLAB to read and process data files that are stored on auxiliary memory. All the files in a data store must be the same type. The data store can contain one or more files. The general form of the command to create a data store repository is: >> ds=datastore(source,'type',value);

where source is either a single file name or a directory that contains more than one file. If the source is a single file, there is no need to specify the type of data. The variable ‘type’ specifies the data type that will be stored in the data repository. The different types of data supported by the data store are shown in Table 9.1. The ‘type’ property in the datastore function is mandatory unless the specified folder consists of only one type of file or the data store is only one file. Fig. 9.9 illustrates the effect of using different values for the ‘type’ property with the datastore function when it is applied to the same directory. The datasets used in this example (Cole & Fanty, n.d.; Janosi, Steinbrunn, Pfisterer, & Detrano, 1988; Tsanas, Little, McSharry, & Ramig, 2010) were obtained from the UCI data repository (Bache & Lichman, 2013). The examples illustrate the importance of specifying the data type of the data store so that it is loaded correctly with the appropriate data files.

Table 9.1  Different values for the ‘type’ property in the datastore function.

Value of ‘type’   Description
’tabulartext’     Text files containing tabular data. The encoding of the data must be ASCII or UTF-8.
’image’           Image files in a format such as JPEG or PNG. Acceptable files include imformats formats.
’spreadsheet’     Spreadsheet files containing one or more sheets.
’keyvalue’        Key-value pair data contained in MAT-files or Sequence files with data generated by MapReduce.
’file’            Custom format files, which require a specified read function to read the data.
’tall’            MAT-files or Sequence files produced by the write function of the tall data type.
’database’        Data stored in a database. Requires Database Toolbox™ and specification of an additional input argument when using the type parameter.


FIG. 9.9 The effect of using different values for the type property with the datastore function and the general structure of the data stores in MATLAB.

To load data from the data store into the MATLAB workspace, there are several tools that can be used. Data stores’ structures have several properties such as hasdata, Readsize, and others. These properties are useful in determining the proper way of reading the contents of data stores. The data store properties can be seen through the


FIG. 9.10 Accessing the list of properties of the data store ds that was created in the example in Fig. 9.9.

autocomplete facility in MATLAB 2016a and higher, as illustrated in Fig. 9.10. Through these properties, the process of reading from the data stores can be configured in the most compatible form with the remaining sections of the code. For example, the user can set the size of the data block to be read at a time, check whether all the data has been read or not and other features as discussed in (Al-Jabery, 2019). MATLAB data stores will be revisited in Section 9.3 to discuss their missing data handling properties. There are several functions for data reading and data management in MATLAB data stores. These functions are: read, readall, preview, partition, numpartitions, hasdata, and reset. These functions are discussed in the following section as they are very useful in utilizing the datastores in MATLAB (Fig. 9.10). 9.2.3.1.1 read function The read function reads a specific block of data from MATLAB data stores. The size of the block is specified using the ReadSize property, as shown in Fig. 9.11. If a subsequent call

FIG. 9.11 The ReadSize property of the MATLAB data store specifies the number of lines that the read function reads.


for the read function is executed, then it reads the data from the end of the previous block, iteratively, until there is no remaining data (i.e., the hasdata flag becomes false). 9.2.3.1.2 readall function Unlike the read function, the readall function reads the entire data from the data store and loads it into the memory (similar to the functions discussed in Section 9.2.1). This function is not recommended for reading large data files (note that the term large is specified by the available memory on the active machine). The general syntax of the readall function is: >> data = readall(ds);

9.2.3.1.3 hasdata function The hasdata function determines whether data is still available for reading or not in the data store. It returns a true or false output. This function clearly takes its value from the hasdata property of the datastore object, which was mentioned earlier. This function is integrated with the read function to read the contents of a data store sequentially and without overloading the memory. The general syntax of this function is shown in the example illustrated in Fig. 9.12. The figure also shows the final results of executing the read function. 9.2.3.1.4 partition function The partition function divides data stores into sub-data stores based on a prespecified numeric value or by using the number of files included into the data store itself. The general syntax of this function is: >> sub_ds = partition(ds, p_mod, id);

where sub_ds is the resulting data store, ds is the original data store, p_mod is a variable that specifies the partitioning type, and id is an identification variable that specifies the sub-partition that will be retrieved by the function (id can be an integer or a file name).

FIG. 9.12 The syntax and implementation of the hasdata and the read functions.


If p_mod is numeric, then the data store will be divided into p_mod number of sub-data stores. However, if p_mod is set as ‘Files’(as shown below), then the data store will be divided according to the files it contains (i.e., the data store will be divided into sub-data stores that each contain only one file). >> sub_ds = partition(ds, ‘files’, id);

The partition function is important when using parallel programming techniques, which reduces the MATLAB execution time. The parallel pool MATLAB toolbox allows some actions to be performed on specific partitions of the data stores, accelerates the search process by assigning a processing unit (i.e., worker) for each partition, and reduces the calculation time. For code examples about the partition function, visit the code repository. 9.2.3.1.5 numpartitions function The numpartitions function can perform one of two actions: it either determines how many partitions a datastore has, or it specifies the number of partitions the datastore is partitioned to, based on the pool of parallel workers available. The syntax of the numpartitions function is: >> Pn=numpartitions(ds, no_workers);

Where Pn is an integer output that specifies the data store’s number of partitions. The datastore ds will be divided using the available parallel workers (i.e., processors) specified by the no_workers variable. Note that the input argument no_workers is optional. If not specified, the function numpartitions will return the default number of partitions available in ds, as illustrated in the example in Fig. 9.13. After reading or partitioning a datastore, its original form and settings can be restored by using the reset function. Its syntax is shown below: >> reset(ds);

FIG. 9.13 The execution of the numpartitions function on the datastore ds created in Fig. 9.9.


This function resets the datastore ds to the state it was in before any data was read from it. This is very important to remember after performing any operation on the data stores in MATLAB, if it is desired for the read operation to return to start at the first block of the data.
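MATLAB data stores have no single counterpart among this chapter’s Python examples, but a rough analogue of the read/hasdata loop can be sketched with Pandas’ chunked reading. The file name ‘large_data.csv’ and the chunk size below are illustrative assumptions, not part of the chapter’s examples.

import pandas as pd

# iterate over a large csv file in blocks of 10,000 rows, roughly analogous
# to calling read on a datastore until its hasdata flag becomes false
running_total = 0
for chunk in pd.read_csv('large_data.csv', chunksize=10000):
    running_total += len(chunk)

print('rows processed:', running_total)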

9.2.3.2 Tall arrays Any array that has more rows than the available memory is called a Tall array. Tall arrays are used to work with out-of-memory data that is backed by a data store. MATLAB datastores enable the user to work with large datasets in small chunks that individually fit in the memory instead of loading the entire dataset into the memory at once. Tall arrays extend this capability so the user can work with out-of-memory data using common functions. Tall arrays can be of any type numeric, cellular, categorical, strings, datetimes, durations, or calendar durations. MATLAB functions work with tall arrays in the same way they work with in-memory arrays, with the exception that tall arrays remain unevaluated until requested via the gather function. This deferred evaluation allows users to work faster with large datasets because MATLAB combines the queued operations when possible to achieve the minimum number of passes through the data. To create a tall array, use the tall function. The execution of this function depends on its input. The function takes two types of inputs: data stores and arrays. In the case of data store, if the input was tabular data then the function will return a tall table. Otherwise the output will be a tall cell array. The function converts an in-memory array into a tall array. >> TA=tall(ds); % input is a datastore >> TA=tall(A); % input is an in-memory array

As mentioned earlier, to perform the queued operations on tall arrays use the gather function. This function takes an unevaluated tall array as its input and returns the output as an in-memory array. The function takes single and multiple tall arrays as its input, as shown in the example below. >>Y=gather(TA); % single tall array >>[Y1,Y2,Y3,..,Yn]=gather(TA1,TA2,TA3,..,TAn);% multiple tall arrays

TAi (1 ≤ i ≤ n) represents operations that have been queued on tall arrays. However, if it is unclear whether the output of gather fits into the memory or not, it is recommended to use gather(head(X)) or gather(tail(X)). These two functions perform the full calculations but bring only a small portion of the result into the memory.

9.3 Data preprocessing To accelerate implementation and enhance performance of ML algorithms, usually a set of data preprocessing techniques have to be conducted first (see Chapter 2). This section


illustrates the functions that are available in MATLAB for data preprocessing. For further information, see Julia Evans (n.d.), McKinney and Team (2015), Open-Source (2016), Shanila (2018).

9.3.1 Missing values handling

One of the most important issues in data processing is how to handle missing values in datasets. Fortunately, MATLAB and Python both provide multiple methods for handling missing values. There are several functions in MATLAB for data processing. Some functions are dedicated to handling missing values, while others have parameters that perform data cleansing during the reading operation. The data cleansing operation during data importing is performed either through the GUI, as in the uiitool, or through property-value input arguments, as in MATLAB datastores. In Python, the NumPy and Pandas libraries provide several functions for handling missing values that will be discussed briefly in the following sections.

9.3.1.1 Handling missing values during reading This process is performed in MATLAB when using the uiitool, datastore, readtable, xlsread, and most of the reading functions available in MATLAB. The following examples show how to process missing data in readtable() and datastore, assuming the user is working on a file named ‘data.txt’. >> X=readtable('data.txt','TreatAsEmpty',{'N/A','-'});

This command instructs MATLAB to treat any data entry that is either ’N/A’ or ’-’ as a missing value, which is referred to as NaN in MATLAB.

This command illustrates how to preprocess the data in data stores. First, it treats all undefined entries as missing values. Then, it replaces all missing values with a zero. Furthermore, some MATLAB functions can tolerate missing values during execution. This is achieved by passing the ‘omitnan’ option (the nanflag argument), which is necessary when applying functions to data that may contain missing values. All statistical calculation functions, such as sum, mean, std, etc., have this flexibility. Fig. 9.14 illustrates an example using the nanflag.
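For comparison, a hedged Python sketch of the same idea uses the na_values parameter of Pandas’ read_csv; the file name and the sentinel strings below are assumptions.

import pandas as pd

# treat 'N/A', 'NA' and '-' as missing values while reading,
# then replace the resulting NaN entries with zero
df = pd.read_csv('data.txt', na_values=['N/A', 'NA', '-'])
df = df.fillna(0)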

9.3.1.2 Finding and replacing missing values MATLAB also provides several functions for processing missing values in data. These functions are ismissing to find the missing value indices in data, fillmissing to replace


FIG. 9.14 The effect of the nanflag on functions.

missing values with a specified value, isnan to find NaN values in data, and rmmissing, which removes missing values from data. There are also functions to create missing values, such as missing and standardizedMissing. The following example illustrates simple code lines for removing missing values using MATLAB (Sharma & Martin, 2009). CodeEx.9: Finding and replacing missing values in MATLAB.

>> idx = find(isnan(X));        % finds missing values in data vector X
>> idx = find(~isnan(X));       % finds non-missing values in data vector X
>> X = X(idx);                  % keeps only the non-missing values
>> X(isnan(X)) = [];            % removes missing values from vectors
>> Y(any(isnan(Y)'),:) = [];    % removes any row from matrix Y that has missing value(s)

However, these are not the only available options for handling missing values. Several customized functions were developed to implement specific imputation processes. (See Chapter 2 for more details on data cleansing and data imputation.) These functions are available in Al-Jabery (2019). Alternatively, Python has several data processing functions too, built on the NumPy and Pandas libraries. CodeEx.10 shows the processes of finding, replacing, imputing, and removing missing values in the Python environment. The dataset (Tsanas et al., 2010) used here is publicly available at the UCI ML repository (Bache & Lichman, 2013).


CodeEx.10: Data preprocessing in Python.

After removing the missing values:

Replacing all zeros with NaN:

However, it is useful to convert the imported data into an array (especially for those who also work with MATLAB). This can be achieved with NumPy’s ndarray type (e.g., via numpy.array()), as illustrated in CodeEx.11, which shows how to import numpy and how to create a multidimensional array (i.e., an ndarray).


To impute the original missing values with the mean imputation method, the .fillna() method is used (see Chapter 2 for details): CodeEx.12: Data imputation and replacement in Python.

As shown in the previous coding examples, Python keeps track of the original indices of the dataset without the need for additional coding. To determine the existence of missing values in datasets, the isnan() method is required, as shown in CodeEx.13. CodeEx.13: Finding missing values using Numpy library.

Furthermore, scikit-learn (Pedregosa et al., 2012) provides an imputation class, Imputer() (named SimpleImputer in current releases). By default it replaces the missing values in the data with the mean of the columns they were found in, as shown in CodeEx.14. CodeEx.14: Data imputation using the sklearn imputer class.

After applying the imputation process to the dataset, all the nan values have been erased from the dataset.
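A condensed sketch of these missing-value steps is given below; the csv file name is assumed, all columns are assumed to be numeric, and SimpleImputer is the current name of the scikit-learn imputation class mentioned above.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('parkinsons_updrs.csv')

# locate and remove rows that contain missing values
print(df.isna().sum())            # count of missing entries per column
df_clean = df.dropna()

# mean imputation with Pandas
df_mean = df.fillna(df.mean())

# mean imputation with scikit-learn (column-wise means)
imp = SimpleImputer(strategy='mean')
data_imputed = imp.fit_transform(df.values)

# NumPy view of the same check on the numeric array
print(np.isnan(df.values).any())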

9.3.2 Normalization

Data normalization is essential for implementing ML algorithms and improving their results as discussed in Chapter 2. This section presents the functions that perform normalization in MATLAB. The main function for data normalization in MATLAB is the normalize function which was released in MATLAB R2018a. This function provides the ability to use all types of data normalization tools. The general form of the normalize function is shown below: >>norm_data=normalize(input_data);

By default, this function returns the vector-wise z-score for all data points. Fig. 9.15 (below) shows the execution of the normalize function and the different results obtained with each method. However, as this function is only available in recent MATLAB versions, this section also discusses the basic functions used for normalizing data in earlier versions of MATLAB, with examples given in Al-Jabery (2019). These functions are integrated as methods in the normalize function.

9.3.2.1 z-score The zscore function returns the z-score (Sopka, 1979) of the data. In other words, it converts the data vectors so they have a mean of zero and a unity standard deviation. Its general format is >> normalized_data=zscore(data)

FIG. 9.15 Demonstration of using the normalize function in MATLAB R2018a.


This book’s authors have developed a general normalization function for earlier versions of MATLAB, which is listed in CodeEx.15. CodeEx.15: Normalization in MATLAB.

function [norm_data]=normalize_dset(raw_data)
% this function normalizes a dataset based on min-max scaling
[n,ftrs]=size(raw_data);
norm_data=zeros(n,ftrs);
for i=1:ftrs
    tmp=raw_data(:,i);
    if sum(tmp)>0
        ftrmin=min(tmp);
        ftrmax=max(tmp);
        norm_data(:,i)=(tmp-ftrmin)/(ftrmax-ftrmin);
    else
        disp(['feature: (',num2str(i), ') is an empty feature']);
    end
end
end

In Python, the scikit-learn library provides the preprocessing module, which contains the normalize() and scale() functions for data normalization and standardization, respectively. The normalize() function rescales each sample (row) to unit norm; rescaling features into the range [0, 1] can instead be done with minmax_scale() or the MinMaxScaler class. Standardization, in turn, shifts the distribution of each variable to have a mean of zero and a unit standard deviation. CodeEx.16 illustrates the implementation of these functions. CodeEx.16: Preprocessing data in Python using the normalize and scale functions.
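A minimal sketch of these preprocessing calls follows; the small array X is a made-up example.

import numpy as np
from sklearn.preprocessing import normalize, scale, minmax_scale

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_unit   = normalize(X)        # each row rescaled to unit (L2) norm
X_std    = scale(X)            # each column: zero mean, unit standard deviation
X_minmax = minmax_scale(X)     # each column rescaled to the [0, 1] range

print(X_std.mean(axis=0))      # approximately [0, 0]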

9.3.3 Outliers detection

Outliers have already been defined and discussed in Chapter 2. This section discusses the tools available for detecting and processing outliers. The detection and handling of outliers is vital because of their effect on data analysis and especially clustering. MATLAB provides the built-in isoutlier and filloutliers functions for finding and replacing outliers, respectively. The process of detecting outliers is demonstrated in CodeEx.17 below using a customized function that detects the outliers and returns them along with their indices.


CodeEx.17: An embedded function for outliers’ detection in MATLAB.

function [outliers, oinds] = find_ols(datain, th)
%% this function was designed to find outliers
% datain:   input array of cluster labels
% th:       the maximum length of a cluster to be considered outliers
% outliers: data samples that have been detected as outliers
% oinds:    outliers indices
cls = unique(datain);            % determine number of clusters
outliers = [];
oinds = [];
for i = 1:length(cls)
    inds = find(datain == cls(i));
    if length(inds) > th
        continue;                % large clusters are not treated as outliers
    else
        outliers = [outliers; datain(inds)];
        oinds = [oinds; inds];
    end
end
end

MATLAB’s built-in kmeans function clusters a dataset with the following basic syntax:

>> lbls = kmeans(data,k)

where lbls is an (n x 1) column vector of labels that indicates the cluster index for each sample in the (n x m) dataset data, and k is the number of clusters that the kmeans algorithm will separate the data into. There are several options in the kmeans function’s input parameters in MATLAB. The most important ones are the distance metric and the replicates. The distance metric can take any of the following values: ‘sqeuclidean’ (squared Euclidean distance), ‘cityblock’ (sum of absolute differences), ‘cosine’ (treats points as vectors), ‘correlation’ (treats points as sequence of values), and ‘hamming’ (which is only for binary data points). The replicates option allows the kmeans function to repeat the clustering process using newly initialized cluster centroids. Other parameters, such as the maximum number of iterations ‘MaxIter’, are also available for this function. The following code example demonstrates the effect of using different options (as shown in Fig. 9.16) on the resulting clusters from using kmeans on a plant’s pollutants dataset (Bagheri, Al-jabery, Wunsch, & Burken, 2019).


CodeEx.19: Clustering using kmeans with multiple options in MATLAB.

%% example read dataset and cluster them using kmeans
%1- using default settings
cl_idx=kmeans(normalized_data,2);
gscatter(pcad(:,1),pcad(:,2),cl_idx,'rb','++')
%2- using tuned options
figure;
cl_idx=kmeans(normalized_data,2,'Distance','correlation','Replicates',5);
gscatter(pcad(:,1),pcad(:,2),cl_idx,'rb','++')


FIG. 9.16 The effect of changing kmeans input parameters. Output 1: Default Settings. Output 2: Tuned Parameters.

Furthermore, kmeans can also return the centroids’ positions, the sum of the point-to-centroid distances, and the distances from each point to every centroid. In MATLAB R2016 and later, the kmedoids function is available; it has the same input and output parameters as kmeans but performs the k-medoids algorithm instead. See Chapter 4 for a discussion of the k-medoids algorithm. In Python, the class sklearn.cluster.KMeans performs k-means clustering on the data. CodeEx.20 illustrates the k-means implementation in Python (Fig. 9.17). CodeEx.20: Clustering using kmeans in Python. The output of this code is illustrated in Fig. 9.17.
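The listing of CodeEx.20 itself is not reproduced here; a minimal sketch of the same workflow on the Parkinson’s dataset could look as follows, where the file name is assumed and PCA is used only to obtain a 2-D plot.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale

data = pd.read_csv('parkinsons_updrs.csv').values
X = minmax_scale(data)

# cluster into two groups
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# project onto two principal components for visualization
pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1], c=labels)
plt.show()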



FIG. 9.17 Visual representation of the clusters resulting from applying the k-means algorithm to the Parkinson’s (Tsanas et al., 2010) dataset in Python (output of CodeEx.20).

9.4.1.2 Gaussian mixture model Another clustering approach is the Gaussian mixture model (GMM), which fits several n-dimensional normal distributions to the data (Fig. 9.18). To perform GMM clustering, the GMM model must be created first using the fitgmdist function, as shown in the following command: >> gmm=fitgmdist(data,k); % k: the number of distributions

Then, the data should be clustered and labels should be generated for each point using the cluster function as shown in the following command: >> lbls=cluster(gmm,data);

CodeEx.21: The applications of the GMM on the Parkinson’s dataset (Tsanas et al., 2010).

tbl=readtable('parkinsons_updrs.csv');                 % read dataset
data=tbl{:,4:end};
norm_data=normalize_dset(data);                        % normalize dataset
gmm=fitgmdist(norm_data,2,'RegularizationValue',0.1);  % generate the gm model
g=cluster(gmm,norm_data);                              % generate labels
[exp,pcad]=pca(norm_data);                             % generate pca
gscatter(pcad(:,1),pcad(:,2),g,'rb','++')              % visualize data

The regularization value shown in the example is necessary to avoid ill-conditioned covariance matrices in the data.


FIG. 9.18 n-dimensional normal distributions for the data.

Python performs the GMM analysis using the mixture class from the scikit-learn library. This class provides the GaussianMixture function, which is quite similar to the fitgmdist function described previously. The GaussianMixture function estimates the parameters of a Gaussian mixture distribution and takes several input parameters, as illustrated in Table 9.2. CodeEx.22: The implementation of GaussianMixture in Python.

import numpy as np
from sklearn import mixture
import pandas as pd

data = pd.read_csv('parkinsons_updrs.csv')
gmm = mixture.GaussianMixture(n_components=2, covariance_type='full')
gm_op = gmm.fit(data)

Table 9.2  Important input parameters for the GaussianMixture class function.

Parameter                        Syntax               Type      Values
Number of mixture components     ‘n_components’       integer   default 1
Covariance type                  ‘covariance_type’    string    ‘full’, ‘tied’, ‘diag’, or ‘spherical’
Convergence threshold            ‘tol’                float     default 0.001
Maximum number of iterations     ‘max_iter’           integer   default 100
Number of initializations        ‘n_init’             integer   default 1
Weights initialization method    ‘init_params’        string    ‘kmeans’ or ‘random’


9.4.1.3 Hierarchical clustering There are three steps to generate clusters hierarchically in MATLAB:

1. Calculate the pairwise distances between all pairs of data samples using the pdist function.
   >> Y = pdist(X,'cosine');
2. Determine the hierarchical structure of the data by using the linkage function.
   >> Z = linkage(Y,'ward');
3. Specify the cut level and generate labels using cluster.
   >> lbls = cluster(Z,'maxclust',int);

Steps 1 and 2 can be combined in one line:

>> Z = linkage(X,'ward','cosine');

CodeEx.23 represents a function that summarizes the hierarchical clustering process in MATLAB, and Figs. 9.19 and 9.20 show the dendrograms for a neuroimaging dataset (Obafemi-Ajayi et al., 2017) using Complete and Ward linkage, respectively. CodeEx.23: Implementation of hierarchical clustering in MATLAB.

function [labels]=find_HIC(inpdata,par,k)
% Hierarchical clustering
% inpdata: is the full dataset (n x m)
% par: type of linkage
% k: the cut level
% labels: is a vector that contains the label of each data point
Y=pdist(inpdata,'euclid');   % find pairwise distance
z=linkage(Y,par);
figure;
dendrogram(z,79)             % show dendrogram
if strcmp(par,'ward')
    ylim([0 10]);
else
    ylim([0 0.4]);
end
title(['Hierarchical clustering for the data before removing any outliers using ',par,...
    ' linkage']);
c=k;                         % k=3 in this example
T = cluster(z,'maxclust',c);
labels=T;
end


FIG. 9.19 Dendrogram of a neuroimaging dataset using Complete linkage.

FIG. 9.20 Dendrogram of a neuroimaging dataset using Ward linkage.


Scikit-learn provides Python users with all types of clustering tools, including a hierarchical clustering object known as AgglomerativeClustering(). CodeEx.24 illustrates the process of importing and using hierarchical clustering in Python.

The AgglomerativeClustering() object supports three types of linkage: complete, ward, and average. The differences between these linkages were discussed in Chapter 3. The required input parameters are the number of clusters and the linkage, as shown in CodeEx.24. There are several optional parameters, such as (1) affinity, which specifies the metric used for computing the linkage (the default is ‘euclidean’); (2) connectivity, which specifies the connectivity constraints; and (3) compute_full_tree, which controls whether construction of the tree stops early at the requested number of clusters.
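A minimal sketch of these calls is given below; the random matrix X merely stands in for a preprocessed numeric dataset, and in very recent scikit-learn releases the affinity argument is named metric.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# X stands in for a preprocessed numeric data matrix (random here for illustration)
X = np.random.rand(100, 6)

# three clusters with Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

# average linkage with a cosine affinity instead of the default Euclidean metric
agg_avg = AgglomerativeClustering(n_clusters=3, linkage='average', affinity='cosine')
labels_avg = agg_avg.fit_predict(X)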

9.4.1.4 Self-organizing map In addition to the material in this section, the free book (Kohonen, 2014) is essential reading for those interested in using SOMs with MATLAB. A SOM is a competitive NN designed to visualize high-dimensional data structures easily; see Chapters 3 and 9 for details on SOM. SOMs are basically clustering and dimensionality reduction tools. In MATLAB, the SOM functionality can be accessed from the neural network clustering app on the Apps tab in the MATLAB main menu, or by typing nctool in the command window, which opens the same app. The user then interactively loads the data, transposes it so that each observation is a column, and trains the network. This section shows how to construct and visualize a SOM using commands only. The MATLAB function below shows a typical procedure for creating, training, and visualizing a SOM. CodeEx.25: SOM implementation in MATLAB.

function somnet=create_SOM(TR_data,sz, No_epochs)
%% this function creates, trains and plots a SOM
% inputs:  TR_data: training data, sz=[m n]: SOM dimensions,
%          No_epochs: number of times to repeat SOM training
% outputs: somnet is the self-organizing map NN ready to be used on new data
net = selforgmap(sz);
net.trainParam.epochs = No_epochs;
X = TR_data';
net = train(net,X);
plotsomhits(net,X)
somnet=net;
end


The neighbor weight distances and the weight planes for each input can also be visualized by using the functions plotsomnd and plotsomplanes, respectively. These functions are similar to the plotsomhits function (used in the create_SOM demonstration function illustrated previously). They all take an NN object and transposed data as their inputs. Now, the resulting net from the previous function can be used to predict new data. This is performed using the following command: >> prds = netobj(Xnew); % where netobj is a SOM NN

The output, prds, has m*n rows for a map with an m-by-n arrangement of neurons (i.e., 100 rows if the SOM has a 10 x 10 arrangement) and has as many columns as there are observations in Xnew. Consequently, every element of each column is 0, except one, which has the value 1. Furthermore, the neuron number for each observation can be identified using the vec2ind function, as shown in Fig. 9.21. In Python, there are several packages that provide SOM implementations. Python multivariate pattern analysis (PyMVPA) is the most popular package that provides a SOM object (Hanke et al., 2009). After installing PyMVPA, the following code example demonstrates the implementation of the SOM algorithm in Python. This package provides the SimpleSOMMapper() function, which takes the dimensions of the SOM, the number of epochs, and the learning rate as input parameters. The SOM object is then created with attributes similar to those of the net object in the MATLAB example discussed previously. CodeEx.26 shows the process of creating a SOM object and training it. CodeEx.26: SOM implementation example in Python.

import matplotlib.pyplot as pl
from mvpa2.suite import *

som = SimpleSOMMapper((20, 30), 400, learning_rate=0.05)
som.train(Datatrain)
pl.imshow(som.K, origin='lower')   # to show the SOM
mapped = som(Datatrain)

FIG. 9.21 SOM predictions and neurons identifications.


9.4.2 Prediction and classification

Supervised learning problems have a definite goal in mind, typically predicting the output of an unknown input pattern based on the experience acquired from labeled input patterns. This requires multiple iterations of building and evaluating different models: choosing the model to use and then training it with the data. Once the model is trained, a performance evaluation needs to be conducted, typically by evaluating how accurately the model predicts the training data. Then, the model can be updated either by replacing it with a new one or by adjusting the current model’s parameters. (See Chapter 4 for an advanced discussion.) The model becomes ready to predict new data when a satisfactory level of performance is achieved. The first part of this section focuses on the tools required to perform the overall process of supervised learning, while the second part discusses the different functions that perform classification algorithms. This section focuses mainly on MATLAB functions, but it also provides tips on the equivalent tools in Python. For code examples in Python and MATLAB, see Al-Jabery (2019).

9.4.2.1 Machine learning workflow This part discusses the tools and functions required to implement the high-level computational approaches on any dataset. 9.4.2.1.1 Data preparation The data preparation process here has a different meaning than the one discussed in Chapter 2. In this section, data preparation means partitioning the data into training, validation, and testing subsets. This process is necessary to enable the model to generalize reasonably to new observations. The main function used for data preparation is cvpartition. This function allows the dataset to be divided into training and testing subsets in such a way that both hold an accurate representation of the complete dataset. The following is the general form of the cvpartition function: >> c = cvpartition(y,'HoldOut',p);

where c is a structure variable that contains information about the partition, y is a categorical vector that contains the classes of each observation, ‘HoldOut’: is an optional property indicating that the partition should divide the observations into a training and a test set, and p is a fraction of the data held out of the complete data to form the test set (between 0 and 1). The following function is presented to divide a dataset into 70% for training and 30% for testing so that the response classes are represented in approximately the same proportion in each partition.


CodeEx.27: Data partitioning tool in MATLAB.

function [traininginds,testinds]=prepare_data(response,p)
%% this function returns the indices for the training and test datasets
% response: a categorical vector that contains the classes of the observations,
%           e.g. response=datatable.class
% p: the percentage of the test set in decimal format, e.g. 35% = 0.35
cvpm=cvpartition(response,'HoldOut',p);
traininginds=training(cvpm);
testinds=test(cvpm);
end

The previous function produces the indices for the training and test samples, from which the complete subsets can be constructed. Note that it is recommended that the data be stored as a table object. However, when applying the function to matrices or any nontabulated data object, the response input for prepare_data can simply be an integer representing the number of samples in the dataset (i.e., n). The cvpartition function is also useful for partitioning datasets into folds for cross validation, as will be discussed later in the cross-validation section. In Python, the scikit-learn library provides all the necessary functions and tools for data processing and ML, as discussed earlier. The cross_validation package (deprecated in newer scikit-learn releases) and the model_selection package both provide the train_test_split function, which is equivalent to cvpartition. To import this function into the Python environment, use one of the following lines:

from sklearn.cross_validation import train_test_split
# or
from sklearn.model_selection import train_test_split

This function is widely used with fitting and cross-validation applications. It takes the predictors, the labels as a single column, the partitioning rate (test_size), and an optional random seed (random_state) as input parameters. The following code segment illustrates the use of the train_test_split() function.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

More code examples developed for this chapter are available in the code repository, and readers are advised to visit the scikit-learn documentation for extensive literature and demonstration examples. 9.4.2.1.2 Fitting and predicting tools MATLAB provides several fitting functions of the form fitc* (where * identifies the type of algorithm, such as discr, knn, nb, etc.) for implementing the learning algorithms, and the predicting operation is performed in MATLAB using the predict function. The following section reviews the different functions in MATLAB and Python that perform the most common classification algorithms. Note that it is assumed that the training data are stored as a table for all MATLAB operations. At the end of this section, a


Table 9.3  Different classification algorithms and their functions’ syntax in MATLAB and Python.

    Algorithm name                                                            MATLAB        Python/scikit-learn
1   k-Nearest neighbors (k-NN) classifier                                     fitcknn       NearestNeighbors
2   Naïve Bayes (NB) classifier                                               fitcnb        GaussianNB
3   Fit discriminant analysis classifier                                      fitcdiscr     LinearDiscriminantAnalysis
4   Train binary support vector machine classifier                            fitcsvm       svm
5   Fit binary classification decision tree for multiclass classification     fitctree      DecisionTreeClassifier
6   Fit linear classification model to high-dimensional data                  fitclinear*   LinearRegression
7   Fit multiclass models for support vector machines or other classifiers    fitcecoc      SVC, NuSVC, and LinearSVC

* fitclinear does not accept tables in its input parameters; it only accepts numerical arrays.

generalized function will be presented that can be used with all classification functions in MATLAB. The general form of the following functions in MATLAB is fitc*, and all of them require the same input parameters. Therefore, the general command is provided below; the reader just needs to replace the * with the appropriate syntax from Table 9.3.

>> mdl = fitc*(tbl,response);   % or
>> mdl = fitc*(dataX,Y);

mdl is the resulting classification model, tbl is the data table that contains all the observations, response is a text variable that gives the name of the responses column in the table tbl, dataX is the predictor data as a numerical array, and Y is the response vector.

mdl is the k-NN classifier model, tbl is the data table that contains all the observations, response is a text variable that represents the name of the responses column in the table tbl, dataX is the predictor of the dataset as a numerical array, and Y is the response vector. The fitcsvm trains, or cross validates, the support vector machine for two-class (binary) classification on low to moderate dimensional datasets. This function supports mapping the predictor data using variable kernel functions. Despite fitcsvm supporting only binary classification on a low to a medium number of dimensions, these two limitations were compensated by using fitclinear and fitcecoc, respectively. The following section shows how to use fitcsvm for the classification of multiclass support vector machines. After partitioning the data into training and test sets and creating the classification model, it is now time to test the model. The predict function generates a vector that contains the classes for each data observation in its input parameter. This function takes a prefitted classification model and data array or table as its input parameters, as illustrated in the following example. >> predictedgroup = predict(mdl,dataset);

In Python, all the classifier functions listed in Table 9.3 are available in the scikit-learn library. There are many code examples on the use of each of these functions in the scikit-learn documentation. Furthermore, a dedicated multifunction example was developed for this chapter and is available in Al-Jabery (2019).
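As an illustration of the common fit/predict pattern shared by the scikit-learn classifiers in Table 9.3, a short sketch on synthetic data follows; the random data and the choice of KNeighborsClassifier are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic two-class data standing in for a real dataset
X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)   # any classifier in Table 9.3 is fitted the same way
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print('test accuracy:', clf.score(X_test, y_test))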


9.4.2.2 Multiclass support vector machines To perform a multiclass SVM classification in MATLAB, the templateSVM and fitcecoc functions can be used as in the following code segment:

>> svmtmp = templateSVM('KernelFunction','polynomial');
>> mdlsvm = fitcecoc(Trainingdatatbl,'group variable name','Learners',svmtmp);

In Python, SVC, NuSVC, and LinearSVC are the functions designed for performing multiclass SVM classification. SVC and NuSVC are similar, with slightly different sets of input parameters, whereas LinearSVC is dedicated to the linear-kernel implementation of SVM. More examples are available in Al-Jabery (2019).

9.4.2.3 Neural network classifier Chapter 4 discussed the different NN classifiers; this section illustrates the implementation of feedforward NN classifiers in MATLAB. A feedforward NN is an arrangement of interconnected neurons that maps inputs to responses. In this arrangement, the number of neurons in the input layer is equal to the number of predictors (i.e., variables), and the number of neurons in the output layer is equal to the number of response classes. Fig. 9.22 depicts a typical feedforward NN classifier. A feedforward NN can be created interactively in MATLAB using the NN pattern recognition app, which is opened with the command nprtool. This section creates an NN using commands only. The NN requires a numerical target vector; therefore, when the responses are categorical, the dummyvar function can be used to convert them to numerical vectors. This function performs the data encoding operations discussed in Chapter 2.

>> targets = dummyvar(V_cat);


FIG. 9.22 Feedforward NN classifier architecture (MATLAB, 2013).


The demonstration function in CodeEx.28 creates and trains a pattern recognition NN, plots the confusion matrix, and finds the validation error. For more information on confusion matrices, please see Section 9.6.2. CodeEx.28: Building a FFNN in MATLAB.

function [val_err,FFNN]=create_FFNN(data,tg,p_val,p_test,h)
%% this function creates a typical FFNN.
% Inputs:  data: full data set; tg: responses; h: number of neurons in the hidden layer;
%          p_val: validation ratio (i.e., 0.15); p_test: test ratio
% Outputs: FFNN is a pattern recognition NN, val_err is the validation error
net = patternnet(h);     % specify neurons in hidden layers
X=data'; tg=tg';
net.divideParam.trainRatio = 1-p_val-p_test;
net.divideParam.valRatio = p_val;
net.divideParam.testRatio = p_test;
[net,tr] = train(net,X,tg);
scoreTest = net(X(:,tr.testInd));
tgtest=tg(tr.testInd);
[~,yPred] = max(scoreTest);
plotconfusion(tgtest,yPred);
val_err=100*nnz(yPred ~= double(tgtest))/length(tgtest);
FFNN=net;
end

The output from CodeEx.28 is shown in Fig. 9.23. The number of hidden layers and the number of neurons in each layer can be specified using the fitnet function (similar to the patternnet function used in the demonstration function) by using a vector as its input. For example, the following command creates an NN with two hidden layers with 5 and 6 neurons, respectively, and views the schematic diagram of the NN using the view command. The NN schematic diagram is shown in Fig. 9.24.

FIG. 9.23 Typical confusion matrix from create_FFNN.


FIG. 9.24 The schematic diagram of the configured NN using the view command.

>> FFnet = fitnet([5 6]);
>> view(FFnet)

Python also provides a convenient NN classifier module named MLPClassifier(), which is available in the scikit-learn library. This module implements a multilayer perceptron algorithm that trains using backpropagation (Hecht-Nielsen, 1989). CodeEx.29 illustrates the implementation of MLPClassifier(). CodeEx.29: Implementation of a multilayer perceptron classifier in Python.
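The listing of CodeEx.29 is not reproduced here; a minimal sketch of MLPClassifier on synthetic data might look as follows, where the random data, the hidden-layer size, and the solver settings are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# synthetic data standing in for a real feature matrix and label vector
X = np.random.rand(300, 10)
y = (X.sum(axis=1) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# one hidden layer with 20 neurons, trained with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=1)
mlp.fit(X_train, y_train)
print('validation accuracy:', mlp.score(X_test, y_test))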

9.4.2.4 Performance evaluation and cross-validation tools To evaluate the performance of the model, MATLAB provides the loss and resubLoss functions. Both of these functions calculate the classification error of a model. loss calculates the validation error, while resubLoss calculates the training error. The two functions are called in the following format:

>> training_error = resubLoss(mdl);
>> test_error = loss(mdl,testdata);

To create a model with cross validation in MATLAB, provide one of the following options in the fitting function. >> mdl = fitc*(data,'responseVarName','optionName','optionValue')


Table 9.4  Cross-validation options.

Option name    Option value           Type of validation
’CrossVal’     ’on’                   10-fold cross validation
’Holdout’      scalar from 0 to 1     Holdout with the given fraction reserved for validation
’KFold’        k (scalar)             k-fold cross validation
’Leaveout’     ’on’                   Leave-one-out cross validation

If a partition has been created using the cvpartition function, it can also be provided as an input to the fitting function as shown below. Table 9.4 lists the different parameter input combinations.

>> part = cvpartition(response_vector,'KFold',k);
>> kfoldmdl = fitcknn(data,'responseVarName','CVPartition',part);

To calculate the error of a cross-validated model in MATLAB, use the kfoldLoss function. >> cross_val_err = kfoldLoss(kfoldmdl);

CodeEx.31 represents a general function that is used to classify scalable datasets using any of the fitting functions discussed previously. This function is generalized to allow the user to choose the type of classification model and validation to be implemented. It may not be ideal, but it summarizes all the discussed techniques and functions. The code was written and its functionality was verified so that it is able to handle any type of data as an input. To call this function, use the following command for kfold SVM: >> [model,kfold_err] = classify_fit_demo(data,'Class','SVM','kfold')

In Python, cross-validation is performed using the cross_val_score function, as illustrated in CodeEx.30. CodeEx.30: Cross-validation in Python.

from sklearn.model_selection import cross_val_score
from sklearn import svm

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, data_variables, target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

CodeEx.31 summarizes all the functions discussed in this section. To call the general function, use the first line of the code. CodeEx. 31: Demonstration for all fitting and cross-validation tools discussed in this section.


>> [model,training_err,test_err]=classify_fit_demo(data,'Class','SVM','test')

function [mdl,training_err,test_err]=classify_fit_demo(data, response, fitc, validation_type)
%% This function is a generalized form of all fitting functions in MATLAB; it summarizes
% Section 9.4.2.
% Inputs:  data: the complete dataset
%          response: the name of the response variable (or its index if data is not a table)
%          fitc: a character variable that represents the type of fitting function
%          validation_type: type of validation, 'kfold' or train-test subset
% Outputs: mdl: generated classification model
%          training_err: training error (when normal partitioning is used)
%          test_err: either the k-fold or the test-set error
% For feedback and comments email: [email protected]
training_err=[]; test_err=[];
if istable(data)
    % find the location of the response variable
    names=data.Properties.VariableNames;
    respid=find_response(names,response);
    data.class=categorical(data{:,respid});  % store class vector at the end of the table
    data(:,respid)=[];                        % delete class vector from original location
    cls=data.class;
    inpdata=data{:,1:end-1};                  % create a numeric array from the table
else                                          % if the dataset is fed as an array
    cls=data(:,response);
    data(:,response)=[];
    inpdata=data;
end
if strcmpi(validation_type,'kfold')
    k=' ';
    while ~isnumeric(k)
        k=input('Please, enter the number of folds you want to divide the data to:');
    end
    pts=cvpartition(cls,'KFold',k);
    switch fitc
        case 'SVM'
            if length(unique(cls)) > 2        % multiclass: combine binary SVMs with fitcecoc
                mdl=fitcecoc(inpdata,cls,'CVPartition',pts);
            else
                mdl=fitcsvm(inpdata,cls,'CVPartition',pts);
            end
        % ... the remaining cases (e.g., 'knn', 'tree', 'NB') follow the same pattern
    end
    test_err=kfoldLoss(mdl);
else
    % ... train-test validation: fit the selected model on a holdout partition and
    % compute training_err = resubLoss(mdl) and test_err = loss(mdl, test data)
end
end

The same fitting functions return object-oriented classification models whose methods can be listed directly. For example, a classification tree exposes the predictorImportance method, which can be plotted to rank the predictors:

>> mdl=fitctree(data,'Class');
>> methods(mdl)

Methods for class ClassificationTree:

compact          crossval   edge     margin    predictorImportance    resubEdge   resubPredict
compareHoldout   cvloss     loss     predict   prune                  resubLoss   resubMargin
surrogateAssociation        view

>> bar(mdl.predictorImportance)

FIG. 9.25 Levels of importance for the predictors in the heart disease dataset calculated using the predictorImportance method of the classification tree function.

9.4.3.2 Sequential features selection The mechanism of the sequential feature selection approach is summarized by the following pseudocode:

1. Start
2. Build a fitting model
3. Choose reduced sets of predictors
4. Calculate prediction error perr
5. Add new predictor Pri
6. Recalculate prediction error perr_new


7. If perr_new < perr:
   - Set perr = perr_new; go to (5).
   Else
   - Remove Pri; go to (5).
8. Select the subset of predictors with minimum prediction error
9. End

The sequential feature selection function sequentialfs requires an error function that builds a model and calculates the prediction error. The error function needs to have the following structure:
- Four inputs representing the training and test predictors (matrices) and responses (vectors).
- A scalar output that represents the prediction error.

CodeEx.32: General template for an error function (MATLAB, 2013):

function err = errorFun(x_train,y_train,x_test,y_test)
% Create fitting model
mdl = fitc*(x_train,y_train);   % * means any fitting function
% Count the misclassified observations
y_pred = predict(mdl, x_test);
err = nnz(y_pred ~= y_test);
end

The sequentialfs function returns a logical vector that specifies whether each predictor is selected or not. The syntax of this function is

>> selected = sequentialfs(@errorFun,data_predictors,response);

There are several examples in Al-Jabery (2019) that show the effect of feature reduction. Some of the discussed classification and fitting tools are very sensitive to the training data (e.g., fitctree). Therefore, ensemble learning can be used to enhance the performance of any of the discussed classification functions. The main function for ensemble learning in MATLAB is fitensemble, which fits an ensemble of learners for classification and regression. The authors recommend exploring this function to get full details about the process of improving classifiers in MATLAB.

9.4.4 Features reduction and features selection tools in Python

Feature selection and extraction in Python are performed using the feature_selection and feature_extraction modules of the scikit-learn library. This section illustrates two of the most popular techniques for feature selection and their implementation in Python; several other methods are available as well.

9.4.4.1 Removing features with low variance The simplest approach for selecting features in Python is to use the VarianceThreshold function, which requires an elimination threshold as its input. CodeEx.33 illustrates how to implement this function.


CodeEx.33: Feature selection using the VarianceThreshold function.

from sklearn.feature_selection import VarianceThreshold

def Var_th(datainp, Th):
    sel = VarianceThreshold(threshold=(Th * (1 - Th)))
    return sel.fit_transform(datainp)

9.4.4.2 Recursive feature elimination Recursive feature elimination (RFE) selects features by recursively considering smaller and smaller subsets of features, using an external estimator that assigns weights to the features. Scikit-learn provides the RFE class for this purpose. The estimator is trained on the initial set of features; the features are then ranked based on the importance scores obtained from either the coef_ or the feature_importances_ attribute, and the lowest-ranked features are pruned from the current set. This process is repeated recursively until the desired number of features is obtained. CodeEx.34: Feature selection using the recursive feature elimination algorithm in Python.

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

def recursive_select(all_ftrs, target, estimatormdl, N):
    X = all_ftrs
    y = target
    rfe = RFE(estimator=estimatormdl, n_features_to_select=N, step=1)
    rfe.fit(X, y)
    return rfe.ranking_

# building the estimator module on the digits dataset
digits = load_digits()
datainp = digits.images.reshape((len(digits.images), -1))
data_labels = digits.target
Mdlest = SVC(kernel="linear", C=1)
# calling the function, choosing 5 features only, and reshaping the ranking as an image
Ranked = recursive_select(datainp, data_labels, Mdlest, 5).reshape(digits.images[0].shape)
plt.matshow(Ranked)
plt.show()

More examples on features selection in Python are available in Al-Jabery (2019).

9.5 Visualization Visualizing high-dimensional datasets in a meaningful and visually coherent way is challenging for data scientists. Chapter 8 provided a theoretical understanding of the different approaches available for data visualization. This section offers an overview of the utilization of some of the most convenient data visualization functions that are available in MATLAB and Python. To visualize high-dimensional data effectively, dimensionality reduction techniques can be used such as multidimensional scaling and principal component analysis (PCA). This section reviews these two techniques and the functions available to implement them.

9.5.1 Multidimensional scaling

The implementation of multidimensional scaling in MATLAB is performed by calculating the pairwise distances between all observations first and then performing the


multidimensional scaling. These two steps are performed in MATLAB using the pdist and cmdscale functions, respectively. The following code example illustrates the calculation of multidimensional scaling for an arbitrary dataset.

9.5.1.1 Pairwise distance calculation function pdist The pdist function in MATLAB calculates the Euclidean (by default) distance between pairs of observations in the input dataset. The syntax of this function is as shown below: >> D = pdist(dataset,Distance,distanceparameter)

where Distance specifies the distance method (default is ‘euclidean’). distanceparameter is only required when the Distance is either ‘seuclidean’, ‘minkowski’, or ‘mahalanobis’. The most common methods used are ‘euclidean’, ‘cityblock’, and ‘correlation’. D is the distance or dissimilarity vector containing the distance between each pair of observations (MathWorks, 2012; MATLAB, 2013). Finding the distance vector D is essential in finding the multidimensional scaling of the dataset. A very similar function with similar input parameters is available in scikit-learn too, which is pairwise_distances(). To import this function, use the following code line. from sklearn.metrics.pairwise import pairwise_distances

9.5.1.2 Perform multidimensional scaling In MATLAB, the multidimensional scaling is calculated using the cmdscale function and the dissimilarity vector D. The cmdscale function returns two matrices: the reduced dataset (X) and the eigenvalues (E) of the matrix X. As illustrated in the syntax for this function: >> [X,E] = cmdscale(D)

The eigenvalues E can be used to determine whether the reduced dataset X reasonably represents the original data points. If the first k eigenvalues are significantly larger than the rest, then the dataset is well approximated by the first k variables (i.e., columns) of X. The pareto function (discussed in the next section) can be used to show the eigenvalues of the reduced dataset. Furthermore, to reduce a dataset to a specific number of dimensions, the mdscale function can be used, as illustrated below:

>> Xd = mdscale(D, numDims)

In Python, scikit-learn provides the MDS class, which performs both metric and nonmetric multidimensional scaling (Kruskal, 1964). See Chapter 8 for a detailed discussion of this topic. The following code lines show how to import and use the MDS class in Python. CodeEx.35: Multidimensional scaling in Python.
from sklearn import manifold
from sklearn.metrics import euclidean_distances

seed = 42  # any fixed random seed for reproducibility
D = euclidean_distances(inpdata)  # inpdata: input dataset (observations x features)
mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, random_state=seed,
                   dissimilarity="precomputed", n_jobs=1)
pos = mds.fit(D).embedding_  # 2-D coordinates of the observations

9.5.2 Principal component analysis

As discussed in Chapter 8, PCA is a common method for dimensionality reduction. MATLAB and Python both have functions to calculate the PCA of data observations. For example, MATLAB has the pca function. Its syntax is illustrated below: >> [coeff,X,pcavar] = pca(data)

The output of interest throughout this section is X, which represents the reduced dataset. Python provides a similar function through the scikit-learn library. A demonstration example using this function is presented in Section 9.5.1.
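For reference, a minimal sketch of the equivalent step in Python using scikit-learn's PCA class is shown below (the dataset here is randomly generated purely for illustration):
import numpy as np
from sklearn.decomposition import PCA

data = np.random.rand(100, 10)          # hypothetical dataset (observations x features)
pca = PCA(n_components=3)               # keep the first three principal components
X = pca.fit_transform(data)             # reduced dataset, analogous to MATLAB's second output
print(pca.explained_variance_ratio_)    # fraction of variance explained by each component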

9.5.3 Visualization functions

This section mainly discusses the specialized data and cluster visualization functions such as gscatter, pareto, etc. Details and examples of the standard graphics functions in MATLAB, such as plot, mesh, plot3, etc., are available in many references. Table 9.5 lists the MATLAB visualization functions that have been used in this chapter for data analysis and cluster visualization. All the functions listed in Table 9.5 are demonstrated in code examples available in Al-Jabery (2019). Most of these functions were demonstrated in previous examples in this chapter. However, CodeEx.36 illustrates the dimensionality reduction techniques and some of the visualization functions. The resulting figures from the demonstration example are shown below in Fig. 9.26A-E.

Table 9.5 MATLAB visualization functions for data analysis and cluster presentation.
bar(Y): Generates a bar graph with one bar for each element in Y.
pareto(Y): Displays the values in vector Y as bars in descending order.
scatter(x,y): Creates a scatter graph showing the data points specified by x and y as circles.
scatter3(x,y,z): Creates a 3-D scatter graph.
gscatter(x,y,g): Creates a scatter graph in which x and y are grouped by g, where x and y are vectors having the same size as g. g is a categorical vector that represents the groups of observations. Data points in the same group share a color that differs from the other groups.
histogram(C): When C is categorical, plots a histogram with a bar for each category in C.
polarhistogram(C): Creates a histogram plot in polar coordinates by sorting the values in theta into equally spaced bins, specified in radians.
parallelcoords(X,‘Group’,g): Creates a graph in which each plotted line represents a group rather than every observation in the dataset X. g is a categorical variable.
heatmap(tbl,x,y): Creates a heatmap chart from the table tbl. The x input indicates the table variable to display along the x-axis; the y input indicates the table variable to display along the y-axis.
dendrogram(tree): Generates a dendrogram plot of the hierarchical binary cluster tree.
pie(C) & pie3(C): Draws a pie chart (2-D or 3-D) using the data in C. Each slice of the pie chart represents an element in C.
colormap(map): Views and sets the current color map.
summary(T): Prints a summary of table T or a categorical array.
silhouette(X,clust): Plots cluster silhouettes for the n-by-p data matrix X, with clusters defined by clust. Discussed in Section 9.6.

CodeEx.36: A demonstration example for some of the data visualization tools in MATLAB.
%% Import & initialize data
data = readtable('BreastTissue.csv');
data.Class = categorical(data.Class);
% Get numeric columns and normalize them
stats = data{:,3:end};
labels = data.Properties.VariableNames(3:end);
statsNorm = zscore(stats);
% Reconstruct coordinates
[pcs,scrs,~,~,pexp] = pca(statsNorm);
% Show the PCA importance
pareto(pexp);
title('PCA visualization');
%% Calculate multidimensional scaling
D = pdist(statsNorm);
[X,E] = cmdscale(D);   % classical MDS: configuration X and eigenvalues E
figure;
pareto(E);
title('Eigenvalues of cmdscale');
%% Group data using GMM
gmModel = fitgmdist(statsNorm,2,'Replicates',5,...
    'RegularizationValue',0.02);
[grp,~,gprob] = cluster(gmModel,statsNorm);
%% View data
% Visualize groups
figure;
scatter3(scrs(:,1),scrs(:,2),scrs(:,3),20,grp,...
    'filled','MarkerEdgeColor','k')
view(110,40)
xlabel('PCA1'); ylabel('PCA2'); zlabel('PCA3');
title('3-D visualization of the resulting clusters');
% Visualize group separation
gpsort = sortrows(gprob,1);
figure;
plot(gpsort,'LineWidth',4)
xlabel('Point Ranking'); ylabel('Cluster Membership Probability')
legend('Cluster 1','Cluster 2')
title('Membership distribution of data samples')
figure;
parallelcoords(stats,'Group',grp,'Quantile',0.25)
title('Cluster visualization using parallelcoords');
legend('Cluster 1','Cluster 2');


FIG. 9.26 (A) PCA components visualized using the pareto function. (B) Output from multidimensional scaling visualized using pareto function. (C) Cluster representation using parallel coordinates. (D) Visualizing Clusters using the top three most significant PCAs. (E) Membership distribution for all samples in the dataset.


Python provides similar visualization tools in its matplotlib library (Hunter & Dale, 2012), which was discussed earlier. A similar demonstration example in Python is available in Al-Jabery (2019). For detailed information on visualization tools in Python, see the matplotlib documentation.
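As an illustration only (a minimal sketch on randomly generated data, not the book's BreastTissue example), a comparable Python workflow projects the data with PCA, clusters it, and plots the groups with matplotlib:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

data = np.random.rand(100, 6)                              # hypothetical dataset
scores = PCA(n_components=2).fit_transform(data)           # project onto the first two principal components
grp = KMeans(n_clusters=2, n_init=10).fit_predict(data)    # cluster labels

plt.scatter(scores[:, 0], scores[:, 1], c=grp, edgecolors='k')
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.title('Cluster visualization in the PCA plane')
plt.show()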

9.6 Clusters and classification evaluation functions Chapter 7 illustrates several cluster evaluation metrics. This section covers the implementation of popular tools and functions for evaluating both clustering and classification results, supported by code examples where necessary.

9.6.1 Cluster evaluation

One of the challenges experienced in clustering is determining the optimum number of clusters. As shown in Section 9.4, the number of clusters has to be provided for each of the discussed clustering algorithms (hierarchical, kmeans, and the Gaussian mixture model). Therefore, it is necessary to have functions for determining the optimum number of clusters. This is achieved by evaluating the resulting groups' homogeneity, compactness, and dissimilarity from each other. For example, silhouette plots show the silhouette value of each observation, grouped by cluster. Clustering results in which most observations have high silhouette values (Rousseeuw, 1987) are preferred. To determine the silhouette value of each observation in a cluster and plot the silhouette figure, MATLAB provides the silhouette function, whose syntax is shown below. >> [SI,f] = silhouette(data,grp,metric)

The function returns the silhouette index for each observation in the (n x 1) vector SI and the silhouette plot as the figure handle f, where grp is an (n x 1) vector that contains the label of each observation in the data. metric is a character variable that specifies the distance metric used when calculating the silhouette index (e.g., ‘Euclidean’, ‘cityblock’, ‘Jaccard’, etc.). Fig. 9.27 shows the silhouette plot for the BreastTissue dataset (Bache & Lichman, 2013) clustered into two groups using Gaussian mixture model clustering. As discussed in Chapter 7, there are many validation indices other than the silhouette index for evaluating clustering schemes, such as Davies-Bouldin, Calinski-Harabasz, etc. (more details are available in Chapter 7). Therefore, MATLAB provides the evalclusters function, an efficient function that can use any of these validation indices in cluster evaluation. As shown in its syntax below, the evalclusters function can be used to determine the optimum number of clusters based on a specific validation index. >> eva = evalclusters(data,clustering,validationindex,'KList',k-range)

For example, to determine the optimum number of clusters for the kmeans algorithm, use the following code:
>> eva = evalclusters(data,'kmeans','CalinskiHarabasz','KList',[2:6])
>> eva.OptimalK
ans =
     2

FIG. 9.27 Silhouette function demonstration.


However, one validation index is not enough to decide whether a clustering scheme is favorable, because each validation index was designed to measure specific properties of clusters (compactness, dissimilarity, etc.; see Chapter 7 for more discussion). Therefore, the authors developed the following function as a demonstration of generating multiple evaluations for multiple clustering schemes. CodeEx.37: A general approach that summarizes the clustering evaluation process.
function val_indices = find_internal_index(lblin, datainp)
%% This function computes the internal validation indices CHI, SI, and DB
% for multiple clustering criteria.
% Inputs:  lblin: matrix of labels (n x k); each column represents one clustering criterion
%          datainp: input data
% Outputs: 3 x k array; each entry is a validation index for the corresponding clustering scheme
val_indices = [];
for i = 1:size(lblin,2)
    lbl = lblin(:,i);
    evdb = evalclusters(datainp,lbl,'DaviesBouldin');
    DBI = evdb.CriterionValues;
    evSI = evalclusters(datainp,lbl,'silhouette');
    SI = evSI.CriterionValues;
    evCH = evalclusters(datainp,lbl,'CalinskiHarabasz');
    CHI = evCH.CriterionValues;
    val_indices = [val_indices,[DBI;SI;CHI]];
end
end

Scikit-learn provides Python users with plenty of evaluation functions that are available under the sklearn.metrics module, which includes score functions, performance metrics and pairwise metrics, and distance computations. For example, the silhouette index is calculated using the silhouette_score() function. The general form of this function is shown below: sklearn.metrics.silhouette_score(X, labels, metric=’euclidean’, sample_size=None, random_state=None)

This function returns the mean silhouette coefficient over all samples. To obtain the value for each sample, use silhouette_samples. The Calinski-Harabasz index (Caliński & Harabasz, 1974) can be calculated using the calinski_harabasz_score function; the general form of the function is shown below. calinski_harabasz_score(X, labels)

where X contains the dataset variables and labels are the labels of all the samples. A complete list of all evaluation and validation indices is available in Fabisch et al. (2019). Python and additional MATLAB code examples on cluster evaluation are available in Al-Jabery (2019).
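As a minimal, self-contained sketch (random data and k-means labels used purely for illustration; function names follow recent scikit-learn releases), these metrics can be combined in the same spirit as the MATLAB function in CodeEx.37:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X = np.random.rand(200, 5)                                 # hypothetical dataset
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)    # example clustering scheme

print('Silhouette        :', silhouette_score(X, labels, metric='euclidean'))
print('Calinski-Harabasz :', calinski_harabasz_score(X, labels))
print('Davies-Bouldin    :', davies_bouldin_score(X, labels))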

9.6.2 Classification models evaluation

There are several evaluation metrics for classifiers such as accuracy, precision, recall, and the F-1 score (see Chapter 4). However, this section explores the available functions for determining the accuracy of classification models in MATLAB. The resubLoss, loss, and kfoldLoss functions have already been discussed in previous sections of this chapter. Therefore, this section focuses on the confusionmat function.

9.6.2.1 Confusion matrix: confusionmat This function finds the distribution of all the predicted responses and shows how they compare to their true classes (Fig. 9.28). The function returns the confusion matrix and the group names. The confusion matrix can be visualized using the heatmap function, as illustrated in Fig. 9.29.

FIG. 9.28 confusionmat demonstration.

FIG. 9.29 Confusion matrix visualization using the heatmap function.


The authors have also provided a general function that calculates the accuracy, precision, recall, and F-1 score for binary classifiers. See the MATLAB function below. CodeEx.38: Classification evaluation MATLAB function.
function [Tp,Tn,Fp,Fn,ef] = evaluate_test(tstop,target)
%% This function evaluates the results obtained from training or testing.
% tstop: the predicted response; target: the original response (labels in {-1, 1}).
[m,n] = size(target);
[m1,n1] = size(tstop);
Tn = 0; Tp = 0; Fn = 0; Fp = 0;
if m1 ~= m
    disp('Error in selected data: dimension mismatch');
    return;
else
    for i = 1:m
        switch tstop(i)
            case -1
                if target(i) == -1
                    Tn = Tn + 1;
                else
                    Fn = Fn + 1;
                end
            case 1
                if target(i) == 1
                    Tp = Tp + 1;
                else
                    Fp = Fp + 1;
                end
            otherwise
                disp('Unexpected class shown!');
                return;
        end
    end
end
ef = (Tp+Tn)/(Fp+Fn+Tp+Tn);  % overall accuracy
%% This part displays the results
disp(['Tp= (',num2str(Tp),') Tn= (',num2str(Tn),') Fp= (',...
    num2str(Fp),') Fn= (',num2str(Fn),')']);
Pr = Tp/(Tp+Fp);    % classifier precision
R = Tp/(Tp+Fn);     % recall
F1 = 2*Pr*R/(Pr+R); % F1 score measure
disp('===============Classifier Evaluation================');
disp(['Accuracy= (',num2str(ef),') Precision= (',num2str(Pr),') Recall= (',...
    num2str(R),') F1-score= (',num2str(F1),')']);
end

Python also has the same, if not more powerful, tools to perform all the previously discussed evaluations through scikit-learn. The confusion_matrix function calculates the confusion matrix. CodeEx.39 demonstrates the calculation and visualization of a confusion matrix in Python.

Chapter 9  Data analysis and machine learning tools in MATLAB and Python

287

CodeEx.39: Classification evaluation example in Python.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Import data variables and targets
X = pd.read_csv('inputdata.csv')
y = pd.read_csv('targets.csv').values.ravel()  # flatten the target column to a 1-D array

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Run classifier
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix')
print(cm)

plt.figure()
plot_confusion_matrix(cm)
plt.show()

Despite the simplicity of calculating the accuracy, precision, recall, and F1 score directly (as in CodeEx.38), scikit-learn has its own built-in functions for calculating them (e.g., accuracy_score, precision_score, recall_score, and f1_score in sklearn.metrics). These functions and more can be found in the scikit-learn documentation.
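A minimal, self-contained sketch of these scikit-learn functions is shown below (the labels are made up purely for illustration):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predicted labels

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1 score :', f1_score(y_true, y_pred))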

9.7 Summary This chapter presented the necessary tools for implementing most of the algorithms and approaches presented in the previous chapters. It explicitly illustrates the MATLAB functions and tools that are useful in data analysis, cleansing, clustering, and ML. This chapter can be used as a technical guide for beginner- and some intermediate-level programmers in MATLAB. The chapter contains examples and demonstrations that illustrate the functions and approaches discussed previously. It also provides glimpses into Python’s data analysis abilities by providing the required resources for the Python libraries and functions that are equivalent to the MATLAB functions provided.


This chapter describes the journey of data points from the initial raw data repository to the final stage of extracting intelligent inferences using ML algorithms. It describes the data loading process and the different functions for importing data in MATLAB and Python. It then presents the necessary data processing functions through coded examples and genuine code applied to biomedical datasets. It discusses the various clustering and classification functions available in MATLAB and Python, and then covers the necessary evaluation and validation metrics and their implementation. This chapter also uses coded examples to discuss the essential visualization functions and libraries in MATLAB and Python. The final section of this chapter is dedicated to the evaluation process for clustering and classification models. This chapter is accompanied by an online code repository that contains all the code written in the chapter plus additional examples and functions written in MATLAB and Python.

References

Al-Jabery, K. (2019). ACIL group / Computational_Learning_Approaches_to_Data_Analytics_in_Biomedical_Applications, GitLab. https://git.mst.edu/acil-group/Computational_Learning_Approaches_to_Data_Analytics_in_Biomedical_Applications.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. Irvine: University of California Irvine, School of Information and Computer Sciences.
Bagheri, M., Al-Jabery, K., Wunsch, D. C., & Burken, J. G. (2019). A deeper look at plant uptake of environmental contaminants using intelligent approaches. Science of The Total Environment, 651, 561-569. https://doi.org/10.1016/j.scitotenv.2018.09.048.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. ACM, 1-12. https://doi.org/10.1145/335191.335388.
Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1-27.
Cole, R., & Fanty, M. (n.d.). UCI machine learning repository: ISOLET data set. Retrieved from https://archive.ics.uci.edu/ml/datasets/isolet.
De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. AIP Conference Proceedings, 1644, 97-104. https://doi.org/10.1063/1.4907823.
Fabisch, A., Passos, A., Gollonet, A. S., Joly, A., Gorgolewski, C., Cournapeau, D., et al. (2019). API reference: Metrics. Retrieved from http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
Foundation, P. S. (2019). Python documentation. Retrieved from https://docs.python.org.
Hanke, M., Halchenko, Y. O., Sederberg, P. B., Hanson, S. J., Haxby, J. V., & Pollmann, S. (2009). PyMVPA: A Python toolbox for multivariate pattern analysis of fMRI data. Neuroinformatics, 7(1), 37-53. https://doi.org/10.1007/s12021-008-9041-y.
Hanselman, D., & Littlefield, B. (2011). Mastering MATLAB. The MATLAB Curriculum Series. xviii + 638.
Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. Proceedings of the International Joint Conference on Neural Networks, 1, 593-605. https://doi.org/10.1109/IJCNN.1989.118638.


Hunter, J., Dale, D., Firing, E., Droettboom, M., & the Matplotlib development team. (2012). Matplotlib release 1.2.0: User's guide.
Inmon, W., Zachman, J., & Geiger, J. (1997). Data stores, data warehousing and the Zachman framework: Managing enterprise knowledge.
Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). UCI machine learning repository: Heart disease data set.
Julia Evans. (n.d.). Pandas cookbook.
Kohonen, T. (2014). MATLAB implementations and applications of the self-organizing map. Retrieved from http://docs.unigrafia.fi/publications/kohonen_teuvo/.
Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29(2), 115-129. https://doi.org/10.1007/BF02289694.
Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). Isolation forest. In Proceedings - IEEE International Conference on Data Mining, ICDM (pp. 413-422). https://doi.org/10.1109/ICDM.2008.17.
Margolis, R., Derr, L., Dunn, M., Huerta, M., Larkin, J., Sheehan, J., et al. (2014). The National Institutes of Health's big data to knowledge (BD2K) initiative: Capitalizing on biomedical big data. Journal of the American Medical Informatics Association, 21(6), 957-958. https://doi.org/10.1136/amiajnl-2014-002974.
MathWorks. (2012). Matlab documentation. https://doi.org/10.1201/9781420034950.
MATLAB. (2013). Machine learning with MATLAB.
MATLAB. (2016). Version 9.0 (R2016a). Natick, Massachusetts: The MathWorks Inc.
MATLAB. (2017). Version 9.2 (R2017a). Natick, Massachusetts: The MathWorks Inc.
McKinney, W., & Team, P. D. (2015). Pandas - powerful Python data analysis toolkit.
Obafemi-Ajayi, T., Al-Jabery, K., Salminen, L., Laidlaw, D., Cabeen, R., Wunsch, D., et al. (2017). Neuroimaging biomarkers of cognitive decline in healthy older adults via unified learning. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-9). https://doi.org/10.1109/SSCI.2017.8280937.
Ong, S. P., Richards, W. D., Jain, A., Hautier, G., Kocher, M., Cholia, S., & Ceder, G. (2013). Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68, 314-319.
Open-Source. (2016). Python data analysis library.
Pandas. (2015). Python data analysis library - pandas. Python Data Analysis Library.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2012). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. https://doi.org/10.1016/0377-0427(87)90125-7.
Rousseeuw, P., & Driessen, K. (1999). A fast algorithm for the minimum covariance. Technometrics, 41(3), 212-223.
Shanila, P. (2018). Python file I/O - Python write to file and read file - DataFlair.
Sharma, G., & Martin, J. (2009). MATLAB: A language for parallel computing. International Journal of Parallel Programming, 37. https://doi.org/10.1007/s10766-008-0082-5.
Somers, D. A. (2008). Welcome to the plant genome. The Plant Genome Journal, 1(1), 1. https://doi.org/10.3835/plantgenome2008.06.0007ed.


Sopka, J. J. (1979). Introductory functional analysis with applications (Erwin Kreyszig). SIAM Review. https://doi.org/10.1137/1021075.
Tsanas, A., Little, M. A., McSharry, P. E., & Ramig, L. O. (2010). Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering, 57(4), 884-893. https://doi.org/10.1109/TBME.2009.2036000.
Shinmoto Torres, R. L., Ranasinghe, D. C., Shi, Q., & Sample, A. P. (2013). Sensor enabled wearable RFID technology for mitigating the risk of falls near beds. In 2013 IEEE International Conference on RFID, RFID 2013 (pp. 191-198). https://doi.org/10.1109/RFID.2013.6548154.
