Deep Learning for Data Analytics: Foundations, Biomedical Applications, and Challenges [1 ed.] 0128197641, 9780128197646

Deep learning, a branch of Artificial Intelligence and machine learning, has led to new approaches to solving problems i


English Pages 218 [212] Year 2020



Table of contents :
Deep Learning for Data Analytics
Copyright
Contents
List of contributors
Preface
1 Short and noisy electrocardiogram classification based on deep learning
1.1 Introduction
1.2 Basic concepts
1.2.1 Cardiac cycle
1.2.2 Electrocardiogram
1.2.3 The QRS wave
1.3 Theory related to electrocardiogram analysis
1.3.1 Discrete wavelets transform
1.3.2 Continuous wavelet transform
1.3.3 Convolutional neural network
1.3.3.1 Convolutional layer
1.3.3.2 Pooling layer
1.3.3.3 Fully connected layer
1.3.4 Database
1.4 Methodology
1.4.1 Preprocessing
1.4.2 Classification based on deep learning
1.4.3 Decision fusion
1.4.4 Training the convolutional neural network model
1.4.5 Performance parameter
1.5 Results and discussion
1.6 Conclusion
References
2 Single-layer convolution neural network for cardiac disease classification using electrocardiogram signals
2.1 Introduction
2.2 Related works
2.3 Methodology
2.3.1 Convolutional neural network
2.3.2 Network architecture
2.4 Experimental result and analysis
2.4.1 Data set description
2.4.1.1 Arrhythmia
2.4.1.2 Myocardial infarction
2.4.2 Arrhythmia disease classification using proposed convolutional neural network
2.4.3 Arrhythmia classification using support vector machine
2.4.4 Myocardial infarction disease classification using the proposed convolutional neural network
2.4.4.1 Comparison of the proposed work against the literature
2.5 Conclusion
References
3 Generalization performance of deep autoencoder kernels for identification of abnormalities on electrocardiograms
3.1 Introduction
3.2 Autoencoder
3.3 Deep autoencoder
3.3.1 Extreme learning machine autoencoder
3.3.2 Deep extreme learning machine autoencoder
3.4 Deep analysis of coronary artery disease
3.5 Conclusion
References
4 Deep learning for early diagnosis of Alzheimer’s disease: a contribution and a brief review
4.1 Introduction
4.2 Literature review
4.2.1 Alzheimer’s disease binary classification
4.2.2 Alzheimer’s disease binary classification using a deep learning approach
4.3 Methods
4.3.1 Data acquisition and preprocessing
4.3.2 Convolutional neural network training and feature extraction
4.3.3 Training and classification with other algorithms
4.4 Experiments and results
4.4.1 Experimental settings
4.4.2 Classification results
4.5 Conclusion
Acknowledgment
References
5 Musculoskeletal radiographs classification using deep learning
5.1 Introduction
5.2 Related works
5.3 Data set description and challenges
5.3.1 Description of the data set
5.3.2 Challenges faced
5.4 Proposed methodologies
5.4.1 Data preprocessing
5.4.2 Inception
5.4.3 Xception
5.4.4 VGG-19
5.4.5 DenseNet
5.4.6 MobileNet
5.5 Statistical indicators
5.6 Experimental results and discussions
5.6.1 Finger radiographic image classification
5.6.2 Wrist radiographic image classification
5.6.3 Shoulder radiographic image classification
5.7 Conclusion
References
6 Deep-wavelet neural networks for breast cancer early diagnosis using mammary termographies
6.1 Introduction
6.2 Related works
6.3 Breast thermography
6.3.1 Breast thermographic images acquisition protocol
6.3.1.1 Room preparation
6.3.1.2 Patient preparation
6.3.1.3 Images acquisition
6.4 Deep-wavelet neural network
6.4.1 Filter bank
6.4.2 Downsampling
6.4.3 Synthesis block
6.5 Classification
6.5.1 Experimental results and discussion
6.5.1.1 Lesion detection
6.5.1.2 Lesion classification
6.6 Conclusion
Acknowledgments
References
7 Deep learning on information retrieval and its applications
7.1 Introduction
7.2 Traditional approaches to information retrieval
7.2.1 Basic retrieval models
7.2.2 Semantic-based models
7.2.3 Term dependency-based models
7.2.4 Learning to rank–based models
7.3 Deep learning approaches to IR
7.3.1 Representation learning-based methods
7.3.1.1 Deep neural network–based methods
7.3.1.2 Convolutional neural network–based methods
7.3.1.3 Recurrent neural network–based methods
7.3.2 Methods of matching function learning
7.3.2.1 Matching with word-level similarity matrix
7.3.2.2 Matching with attention model
7.3.2.3 Matching with transformer model
7.3.2.4 Combining matching function learning and representation learning
7.3.3 Methods of relevance learning
7.3.3.1 Based on global distribution of matching strengths
7.3.3.2 Based on local context of matched terms
7.4 Discussions and analyses
7.5 Conclusions and Future Work
References
Further reading
8 Electrical impedance tomography image reconstruction based on autoencoders and extreme learning machines
8.1 Introduction
8.2 Related works
8.3 Materials and methods
8.3.1 Electrical impedance tomography problems and reconstruction
8.3.2 EIT image reconstruction techniques
8.3.3 Autoencoders
8.3.4 Extreme learning machines
8.3.5 Proposed reconstruction method
8.3.6 Proposed experiments
8.4 Results and discussions
8.5 Conclusion
Acknowledgments
References
9 Crop disease classification using deep learning approach: an overview and a case study
9.1 Introduction
9.1.1 Literature survey
9.2 Overview of the convolutional neural network architectures
9.3 Architecture of SqueezeNet
9.4 Implementation
9.5 Results and discussion
9.6 Conclusion
Acknowledgment
References
Index


DEEP LEARNING FOR DATA ANALYTICS Foundations, Biomedical Applications, and Challenges Edited by

HIMANSU DAS KIIT Deemed to be University, Bhubaneswar, India

CHITTARANJAN PRADHAN KIIT Deemed to be University, Bhubaneswar, India

NILANJAN DEY Techno International New Town (Formerly known as Techno India College of Technology), Kolkata, India

Academic Press is an imprint of Elsevier 125 London Wall, London EC2Y 5AS, United Kingdom 525 B Street, Suite 1650, San Diego, CA 92101, United States 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom Copyright © 2020 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. 
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress ISBN: 978-0-12-819764-6 For Information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner Acquisitions Editor: Chris Katsaropoulos Editorial Project Manager: Emma Hayes Production Project Manager: Sruthi Satheesh Cover Designer: Mark Rogers Typeset by MPS Limited, Chennai, India

Contents List of contributors Preface

1. Short and noisy electrocardiogram classification based on deep learning Sinam Ajitkumar Singh and Swanirbhar Majumder 1.1 Introduction 1.2 Basic concepts 1.3 Theory related to electrocardiogram analysis 1.4 Methodology 1.5 Results and discussion 1.6 Conclusion References

2. Single-layer convolution neural network for cardiac disease classification using electrocardiogram signals P. Gopika, C.S. Krishnendu, M. Hari Chandana, S. Ananthakrishnan, V. Sowmya, E.A. Gopalakrishnan and K.P. Soman 2.1 Introduction 2.2 Related works 2.3 Methodology 2.4 Experimental result and analysis 2.5 Conclusion References

3. Generalization performance of deep autoencoder kernels for identification of abnormalities on electrocardiograms Gokhan Altan and Yakup Kutlu 3.1 Introduction 3.2 Autoencoder 3.3 Deep autoencoder 3.4 Deep analysis of coronary artery disease 3.5 Conclusion References


4. Deep learning for early diagnosis of Alzheimer’s disease: a contribution and a brief review

Iago Richard Rodrigues da Silva, Gabriela dos Santos Lucas e Silva, Rodrigo Gomes de Souza, Maíra Araújo de Santana, Washington Wagner Azevedo da Silva, Manoel Eusébio de Lima, Ricardo Emmanuel de Souza, Roberta Fagundes and Wellington Pinheiro dos Santos 4.1 Introduction 4.2 Literature review 4.3 Methods 4.4 Experiments and results 4.5 Conclusion Acknowledgment References

5. Musculoskeletal radiographs classification using deep learning N. Harini, B. Ramji, S. Sriram, V. Sowmya and K.P. Soman 5.1 Introduction 5.2 Related works 5.3 Data set description and challenges 5.4 Proposed methodologies 5.5 Statistical indicators 5.6 Experimental results and discussions 5.7 Conclusion References

6. Deep-wavelet neural networks for breast cancer early diagnosis using mammary termographies Valter Augusto de Freitas Barbosa, Maíra Araújo de Santana, Maria Karoline S. Andrade, Rita de Cássia Fernandes de Lima and Wellington Pinheiro dos Santos 6.1 Introduction 6.2 Related works 6.3 Breast thermography 6.4 Deep-wavelet neural network 6.5 Classification 6.6 Conclusion Acknowledgments References

7. Deep learning on information retrieval and its applications Runjie Zhu, Xinhui Tu and Jimmy Xiangji Huang 7.1 Introduction 7.2 Traditional approaches to information retrieval 7.3 Deep learning approaches to IR


7.4 Discussions and analyses 7.5 Conclusions and Future Work References Further reading

8. Electrical impedance tomography image reconstruction based on autoencoders and extreme learning machines


Juliana Carneiro Gomes, Jessiane Mônica S. Pereira, Maíra Araújo de Santana, Washington Wagner Azevedo da Silva, Ricardo Emmanuel de Souza and Wellington Pinheiro dos Santos 8.1 Introduction 8.2 Related works 8.3 Materials and methods 8.4 Results and discussions 8.5 Conclusion Acknowledgments References


9. Crop disease classification using deep learning approach: an overview and a case study

Krishnaswamy Rangarajan Aravind, Prabhakar Maheswari, Purushothaman Raja and Cezary Szczepański 9.1 Introduction 9.2 Overview of the convolutional neural network architectures 9.3 Architecture of SqueezeNet 9.4 Implementation 9.5 Results and discussion 9.6 Conclusion Acknowledgment References Index

List of contributors Gokhan Altan Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey S. Ananthakrishnan Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India Maria Karoline S. Andrade Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Iago Richard Rodrigues da Silva Polytechnic School of Pernambuco, University of Pernambuco, UPE, Recife, Brazil Washington Wagner Azevedo da Silva Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Valter Augusto de Freitas Barbosa Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Manoel Eusébio de Lima Center for Informatics, Federal University of Pernambuco, UFPE, Recife, Brazil Rita de Cássia Fernandes de Lima Department of Mechanical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Maíra Araújo de Santana Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Ricardo Emmanuel de Souza Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Rodrigo Gomes de Souza Center for Informatics, Federal University of Pernambuco, UFPE, Recife, Brazil Wellington Pinheiro dos Santos Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Roberta Fagundes Polytechnic School of Pernambuco, University of Pernambuco, UPE, Recife, Brazil Juliana Carneiro Gomes Polytechnic School of Pernambuco, University of Pernambuco, UPE, Recife, Brazil E.A. Gopalakrishnan Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India P. Gopika Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India


M. Hari Chandana Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India N. Harini Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India C.S. Krishnendu Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India Yakup Kutlu Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey Prabhakar Maheswari School of Mechanical Engineering, SASTRA Deemed University, Thanjavur, India Swanirbhar Majumder Department of IT, Tripura University, Agartala, India Jessiane Mônica S. Pereira Polytechnic School of Pernambuco, University of Pernambuco, UPE, Recife, Brazil Purushothaman Raja School of Mechanical Engineering, SASTRA Deemed University, Thanjavur, India B. Ramji Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India Krishnaswamy Rangarajan Aravind School of Mechanical Engineering, SASTRA Deemed University, Thanjavur, India Gabriela dos Santos Lucas e Silva Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil Sinam Ajitkumar Singh Department of ECE, NERIST, Nirjuli, India K.P. Soman Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India V. Sowmya Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India S. Sriram Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India Cezary Szczepański Łukasiewicz Research Network, Institute of Aviation, Warsaw, Poland

Xinhui Tu School of Computer Science, Central China Normal University, Wuhan, China Jimmy Xiangji Huang School of Information Technology, York University, Toronto, Canada Runjie Zhu Department of Computer Science & Engineering, York University, Toronto, Canada


Preface
In recent years, deep learning methods have advanced rapidly, both in their technological development and in their practical application across diverse research areas. Deep learning makes extensive use of the computing power of GPUs to deliver better performance. It encompasses computational models composed of several processing layers that learn representations of data at multiple levels of abstraction. Recent advances in learning algorithms for deep architectures have made these models more flexible and better performing. Deep learning has an edge in contemporary use because of the diversity of its applications, which include domain-specific tasks such as medical image recognition, biomedicine, and object tracking, where the data contain useful information about the problem. Companies analyze these large volumes of data for their business prospects and use them in decision-making processes that affect existing and future technologies. Deep learning algorithms extract high-level, complex abstractions and express them in terms of the relatively simpler ideas formulated at the preceding level of the hierarchy. The main advantage of deep learning is that it allows users to analyze and learn from large volumes of data, which makes it a valuable tool for data science, where raw data are largely unlabeled and uncategorized. Its use in data science and engineering applications is in particularly high demand. Data science deals with data mining and data analytics approaches for supervised and unsupervised data sets in different applications, and is mostly used for prediction, data analysis, and data visualization. Similarly, almost all engineering fields, including biomedical engineering, aerospace engineering, thermal engineering, and communication engineering, use deep learning approaches for analysis and visualization.
The interest of the research community in deep learning methods is growing fast, and many architectures have been proposed in recent years to address several problems, often with outstanding performance. This book focuses on advanced models, architectures, and algorithms in deep learning for data science and engineering applications. It is expected that the development of deep learning theories and applications will further influence the field of data science and its application domains. In Chapter 1, the authors propose a modified preprocessing step and a unique classification technique based on deep learning for the electrocardiogram signal. Chapter 2 aims to reduce the computational complexity of the deep learning architecture for cardiac disease classification by using feature-extracted data. Chapter 3 provides an overview of the training time and generalization capability of deep learning algorithms, including deep belief networks and extreme learning machine autoencoder kernels. In Chapter 4, the authors present a brief review of the diagnosis of Alzheimer’s disease based on magnetic resonance imaging analysis using deep learning. They also present a deep architecture for early diagnosis of Alzheimer’s disease based on implicit feature extraction for the classification of magnetic resonance images. This model aims to distinguish Alzheimer’s disease patients from a group of patients without the disease. In Chapter 5, the effectiveness of various CNN-based pre-trained models for detecting abnormalities in radiographic images is evaluated experimentally, and their performances are compared using standard statistical measures. In Chapter 6, the authors introduce the deep-wavelet neural network (DWNN) as a feature extraction method for image representation. DWNN is a deep learning tool based on the Mallat algorithm for wavelet decomposition at multiple levels. The authors apply the DWNN to the task of breast lesion detection and classification in thermographic images. In Chapter 7, a novel way of classifying the existing information retrieval models is introduced, along with their recent improvements and developments. The approach is the first to classify the existing work according to how the models generate their features and ranking functions. In Chapter 8, the authors propose a new approach: the use of autoencoders, deep neural networks with unsupervised training, as an intelligent filter that denoises the electrical potential data. In Chapter 9, the authors explore architectures for evaluating the accuracy of disease classification using the PlantVillage dataset. An overview of the deep learning architecture and its basic building layers is also briefly discussed. SqueezeNet gave the best classification accuracy, 98.49%, with original color images.

Himansu Das¹, Chittaranjan Pradhan¹ and Nilanjan Dey²
¹ KIIT Deemed to be University, Bhubaneswar, India
² Techno International New Town (Formerly known as Techno India College of Technology), Kolkata, India

CHAPTER ONE

Short and noisy electrocardiogram classification based on deep learning
Sinam Ajitkumar Singh¹ and Swanirbhar Majumder²
¹ Department of ECE, NERIST, Nirjuli, India
² Department of IT, Tripura University, Agartala, India

1.1 Introduction
Cardiac abnormalities are signs of a disorder of the heart. They include arrhythmias, coronary artery disease, mitral valve prolapse, and congenital heart disease. The study of heart characteristics is one of the necessary steps in assessing the cardiovascular system. The electrocardiogram (ECG) and the phonocardiogram (PCG) are two established recordings for this purpose: the former arises from the electrical activity of the heart, while the latter arises from the heart sounds. In the available literature, several researchers have employed different approaches that help to evaluate the morphological characteristics of ECG signals; they identify various cardiovascular abnormalities by analyzing the ECG [1–4]. These approaches include support vector machines (SVM) [5], the multilayer perceptron (MLP) [6], learning vector quantization (LVQ) [7], higher-order statistics [8], and K nearest neighbors (KNN) [9]. Arrhythmia is the most common of the diseases associated with heart abnormalities. As a result, most of the literature has dealt with arrhythmia classification [1,2,5,7–9]. The most efficient approach for predicting arrhythmia is the exploration of ECG signals [10]. The study of specific characteristics of an ECG recording, such as beats and morphological and statistical features, gives meaningfully correlated clinical data that further help in predicting the ECG pattern. Automated ECG classification is a complicated task because the features associated with morphological and temporal characteristics differ between subjects and conditions. Diagnosing cardiovascular abnormalities from an ECG recording has a definite drawback: the ECG signal varies from person to person, and an abnormal ECG signal can show different morphological characteristics for related disorders. Conversely, two distinct disorders may show similar characteristics in an ECG signal. Both effects complicate the diagnosis of heart abnormalities

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00002-8

© 2020 Elsevier Inc. All rights reserved.


using the ECG signal [10–12]. Anomalies of the heartbeat can be detected only after careful analysis of the ECG signal. Consequently, ECG analysis, which is mainly carried out with wearable healthcare devices and bedside monitoring, takes a long time and involves a complicated procedure. Many researchers have applied the wavelet transform for preprocessing and feature extraction. Yu and Chen [13] extracted statistical features by decomposing the recorded ECG signal with the discrete wavelet transform (DWT) and classified six classes using a probabilistic neural network (PNN). Thomas et al. [14] used the dual-tree complex wavelet transform (DTCWT) to extract features from the QRS complex and classified them with a multilayer back-propagation artificial neural network. Kumari and Devi [15] obtained a morphological feature vector from wavelet coefficients and independent component analysis (ICA), followed by arrhythmia classification with an SVM classifier. Some researchers have used single-lead ECG recordings to detect sleep apnea based on the PhysioNet 2000 challenge. Using the database provided, some of them employed a decision fusion method that helps to achieve high classification performance. For instance, Li et al. [16] applied decision fusion by combining two binary classifiers [SVM and artificial neural network (ANN)] and achieved a classification accuracy of 85% on per-minute segments and 100% on a per-recording basis for detecting obstructive sleep apnea (OSA) from a single-lead ECG recording. The authors in Ref. [17] compared a CNN-based deep learning model with a decision fusion model for OSA detection and reported that the CNN model performed better than the decision fusion model.
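The decision fusion idea described above, in which two binary classifiers score the same ECG segments and their outputs are combined, can be sketched as follows. This is only an illustration under stated assumptions: the weighted-average rule, the 0.5 threshold, the majority vote per recording, and the names `fuse_decisions` and `per_recording_label` are all hypothetical, and the text does not say which fusion rule Li et al. [16] actually used.

```python
def fuse_decisions(prob_a, prob_b, weight_a=0.5, threshold=0.5):
    """Fuse per-segment abnormality probabilities from two binary classifiers.

    Weighted averaging is an assumed fusion rule chosen for illustration only.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("both classifiers must score the same segments")
    fused = [weight_a * pa + (1.0 - weight_a) * pb
             for pa, pb in zip(prob_a, prob_b)]
    labels = [1 if p >= threshold else 0 for p in fused]
    return fused, labels


def per_recording_label(segment_labels):
    """Strict majority vote over segment labels gives one label per recording."""
    return 1 if sum(segment_labels) * 2 > len(segment_labels) else 0


# Hypothetical scores for four one-minute segments of a single recording.
svm_scores = [0.9, 0.2, 0.6, 0.4]
ann_scores = [0.7, 0.1, 0.3, 0.8]
fused, labels = fuse_decisions(svm_scores, ann_scores)
recording = per_recording_label(labels)
```

The two functions mirror the per-minute and per-recording evaluation levels mentioned in the text: fusion happens at segment level, and the recording label is derived from the fused segment labels.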
Building on the above approaches, a deep learning algorithm based on transfer learning is employed here to predict heart abnormalities from an ECG recording. This chapter thus proposes an alternative approach for classifying heart abnormalities from an ECG recording, based on a deep learning algorithm, that helps to improve the performance of the model. Preprocessing is implemented by applying a wavelet transform followed by a noise removal algorithm. The filtered ECG signal is then further processed with a continuous wavelet transform (CWT), which converts the data into 2D scalogram images. The scalogram images are used to train different convolutional neural networks (CNNs) for classification. The basic ideas related to the cardiovascular system and the ECG signal are discussed in Section 1.2. Section 1.3 covers the approaches associated with heart abnormality analysis and gives details of the database. The proposed deep learning method is discussed in Section 1.4, followed by results and discussion in Section 1.5.


1.2 Basic concepts
1.2.1 Cardiac cycle
The heart is a muscular organ located in the thoracic cavity. It has four chambers and serves to deliver oxygen to the tissues. The left ventricle has a distinctly thick muscular wall because it pumps blood around the body through the circulation system, whereas the right ventricle has a thinner muscular wall because it pumps blood to the lungs. The cardiac cycle is split into four periods: two relaxation phases and two contraction phases. During the ventricular filling phase, oxygenated blood moves through the left atrium, past the mitral valve, and into the left ventricle. Atrial contraction (recorded as the P wave) starts with the firing of the SA node and completes the filling of the ventricles. At the start of ventricular contraction, pressure builds up in the ventricles. This second cardiac phase is known as the isovolumetric contraction period. Blood is discharged from the heart only when the ventricular pressure exceeds the aortic pressure. During the third phase (the ventricular ejection period), blood flows through the semilunar valves until the pressure in the arteries exceeds the pressure in the contracting ventricles. The QRS complex represents the electrical activity for both the isovolumetric contraction period and the ventricular ejection period. The final phase, the isovolumetric relaxation period, is the resting phase during ventricular repolarization. In the ECG signal, the isovolumetric relaxation period is represented by the T wave.

1.2.2 Electrocardiogram
An ECG records the electrical impulses generated by cardiac activity. The ECG signal carries valuable information that allows the cardiologist to make a comprehensive assessment of the subject, and it is the conventional tool for diagnosing cardiac abnormalities. The cardiologist obtains the ECG recording by placing electrodes on the patient’s body. The most popular tool for recording the ECG signal is the Holter machine: a cardiologist fits a Holter device to a cardiac patient who requires continuous monitoring, in order to detect abnormal heartbeats over the course of a day. The heart rate may be easily obtained from the recorded waves. The sinoatrial (SA) node is a collection of cells found in the right atrium. It generates the electrical impulses that regulate the flow of blood in the body.
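The remark above that the beat rate can be computed from the recorded waves can be made concrete with a small calculation: if the times of successive R peaks are known, the mean heart rate is 60 divided by the mean R-R interval in seconds. The sketch below assumes the R-peak times have already been detected; the function name `heart_rate_bpm` and the example times are hypothetical.

```python
def heart_rate_bpm(r_peak_times_s):
    """Mean heart rate in beats per minute from R-peak times given in seconds."""
    if len(r_peak_times_s) < 2:
        raise ValueError("need at least two R peaks")
    # R-R intervals are the gaps between consecutive R peaks.
    rr = [t1 - t0 for t0, t1 in zip(r_peak_times_s, r_peak_times_s[1:])]
    mean_rr = sum(rr) / len(rr)
    return 60.0 / mean_rr


# Hypothetical R-peak times: one beat every 0.8 s.
peaks = [0.0, 0.8, 1.6, 2.4]
rate = heart_rate_bpm(peaks)
```

With peaks spaced 0.8 s apart, the function returns a rate of about 75 beats per minute.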

1.2.3 The QRS wave
Because the PhysioNet 2016 challenge provides only limited information about how the ECG database was collected, this chapter centers on characterizing features based on the QRS complex. The QRS complex indicates the ventricular depolarization [18] of the heart. The shape of the QRS wave is determined by the sequence of ventricular contractions. Ventricular contraction can be interpreted in two steps: contraction of the septum followed by contraction of the ventricular wall. Because the Purkinje fibers lie just underneath the endocardium, activation then spreads outward to the epicardium. Ventricular contraction occurs first at the septum. Normal septal activation proceeds from the left side to the right, which generates small septal R waves in lead V1 and Q waves in lead V6. Contraction then develops simultaneously in the right and left ventricular walls.

1.3 Theory related to electrocardiogram analysis
1.3.1 Discrete wavelets transform
The discrete wavelet transform (DWT) is a commonly used tool for signal processing. Its benefit is that it gives good time resolution at high frequencies and good frequency resolution at low frequencies. Because of its excellent localization ability, the DWT can be used to determine characteristics in the time-frequency domain. It generates multidomain features by decomposing the ECG signal into different levels using the mother wavelet function. The wavelet transform is a powerful approach for expressing the time and frequency characteristics of a given signal [18], and it provides a two-dimensional representation of the corresponding time-frequency features. Fig. 1.1 shows the level-four decomposition of a signal using a wavelet transform. The wavelet coefficients are computed by applying a series of high-pass and low-pass filters. Passing the signal through a high-pass filter yields the detail coefficients, and passing it through a low-pass filter yields the approximation coefficients. The approximation and detail coefficients are generated as follows:

\[ a[n] = \sum_{p=-\infty}^{\infty} x[p]\, h[2n - p] \tag{1.1} \]

\[ d[n] = \sum_{p=-\infty}^{\infty} x[p]\, g[2n - p] \tag{1.2} \]

where a[n] and d[n] denote the approximation and detail coefficients, respectively, of the signal x[n]. Here, h[n] and g[n] denote the impulse responses of the low-pass filter and high-pass filter, respectively, while fs represents the sampling frequency of the signal.
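As a minimal sketch of Eqs. (1.1) and (1.2), the code below filters a signal with a low-pass and a high-pass filter and keeps every second output sample, then repeats the split on the approximation to obtain a multilevel decomposition like the level-four one in Fig. 1.1. Several things here are assumptions made for illustration: the Haar filter pair, zero padding outside the signal, and the function names `dwt_level` and `wavedec`. The chapter does not state which mother wavelet or boundary handling the authors used.

```python
import math

# Haar analysis filters (an assumed choice of mother wavelet).
HAAR_LOW = [1 / math.sqrt(2), 1 / math.sqrt(2)]    # h[n], low-pass
HAAR_HIGH = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # g[n], high-pass


def dwt_level(x, h, g):
    """One DWT level: a[n] = sum_p x[p] h[2n - p], d[n] = sum_p x[p] g[2n - p].

    Samples outside the signal are treated as zero; h and g must have
    the same length.
    """
    n_out = (len(x) + len(h) - 2) // 2 + 1
    approx, detail = [], []
    for n in range(n_out):
        a = d = 0.0
        for k in range(len(h)):      # k plays the role of 2n - p
            p = 2 * n - k
            if 0 <= p < len(x):
                a += x[p] * h[k]
                d += x[p] * g[k]
        approx.append(a)
        detail.append(d)
    return approx, detail


def wavedec(x, h, g, levels):
    """Repeat the split on the approximation; returns [a_L, d_L, ..., d_1]."""
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = dwt_level(approx, h, g)
        details.append(detail)
    return [approx] + details[::-1]


# A constant signal has no detail away from the zero-padded boundaries.
flat = [1.0] * 16
a1, d1 = dwt_level(flat, HAAR_LOW, HAAR_HIGH)
coeffs = wavedec(flat, HAAR_LOW, HAAR_HIGH, levels=4)
```

Because of the zero padding, the first and last coefficients of each level carry boundary effects; interior detail coefficients of a constant signal are zero, which is a quick sanity check on the high-pass branch.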

Short and noisy electrocardiogram classification based on deep learning

Figure 1.1 The level four decomposition of the signal using a wavelet transform.

1.3.2 Continuous wavelet transform
The CWT gives a full description of the signal through dilation and translation of a mother wavelet. The CWT has an advantage over the DWT when analyzing a nonstationary signal, since the CWT can employ all the scales while estimating a coefficient. Ergen et al. [19] carried out a comparative analysis of the PCG signal based on different wavelets. They concluded that PCG analysis using a Morlet wavelet provides effective outcomes for examining the time-frequency representation. The CWT decomposition proceeds by scaling and translating the mother wavelet function over the signal. The CWT yields a high-resolution time-frequency representation [20]. The CWT can be classified into many types according to the properties of the wavelet, such as complex, noncomplex, biorthogonal, and orthogonal. The CWT can be expressed as

$$T_x(e, f) = \frac{1}{\sqrt{e}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - f}{e}\right) dt \tag{1.3}$$


Sinam Ajitkumar Singh and Swanirbhar Majumder

where $\psi(t)$ represents a mother wavelet, and e and f denote the scale and shift parameters respectively. In literature, many researchers have used the Morlet wavelet function as an effective means for analyzing the time-frequency characteristics of any nonstationary signal [21–23]. The equation representing the Morlet wavelet function has been illustrated below:

$$\psi(t) = \frac{1}{\sqrt{c\pi}}\, e^{-(t/c)^2}\, e^{j 2\pi f_c t} \tag{1.4}$$

where the center frequency of the wavelet has been denoted by $f_c$.
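The two expressions above can be sketched numerically. The code below evaluates the Morlet wavelet of Eq. (1.4) and approximates the integral of Eq. (1.3) by a Riemann sum over the sampled signal. The function names, the default values c = 1 and fc = 1, and the choice of evaluating one shift per sample are assumptions made for illustration; the integral is followed as written in Eq. (1.3), whereas many references apply a complex conjugate to the wavelet.

```python
import numpy as np


def morlet(t, c=1.0, fc=1.0):
    """Morlet wavelet of Eq. (1.4); c and fc defaults are illustrative."""
    return np.exp(-(t / c) ** 2) * np.exp(1j * 2 * np.pi * fc * t) / np.sqrt(c * np.pi)


def cwt(x, fs, scales, c=1.0, fc=1.0):
    """Riemann-sum approximation of Eq. (1.3) for each scale e and shift f."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x)) / fs      # sample times in seconds
    shifts = t                      # assumption: one shift per sample time
    coeffs = np.empty((len(scales), len(shifts)), dtype=complex)
    for i, e in enumerate(scales):
        # Rows index the shift f, columns index the integration variable t.
        arg = (t[None, :] - shifts[:, None]) / e
        # The sum over t times dt = 1/fs approximates the integral.
        coeffs[i] = (morlet(arg, c, fc) @ x) / (fs * np.sqrt(e))
    return coeffs


def scalogram(x, fs, scales):
    """Magnitude of the CWT coefficients, the quantity shown as an image."""
    return np.abs(cwt(x, fs, scales))
```

With fc = 1, a scale e responds most strongly to signal content near 1/e Hz, so a 10 Hz sinusoid should produce its largest magnitude at e = 0.1.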

1.3.3 Convolutional neural network
A convolutional neural network (CNN) is built from requisite layers such as convolutional layers and pooling layers. A CNN can be formed without restriction on the order or the number of convolutional and pooling layers [24]. A typical CNN is shown in Fig. 1.2.

Figure 1.2 The typical block diagram of a convolutional neural network.


1.3.3.1 Convolutional layer
This layer applies the convolution operation to the given signal using a suitable filter to derive the low-level features. The size of the low-level features depends on the size of the kernel: the kernel slides over the complete input signal step by step and computes the dot product between the kernel values and the input values, which provides an activation map, that is, a set of low-level features. The equation for deriving the activation maps (low-level features) is shown below:

$$\text{Low-level feature} = \text{input signal} \times \text{kernel} \tag{1.5}$$

For instance, a suitable kernel size for analyzing a 2D image of size 25 × 25 × 2 is k × k × 2, where k = 3, 5, 7, and so on. However, the size of the kernel needs to be smaller than the size of the input. The kernel mask moves over the complete signal step by step, computes the dot product between the two, and finally forms the low-level features or activation maps. Fig. 1.3 illustrates the formation of an activation map using the input and the kernel filters.

1.3.3.2 Pooling layer
The layer just after the convolutional layer is the pooling layer. The pooling layer reduces the size of the activation map, which results in the generation of medium-level features. It provides the input for the subsequent layers if the model consists of deeper layers. The pooling layer discards some data, which in turn helps to reduce the chance of overfitting since the complexity of the model decreases. The preferred window size is selected for sliding over the entire input to provide the

Figure 1.3 Convolutional layer.


medium-level features based on a reliable subsampling approach. The most widely adopted subsampling approaches are averaging and taking the maximum value [25]. The maximum subsampling approach has been chosen in this study because it provides unique performance features. An example of computing medium-level features using the maximum subsampling approach is illustrated in Fig. 1.4.

1.3.3.3 Fully connected layer
Generally, the output of the final pooling layer is connected to the fully connected layers. The fully connected layers perform high-level logical operations to provide high-level features. This layer also yields the final decision. This study adopts a dropout layer to reduce overfitting of the model.
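The sliding dot product of the convolutional layer and the maximum subsampling of the pooling layer can be sketched as follows. This is a single-channel illustration with hypothetical function names; it uses no padding and a stride of one for the convolution, which are assumptions rather than the settings of the chapter's model.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2d(fmap, size=2, stride=2):
    """Keep the maximum of each window, which shrinks the activation map."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out
```

A 5 × 5 input with a 3 × 3 kernel gives a 3 × 3 activation map, which shows why the kernel must be smaller than the input.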

1.3.4 Database
In the literature related to the PhysioNet 2016 challenge, all the studies employed the PCG signal for analyzing heart abnormality. The basis for selecting the ECG over the PCG is that the analysis of heart abnormality can be extended by improving the classification performance obtained with the noisy ECG signal. The PhysioNet/Computing in Cardiology Challenge 2016 [26] provides a database that consists of six sets of PCG recordings and a few sets of ECG recordings. Most researchers employ the ECG for segmenting the PCG signal before feature extraction. The primary objective of the challenge is to help researchers develop novel methods for classifying heart abnormalities from PCG recordings collected under different circumstances. The PhysioNet/CinC 2016 challenge provides 409 ECG recordings for data set A with a sampling frequency of 2 kHz. The length of the ECG recordings varies from 12 to 37 seconds. The ECG recordings were obtained from both abnormal and healthy subjects in different clinical settings. Some of the ECG recordings

Figure 1.4 Pooling layer.


were highly corrupted by noise during recording. Initially, we manually removed 67 noisy ECG recordings from the database. To analyze the ECG recordings uniformly, we used the first 12 seconds of each recording in this study. To avoid overfitting of the model, the training data and the validation data were made mutually exclusive. The electrical activity of the heart (ECG recording) was collected from different subjects and stored in .dat format. Detailed information about the database is given in Ref. [27]. Detailed information regarding the ECG recordings was not clearly stated in the PhysioNet database provided for this challenge. We manually scrutinized all the ECG recordings and found that they had been obtained from different leads in different cases. Hence, our proposed method focuses on identifying features based on the QRS complex.

1.4 Methodology
Some of the limitations of the previous related works are that (1) systems based on deep learning, which are less reliant on the features and the field of application, still need to be promoted, and (2) most of the previous works have failed to enhance the classification accuracy of the system. To overcome these limitations, we propose a technique based on the QRS complex that uses time-frequency features of the ECG recording in the form of scalogram images. Fig. 1.5 illustrates the design of the proposed ECG classification based on a transfer learning approach. The raw ECG signals are preprocessed, and the preprocessed output is then classified using a transfer learning approach based on a deep learning algorithm. Preprocessing includes detrending of the ECG signals followed by removal of noise and artifacts by applying a bandpass filter (BPF). The output of the BPF is passed to the CWT for conversion into scalogram images. Finally, the scalogram images are used to train and validate a CNN-based deep learning model.

Figure 1.5 The block diagram of the proposed ECG classification model.


1.4.1 Preprocessing
The raw ECG signals of the PhysioNet 2016 challenge contain noise and artifacts, and their length varies from 12 to 37 seconds. Hence, the first 12 seconds of each ECG recording were used for uniform analysis. Most of the ECG recordings collected from the PhysioNet database were affected by a baseline trend, which is introduced into the ECG recording by breathing and the ambulatory motion of the patient. A DWT can be used to detect and remove the trend. The raw ECG recording is decomposed to the 10th level using a Daubechies wavelet function. The approximation coefficient at level 10 carries the trend affecting the ECG recording. Fig. 1.6A illustrates the raw ECG recording affected by the baseline trend. Fig. 1.6B represents the approximation coefficient at level 10 of the ECG recording. To eliminate the baseline trend, the approximation coefficient at level 10 was set to zero and the signal was reconstructed. Fig. 1.7 shows the detrending of the raw ECG signal using the DWT. The ECG signal is also corrupted by noise and motion artifacts. After detrending the ECG recording, a bandpass filter with a passband of 5–15 Hz was applied by cascading a low-pass and a high-pass filter in series, as employed by Pan and

Figure 1.6 ECG signal with annotation a0011 from 100 samples to 24,099 samples, that is, 12 s. (A) Raw ECG affected by baseline trend. (B) Approximation coefficient at 10th level.


Figure 1.7 Detrending of ECG signal.

Figure 1.8 Removal of noise after applying a passband filter.

Tompkins [28]. Fig. 1.8 illustrates the elimination of noise by implementing a passband filter. The transfer function of low- and high-pass filters with order 2 is shown in Eqs. (1.6) and (1.7) respectively.

$$H(Z) = \frac{\left(1 - Z^{-6}\right)^2}{\left(1 - Z^{-1}\right)^2} \tag{1.6}$$

$$H(Z) = \frac{-1 + 32 Z^{-16} + Z^{-32}}{1 + Z^{-1}} \tag{1.7}$$

The amplitude response of low- and high-pass filters is illustrated in Eqs. (1.8) and (1.9) respectively.


Figure 1.9 Scalogram representation of filtered ECG signal.

$$|H(\omega T)| = \frac{\sin^2(6\pi f T)}{\sin^2(\pi f T)} \tag{1.8}$$

$$|H(\omega T)| = \frac{\sqrt{256 + \sin^2(32\pi f T)}}{\cos(\pi f T)} \tag{1.9}$$
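As a rough check of the low-pass stage only, the sketch below implements the difference equation obtained by expanding Eq. (1.6), namely y[n] = 2y[n-1] - y[n-2] + x[n] - 2x[n-6] + x[n-12], and compares the gain of the transfer function with the closed form of Eq. (1.8). The function names and the sampling period passed in the usage example are assumptions for illustration.

```python
import numpy as np


def lowpass(x):
    """Recursive low-pass filter from Eq. (1.6):
    y[n] = 2y[n-1] - y[n-2] + x[n] - 2x[n-6] + x[n-12]."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += 2.0 * y[n - 1]
        if n >= 2:
            y[n] -= y[n - 2]
        if n >= 6:
            y[n] -= 2.0 * x[n - 6]
        if n >= 12:
            y[n] += x[n - 12]
    return y


def lowpass_gain_from_tf(f, T):
    """|H| evaluated on the unit circle, Z = exp(j*2*pi*f*T)."""
    z = np.exp(1j * 2 * np.pi * np.asarray(f, dtype=float) * T)
    return np.abs((1 - z ** -6) ** 2 / (1 - z ** -1) ** 2)


def lowpass_gain_closed_form(f, T):
    """Closed-form amplitude response of Eq. (1.8); undefined where sin(pi*f*T) = 0."""
    f = np.asarray(f, dtype=float)
    return np.sin(6 * np.pi * f * T) ** 2 / np.sin(np.pi * f * T) ** 2
```

Feeding a unit impulse through `lowpass` produces the triangular sequence 1, 2, ..., 6, ..., 2, 1, and for frequencies such as 1, 5, 11, and 20 Hz with T = 1/200 s the two gain functions agree to numerical precision.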

After eliminating the unwanted noise, the filtered ECG signal is passed through the CWT based on the Morlet wavelet function to transform the one-dimensional ECG signals into two-dimensional scalogram images. The time-frequency scalogram representation of the filtered ECG signal is shown in Fig. 1.9.
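The DWT-based detrending step described earlier in this section (decompose, set the coarsest approximation to zero, reconstruct) can be sketched as below. This sketch substitutes the Haar wavelet, the simplest member of the Daubechies family, because the order of the Daubechies wavelet is not stated; the function name and the requirement that the length be a multiple of 2 to the power of the level are simplifying assumptions.

```python
import numpy as np


def haar_detrend(x, levels):
    """Remove a slow trend by zeroing the level-`levels` Haar approximation."""
    x = np.asarray(x, dtype=float)
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    details = []
    a = x
    # Analysis: split into approximation (pair sums) and detail (pair differences).
    for _ in range(levels):
        even, odd = a[0::2], a[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        a = (even + odd) / np.sqrt(2.0)
    # Discard the coarsest approximation, which carries the baseline trend.
    a = np.zeros_like(a)
    # Synthesis: invert each level using the stored detail coefficients.
    for d in reversed(details):
        up = np.empty(2 * len(a))
        up[0::2] = (a + d) / np.sqrt(2.0)
        up[1::2] = (a - d) / np.sqrt(2.0)
        a = up
    return a
```

With the Haar wavelet, zeroing the level-L approximation is equivalent to subtracting the mean of each block of 2^L samples, so a constant offset is removed while faster oscillations are preserved.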

1.4.2 Classification based on deep learning
A CNN is a feed-forward deep learning algorithm based on a supervised learning approach that extracts features on its own. Transfer learning is used to provide a faster and more efficient classification model. Hence, the proposed method employs a modified AlexNet model for classifying heart anomalies using ECG scalogram images. AlexNet [29] is a standard model pretrained on millions of images from the ImageNet database. The classification model is obtained from AlexNet by modifying the final fully connected layer to have two classes, since it needs to classify a recording as normal or abnormal. Several researchers have employed modified AlexNet models for classifying heart abnormality [30], classifying wetlands [31], and predicting prostate cancer [32]. The modified AlexNet model consists of five convolutional layers, three pooling layers, and three fully connected layers. Each convolutional layer is followed by a ReLU activation function that helps to provide an efficient training performance. The ReLU activation is shown in Eq. (1.10). The proposed CNN model for classifying heart anomalies is illustrated in Fig. 1.10. The description of the layers of the modified AlexNet model is given in Table 1.1.

$$f(c) = \max(0,\, wc + b) \tag{1.10}$$

where c, w, and b denote the features, weight, and bias, respectively.
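Eq. (1.10) can be written directly in code. The sketch below applies the ReLU to a scalar weight and bias for readability, and adds a softmax of the kind used in the final two-class layer; both function names are hypothetical, and a real layer would use weight matrices rather than scalars.

```python
import numpy as np


def relu_unit(c, w, b):
    """Eq. (1.10): f(c) = max(0, w*c + b), applied elementwise."""
    return np.maximum(0.0, w * np.asarray(c, dtype=float) + b)


def softmax(z):
    """Convert final-layer scores into class probabilities that sum to one."""
    z = np.asarray(z, dtype=float)
    shifted = np.exp(z - z.max())  # subtract the maximum for numerical stability
    return shifted / shifted.sum()
```

For example, with w = 2 and b = 1 the inputs -2, 0, and 3 map to 0, 1, and 7, so negative pre-activations are clipped to zero.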


Figure 1.10 The proposed modified AlexNet model for heart anomaly detection.

Table 1.1 The layer information of the proposed model.

| Layers | Kernel size | Output size | Padding | Stride | Information |
|---|---|---|---|---|---|
| Input | | 227 × 227 × 3 | | | Normalization |
| Convolution1 | 11 × 11 | 55 × 55 × 96 | [0 0 0 0] | [4 4] | ReLU |
| CCN | | 55 × 55 × 96 | | | Cross channel |
| Maxpool1 | 3 × 3 | 27 × 27 × 96 | [0 0 0 0] | [2 2] | Pooling |
| Convolution2 | 5 × 5 | 13 × 13 × 256 | [2 2 2 2] | [1 1] | ReLU |
| CCN | | 13 × 13 × 256 | | | Cross channel |
| Maxpool2 | 3 × 3 | 6 × 6 × 256 | [0 0 0 0] | [2 2] | Pooling |
| Convolution3 | 3 × 3 | 6 × 6 × 384 | [1 1 1 1] | [1 1] | ReLU |
| Convolution4 | 3 × 3 | 6 × 6 × 384 | [1 1 1 1] | [1 1] | ReLU |
| Convolution5 | 3 × 3 | 6 × 6 × 256 | [1 1 1 1] | [1 1] | ReLU |
| Maxpool3 | 3 × 3 | 4 × 4 × 256 | [0 0 0 0] | [2 2] | Pooling |
| Fullyconnected1 | | 4096 | | | 50% Dropout |
| Fullyconnected2 | | 4096 | | | 50% Dropout |
| Fullyconnected3 | | 2 | | | Softmax |

CCN, Cross channel normalization; ReLU, Rectified Linear Unit activation functions.

1.4.3 Decision fusion
Most of the available literature reports that the classification performance based on a decision fusion approach improves significantly [16,33,34]. Different classifiers may have different classification performance for the same classification problem. Hence, many researchers have worked on various decision fusion approaches for improving the classification performance. Nguyen et al. [33] generated a decision fusion model by using two binary classifiers. By implementing the aforementioned concept, the


Figure 1.11 The proposed decision fusion model.

Table 1.2 Setup parameter for heart anomaly detection based on deep learning.

| CNN parameter | Value |
|---|---|
| Initial learning rate | $10^{-4}$ |
| Weight decay rate | $5 \times 10^{-4}$ |
| Iterations | 500 |
| Mini batch size | 10 |
| Maximum epochs | 20 |

proposed decision fusion approach has been adopted for comparing our proposed method with other traditional decision fusion methods. The proposed decision fusion model is illustrated in Fig. 1.11. It has been developed by modifying the final fully connected layer of the AlexNet model with base classifiers such as KNN, SVM, Ensemble, and LDA (linear discriminant analysis).

1.4.4 Training the convolutional neural network model
The proposed model is modified from the typical AlexNet model since the aim of the model is to classify the ECG signal as normal or abnormal. Hence, the last layer of the AlexNet model has been remodeled with only two classes. Some of the parameters have been changed as follows: (1) the initial learning rate has been changed to $10^{-4}$; (2) the weight decay has been changed to 0.0005; (3) the test iteration has been changed to 500; (4) to balance the number of samples, the number of test iterations that should occur per test interval; (5) the maximum test iteration has been changed to 1000. The changed parameters used for training the model are shown in Table 1.2.


1.4.5 Performance parameter
The performance of the proposed model is evaluated by computing sensitivity and specificity, together with the overall accuracy. The equations for these performance parameters are:

$$\text{Sensitivity} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \times 100\% \tag{1.11}$$

$$\text{Specificity} = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}} \times 100\% \tag{1.12}$$

$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}} \times 100\% \tag{1.13}$$

Sensitivity defines the percentage of unhealthy subjects who have been correctly classified as cardiac patients. Specificity measures the percentage of healthy subjects who have been correctly classified as normal.
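The three performance parameters can be computed from the counts of a confusion matrix as in the short sketch below. The function name and the counts in the usage note are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) in percent, per Eqs. (1.11)-(1.13)."""
    sensitivity = 100.0 * tp / (tp + fn)   # abnormal subjects correctly detected
    specificity = 100.0 * tn / (tn + fp)   # healthy subjects correctly detected
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```

With the made-up counts tp = 8, fn = 2, tn = 6, and fp = 4, the function returns 80% sensitivity, 60% specificity, and 70% accuracy, which illustrates how a model can have high sensitivity and much lower specificity at the same time.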

1.5 Results and discussion
The proposed CNN model is trained using 70% of the scalogram images and validated using the remaining 30%. A total of 239 (abnormal: 168 and normal: 71) scalogram images are used for training. The trained model is then validated using 103 (abnormal: 72 and normal: 31) scalogram images. The proposed CNN model based on the modified AlexNet has been trained and validated using the setup parameters listed in Table 1.2. The training and validation results for the proposed model are shown in Fig. 1.12. The proposed model has a classification accuracy of 74.70% using corrupted and noisy ECG recordings. The error matrix of the proposed model is shown in Fig. 1.13.
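A 70/30 split that keeps the two classes separate can be sketched as follows. Whether the split in this study was stratified per class is an assumption; it is adopted here because a per-class 70% split of 240 abnormal and 102 normal images (the sums of the training and validation counts above) reproduces the reported 168 and 71 training images. The function name and the seed are illustrative.

```python
import random


def stratified_split(labels, train_percent=70, seed=0):
    """Split sample indices per class into mutually exclusive train/validation sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Integer arithmetic avoids floating-point rounding of the 70% cut.
        n_train = len(indices) * train_percent // 100
        train.extend(indices[:n_train])
        val.extend(indices[n_train:])
    return sorted(train), sorted(val)
```

Because each index is placed in exactly one of the two lists, the training and validation sets cannot overlap.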

Figure 1.12 Accuracy plot of the proposed CNN. (A) Training accuracy. (B) Validating accuracy.


Figure 1.13 Confusion matrix of the heart abnormality classification using the proposed CNN model.

Table 1.3 Comparative analysis of the proposed model with the existing methods.

| Classification method | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|
| Proposed method | 80.60 | 61.30 | 74.75 |
| GoogleNet | 80.55 | 51.61 | 71.84 |
| Decision fusion method | | | |
| AlexNet-Ensemble | 90.20 | 35.48 | 73.78 |
| AlexNet-KNN | 86.11 | 58.06 | 72.80 |
| AlexNet-SVM | 77.00 | 58.06 | 67.00 |
| AlexNet-LDA | 75.00 | 45.16 | 66.02 |
| Existing methods | | | |
| KNN [35] | 83.00 | 49.00 | 77.45 |
| Ensemble [36] | 89.00 | 32.00 | 72.55 |
| Homomorphic [37] | 96.00 | 8.00 | 71.00 |
| Fractal dimension [38] | 77.00 | 43.00 | 67.00 |
| Curve fitting [38] | 71.00 | 32.00 | 60.00 |

Notes: Most of the classifiers were trained and tested using ECG scalogram images, except those under “Existing methods,” which used PCG recordings for the classification of heart abnormality.

This section performs a comparative analysis for the classification of the cardiac abnormality using ECG scalogram images. The comparative result of the traditional methods with the proposed method is illustrated in Table 1.3. In this study, ECG characteristics based on the time-frequency domain scalogram images were used for predicting the heart abnormalities. The proposed method achieves a classification accuracy of 74.70% with 80.60% sensitivity and 61.30% specificity. Based on the traditional methods, the accuracy of models varies from 60.00% to 73.78% while sensitivity varied from 71.00% to 96.00%. But all of the conventional approaches have failed to improve


the specificity. One traditional method based on PCG recordings obtained the highest sensitivity of 96.00% but failed to improve the specificity at the same time. Still, AlexNet-Ensemble and Ensemble [36] have classification accuracy comparable to that of our proposed method. However, AlexNet-Ensemble has a complex structure due to its dual model, and the classification-based Ensemble [36] was based on noise-free PCG signals.

1.6 Conclusion
The proposed method is a novel approach for predicting heart abnormalities from noisy ECG recordings based on a deep learning algorithm. It uses ECG recordings from the PhysioNet 2016 challenge database. The 2D scalogram images were formed using the CWT after detrending and filtering the short and noisy ECG recordings. The model was trained and validated on these 2D scalogram images. The results show that the proposed method, which uses a CNN model based on deep learning, attains performance comparable to that of state-of-the-art methods. A limitation of analyzing heart abnormalities using ECG signals in this study is the high computational complexity it introduces, which increases the cost of achieving the classification performance. The cost of recording the ECG signal is also high compared to the PCG signal. This chapter, however, demonstrates how the ECG signal can be analyzed as an alternative approach to heart abnormality prediction with improved performance. The future scope of this work is to further enhance the classification performance of the model by employing both ECG and PCG features in tandem and, finally, to implement the above algorithm in hardware.

References
[1] E.J.D.S. Luz, T.M. Nunes, V.H.C. De Albuquerque, J.P. Papa, D. Menotti, ECG arrhythmia classification based on optimum-path forest, Expert Syst. Appl. 40 (9) (2013) 3561–3573.
[2] E.J.D.S. Luz, W.R. Schwartz, G. Cámara-Chávez, D. Menotti, ECG-based heartbeat classification for arrhythmia detection: a survey, Comput. Methods Prog. Biomed. 127 (2016) 144–164.
[3] M. Merone, P. Soda, M. Sansone, C. Sansone, ECG databases for biometric systems: a systematic review, Expert Syst. Appl. 67 (2017) 189–202.
[4] N. Dey, A.S. Ashour, F. Shi, S.J. Fong, R.S. Sherratt, Developing residential wireless sensor networks for ECG healthcare monitoring, IEEE Trans. Consum. Electron. 63 (4) (2017) 442–449.
[5] A.F. Khalaf, M.I. Owis, I.A. Yassine, A novel technique for cardiac arrhythmia classification using spectral correlation and support vector machines, Expert Syst. Appl. 42 (21) (2015) 8361–8368.
[6] A. De Gaetano, S. Panunzi, F. Rinaldi, A. Risi, M. Sciandrone, A patient adaptable ECG beat classifier based on neural networks, Appl. Math. Comput. 213 (1) (2009) 243–249.
[7] P. Melin, J. Amezcua, F. Valdez, O. Castillo, A new neural network model based on the LVQ algorithm for multi-class classification of arrhythmias, Inf. Sci. 279 (2014) 483–497.


[8] R.J. Martis, U.R. Acharya, H. Prasad, C.K. Chua, C.M. Lim, J.S. Suri, Application of higher order statistics for atrial arrhythmia classification, Biomed. Signal. Process. Control. 8 (6) (2013) 888–900.
[9] E. Ramírez, O. Castillo, J. Soria, Hybrid system for cardiac arrhythmia classification with fuzzy K-nearest neighbors and neural networks combined by a fuzzy inference system, Soft Computing for Recognition Based on Biometrics, Springer, Berlin, Heidelberg, 2010, pp. 37–55.
[10] S. Dilmac, M. Korurek, ECG heart beat classification method based on modified ABC algorithm, Appl. Soft Comput. 36 (2015) 641–655.
[11] S. Shadmand, B. Mashoufi, A new personalized ECG signal classification algorithm using block-based neural network and particle swarm optimization, Biomed. Signal. Process. Control. 25 (2016) 12–23.
[12] J. Mateo, A.M. Torres, A. Aparicio, J.L. Santos, An efficient method for ECG beat classification and correction of ectopic beats, Comput. Electr. Eng. 53 (2016) 219–229.
[13] S.N. Yu, Y.H. Chen, Electrocardiogram beat classification based on wavelet transformation and probabilistic neural network, Pattern Recognit. Lett. 28 (10) (2007) 1142–1150.
[14] M. Thomas, M.K. Das, S. Ari, Automatic ECG arrhythmia classification using dual tree complex wavelet based features, AEU Int. J. Electron. Commun. 69 (4) (2015) 715–721.
[15] R.S.S. Kumari, J.G. Devi, Classification of cardiac arrhythmias based on morphological and rhythmic features, Int. J. Biomed. Eng. Technol. 14 (3) (2014) 192–208.
[16] K. Li, W. Pan, Y. Li, Q. Jiang, G. Liu, A method to detect sleep apnea based on deep neural network and hidden markov model using single-lead ECG signal, Neurocomputing 294 (2018) 94–101.
[17] S.A. Singh, S. Majumder, A novel approach OSA detection using single-lead ECG scalogram based on deep neural network, J. Mech. Med. Biol. 19 (4) (2019) 1950026.
[18] S. Mukhopadhyay, S. Biswas, A.B. Roy, N. Dey, Wavelet based QRS complex detection of ECG signal, arXiv preprint arXiv:1209.1563, 2012.
[19] B. Ergen, Y. Tatar, H.O. Gulcur, Time-frequency analysis of phonocardiogram signals using wavelet transform: a comparative study, Comput. Meth. Biomech. Biomed. Eng. 15 (4) (2012) 371–381.
[20] N. Dey, A.S. Ashour, S. Borra (Eds.), Classification in BioApps: Automation of Decision Making, vol. 26, Springer, Cham, 2017.
[21] M.P. Wachowiak, D.C. Hay, M.J. Johnson, Assessing heart rate variability through wavelet-based statistical measures, Comput. Biol. Med. 77 (2016) 222–230.
[22] C.H. Lin, Y.C. Du, T. Chen, Adaptive wavelet network for multiple cardiac arrhythmias recognition, Expert Syst. Appl. 34 (4) (2008) 2601–2611.
[23] M. Kumar, R.B. Pachori, U.R. Acharya, Characterization of coronary artery disease using flexible analytic wavelet transform applied on ECG signals, Biomed. Signal Process. Control 31 (2017) 301–308.
[24] K. Lan, D.T. Wang, S. Fong, L.S. Liu, K.K. Wong, N. Dey, A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42 (8) (2018) 139.
[25] C.Y. Lee, P.W. Gallagher, Z. Tu, Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree, in: Artificial Intelligence and Statistics, 2016, pp. 464–472.
[26] G.D. Clifford, C. Liu, B. Moody, D. Springer, I. Silva, Q. Li, et al., Classification of normal/abnormal heart sound recordings: the PhysioNet/Computing in Cardiology Challenge 2016, 2016 Computing in Cardiology Conference (CinC), IEEE, 2016, pp. 609–612.
[27] C. Liu, D. Springer, Q. Li, B. Moody, R.A. Juan, F.J. Chorro, et al., An open access database for the evaluation of heart sound algorithms, Physiol. Meas. 37 (12) (2016) 2181.
[28] J. Pan, W.J. Tompkins, A real-time QRS detection algorithm, IEEE Trans. Biomed. Eng. 32 (3) (1985) 230–236.
[29] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[30] J.P. Dominguez-Morales, A.F. Jimenez-Fernandez, M.J. Dominguez-Morales, G. Jimenez-Moreno, Deep neural networks for the recognition and classification of heart murmurs using neuromorphic auditory sensors, IEEE Trans. Biomed. Circuits Syst. 12 (1) (2017) 24–34.


[31] M. Rezaee, M. Mahdianpari, Y. Zhang, B. Salehi, Deep convolutional neural network for complex wetland classification using optical remote sensing imagery, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11 (9) (2018) 3030–3039.
[32] T. Kajikawa, N. Kadoya, K. Ito, Y. Takayama, T. Chiba, S. Tomori, et al., Automated prediction of dosimetric eligibility of patients with prostate cancer undergoing intensity-modulated radiation therapy using a convolutional neural network, Radiol. Phys. Technol. 11 (3) (2018) 320–327.
[33] H.D. Nguyen, B.A. Wilkins, Q. Cheng, B.A. Benjamin, An online sleep apnea detection method based on recurrence quantification analysis, IEEE J. Biomed. Health Inform. 18 (4) (2013) 1285–1293.
[34] S. Md Noor, J. Ren, S. Marshall, K. Michael, Hyperspectral image enhancement and mixture deep-learning classification of corneal epithelium injuries, Sensors 17 (11) (2017) 2644.
[35] S.A. Singh, S. Majumder, Classification of unsegmented heart sound recording using KNN classifier, J. Mech. Med. Biol. 19 (2019) 1950025.
[36] M. Khened, V.A. Kollerathu, G. Krishnamurthi, Fully convolutional multi-scale residual DenseNets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers, Med. Image Anal. 51 (2019) 21–45.
[37] D.B. Springer, L. Tarassenko, G.D. Clifford, Logistic regression-HSMM-based heart sound segmentation, IEEE Trans. Biomed. Eng. 63 (4) (2015) 822–832.
[38] M. Hamidi, H. Ghassemian, M. Imani, Classification of heart sound signal using curve fitting and fractal dimension, Biomed. Signal. Process. Control. 39 (2018) 351–359.


CHAPTER TWO

Single-layer convolution neural network for cardiac disease classification using electrocardiogram signals

P. Gopika, C.S. Krishnendu, M. Hari Chandana, S. Ananthakrishnan, V. Sowmya, E.A. Gopalakrishnan and K.P. Soman
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00003-X
© 2020 Elsevier Inc. All rights reserved.

2.1 Introduction
Revolutionary changes have occurred in biomedical research during the past several decades. The basic challenge is to apply biology and engineering techniques to study and solve problems in the life sciences, especially in medicine [1]. Heart diseases are a major cause of sudden death: cardiovascular disease causes about 26 million deaths annually [2], representing 31% of all global deaths. Early detection and diagnosis of heart disease can decrease the mortality rate. One of the most feasible and inexpensive methods of diagnosis is the electrocardiogram (ECG). The process of recording the electrical activity of the heart by placing electrodes over the skin is known as electrocardiography. Heart activity can be observed from the ECG graph: for each heartbeat, the heart emits a series of electrical discharge spikes, and the sensors placed over the skin record these spikes. The graph of voltage versus time produced by this medical procedure is an electrocardiogram, commonly called an ECG. An ECG can be used to measure the position and size of the heart chambers, the rhythm and rate of heartbeats, the presence of any damage to the conduction system or the heart’s muscle cells, the effects of heart drugs, and the function of implanted pacemakers. An ECG signal is composed of a successive, monotonous repetition of “PQRST” [3]. The signal consists of a P wave, a QRS complex, and a T wave, which are periodic in nature. The first half of the P wave represents the depolarization of the right atrium; the second half represents the depolarization of the left atrium. The QRS wave, sometimes called the QRS complex, represents the depolarization of the ventricles, and the T wave is produced by the repolarization of the ventricles [4]. An irregularity in

heartbeat can be detected using parameters such as the P wave, QRS complex, T wave components, RR interval, and the shape and duration of the heart signal in ECG signals [5]. An RR interval represents the time elapsed between two successive R waves of the QRS complex on the electrocardiogram. Arrhythmia and myocardial infarction (MI) are two common cardiac diseases worldwide. When there is a decrease or a blockage in the blood flow to the heart, intense pain in the chest occurs, which may lead to death. This is known as a heart attack or MI [1]. MI is one of the most common heart diseases in India, and it is mainly due to narrowed coronary arteries, which cause inadequate oxygen supply to the myocardium [6]. An irregularity in the heart rate or rhythm is known as arrhythmia. If the heart beats faster than 100 beats/min, it is called tachycardia; if it beats slower than 60 beats/min, it is called bradycardia. An irregular and fast heartbeat is known as atrial fibrillation [7]. Early diagnosis of heart diseases through ECG signal analysis is important. In the literature, a number of methods based on machine learning [8], signal processing [9], data mining, and deep learning have been proposed [10]. In the current era of artificial intelligence (AI), a system can learn from data, identify patterns, and make decisions with minimal human intervention using various machine learning and deep learning algorithms, which are the fundamental roots of AI [11]. Various machine learning techniques can be used for the classification of ECG signals into healthy and unhealthy subjects. Machine learning algorithms employ two strategies: supervised and unsupervised. In supervised learning, the data contain the class label information. In unsupervised learning, the data do not have label information. The algorithms used in the unsupervised strategy include clustering algorithms, which cluster the data according to commonly shared criteria. Some of the conventional supervised machine learning techniques used for classification are the support vector machine (SVM) [12], random forest [13], naïve Bayes [14], etc. Currently, deep learning is the most popular machine learning technique used in various domains, including the medical field. Deep learning is a data-driven approach based on artificial neural networks (ANNs). ANNs are fast, accurate, self-adaptive, and nonlinear [15]. The main advantage of an ANN is that it uses an activation function to provide a nonlinear mapping between inputs and outputs. The sigmoid is an example of an activation function. Thus, a nonlinear problem such as the classification of ECG signals can be addressed with the help of the activation function. Statistical or deterministic approaches can achieve similar or better results [15]; however, although statistical methods perform well on linear problems, they cannot generate good results for nonlinear problems. Statistical methods are developed on the assumption of a given linear time series. ANNs can adaptively model the lower frequencies of the ECG, which are inherently nonlinear. It is possible to remove nonlinear noise characteristics from time-varying ECG signals using ANNs. In the medical field, the widely used deep learning algorithm is the convolutional neural network

23

Single-layer convolution neural network for cardiac disease classification using electrocardiogram signals

(CNN) [11]. CNN has a wide range of applications in the field of computer vision, including face recognition, scene labeling, image classification, action recognition, natural language processing, and so on [11]. This has motivated us to consider CNN for the study. In order to attain a high classification performance, the existing CNN architectures require deep layers because raw ECG signals are used as the input to the networks [15]. This in turn increases the computational complexity in terms of the number of learnable parameters. To address this issue, we propose to use the preprocessed and beat-segmented ECG signal, which is available in Kaggle [16]. The preprocessed ECG signal is given as input to the network. As a result, we were able to reduce the number of learnable parameters through a single-layer CNN architecture that classifies the different classes present in arrhythmia and MI. The rest of this chapter is organized as follows: Related works are presented in Section 2.2. The proposed methodology along with the data set description is provided in Section 2.3. The experimental result analysis is presented in Section 2.4. The chapter is concluded in Section 2.5.

2.2 Related works

In recent literature, the use of various techniques such as data mining, machine learning, and deep CNNs has improved the quality of diagnosis in the biomedical domain (refer to Table 2.1). For better classification, various preprocessing techniques, feature extraction techniques, and classifiers have been used. In Refs. [14] and [23], a discrete wavelet transform (DWT) method is used to extract RR intervals, which are normalized using a z-score to classify the ECG signal. The classification accuracy attained is 99.05% [15]. A classification technique for ECG heartbeat classification on the MIT-BIH Arrhythmia Database uses a deep neural network that consists of seven hidden layers; the classification accuracy attained using this method is 99.68% [21]. A nine-layer deep CNN has been proposed to classify heartbeats; it attained 94.03% accuracy on the MIT-BIH Arrhythmia Database containing five classes [18]. Another technique derives morphological and dynamic features of the heartbeat by computing DWT features and four RR interval features, which are fed to a feed-forward neural network with 10 hidden layers. The data set taken from the MIT-BIH Arrhythmia Database contains 13,724 beats and the data set taken from the MIT-BIH Supraventricular Arrhythmia Database contains 22,151 beats. The average accuracies attained for class-oriented evaluation and subject-oriented evaluation are 99.75% and 99.84%, respectively [22]. A 10-layer deep CNN model applied to the standard 12-lead ECG signal achieved accuracy and sensitivity of over 99.0% [20]. An 11-layer CNN has been used for automated differentiation of shockable and nonshockable ventricular arrhythmia from 2-second ECG segments; with 10-fold cross-validation it achieved maximum accuracy, sensitivity, and specificity of 93.18%, 95.32%, and 91.04%, respectively [17]. A 12-layer 1D CNN model has been proposed to classify single-lead individual heartbeat signals into five classes of heart diseases, with an achieved accuracy of 97.6% [19]. In Ref. [24], deep learning architectures, namely CNN and RNN, are used for ECG signal classification.

Table 2.1 Related works for cardiac disease classification diagnosed using ECG.
Researcher | Method/architecture | Accuracy (%)
Acharya et al. [17] | 11-layer CNN and 10-fold cross-validation | 93.18
Acharya et al. [18] | 9-layer CNN | 94.03
Zhang et al. [19] | 12-layer 1D CNN | 97.6
Baloglu et al. [20] | 10-layer CNN on 12-lead ECG signal | 99.00
Dallali et al. [15] | DWT to extract RR intervals and normalize them using a z-score, and FCM to classify the ECG signal | 99.05
Sannino and De Pietro [21] | 7-layer CNN | 99.68
Syed Muhammad Anwar et al. [22] | DWT features, four RR interval features, and Teager energy values with 10-layer CNN | 99.68
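The z-score normalization of RR intervals mentioned in the related works can be sketched as follows. This is only an illustration: the RR values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator the cited works use.

```python
from statistics import mean, pstdev


def z_score(values):
    """Return (v - mean) / std for each value, using the population std (an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for constant input")
    return [(v - mu) / sigma for v in values]


# Hypothetical RR intervals in seconds; the fourth beat is noticeably longer.
rr_intervals = [0.80, 0.82, 0.78, 1.10, 0.79]
normalized = z_score(rr_intervals)
print([round(z, 2) for z in normalized])
```

After normalization the values have zero mean and unit standard deviation, so the unusually long interval stands out regardless of the subject's baseline heart rate.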
All of these proposed models have a very large number of learnable parameters due to the presence of many convolutional layers, which increases the computational complexity. To overcome these disadvantages we use feature-extracted data and a single-layer CNN for the classification of arrhythmia and MI diseases. The proposed architecture gives better results than state-of-the-art architecture in terms of accuracy and number of learnable parameters.

2.3 Methodology

2.3.1 Convolutional neural network

CNN has a wide variety of applications in various fields. The first layer is the convolution layer. The convolutional layer is the primary building block of a CNN. It extracts the high-level features from the input signal. The pooling layer follows the convolution layer. The pooling operation is chosen according to the application; the available operations include max-pooling, min-pooling, and average pooling. The pooling operation is mainly used for dimensionality reduction and to select the most significant features. These features are fed to the fully connected layer, which applies an activation function.

Table 2.2 The summary of the network architecture details for the arrhythmia and myocardial infarction classification.
Layer | Myocardial infarction: output size | Myocardial infarction: learnable parameters | Arrhythmia: output size | Arrhythmia: learnable parameters
Conv1D | 187 × 64 | 256 | 187 × 64 | 256
Max pooling | 93 × 64 | 0 | 93 × 64 | 0
Flatten | 5952 × 1 | 0 | 5952 × 1 | 0
Dense | 128 × 1 | 7,61,984 | 128 × 1 | 7,61,984
Dropout | 128 × 1 | 0 | 128 × 1 | 0
Dense | 2 × 1 | 258 | 5 × 1 | 645
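The pooling operations described in this section can be sketched in a few lines. The helper name `pool1d`, the toy signal, and in particular the stride of 2 are assumptions made for illustration; the chapter states the pool length but not the stride. With a pool length of 3 and a stride of 2, a 187-sample input yields 93 outputs, which is one setting consistent with the sizes listed in Table 2.2.

```python
def pool1d(signal, pool_length, stride, reduce_fn):
    """Slide a window over `signal` and reduce each full window with `reduce_fn`."""
    return [
        reduce_fn(signal[start:start + pool_length])
        for start in range(0, len(signal) - pool_length + 1, stride)
    ]


def average(window):
    """Arithmetic mean of one pooling window."""
    return sum(window) / len(window)


# Toy signal, for illustration only.
signal = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 6.0]
max_pooled = pool1d(signal, pool_length=3, stride=2, reduce_fn=max)
min_pooled = pool1d(signal, pool_length=3, stride=2, reduce_fn=min)
avg_pooled = pool1d(signal, pool_length=3, stride=2, reduce_fn=average)
print(max_pooled, min_pooled, avg_pooled)

# Length check for a 187-sample beat (stride of 2 is an assumption).
print(len(pool1d([0.0] * 187, pool_length=3, stride=2, reduce_fn=max)))
```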

2.3.2 Network architecture

The network parameters such as the number of filters, filter size, number of hidden layers, batch size, and learning rate are fixed by hyperparameter tuning. The optimum parameter values are obtained from various trials of experiments. The memory block neuron size in the CNN is varied from 2. The learning rate is tuned from 0.01 to 0.2. The learning rate of 0.1 is fixed by considering the computational time and cost. The pool length of the max pooling layer is varied as 2, 3, and 5. The performance is high in the case of 64 neurons and a pool length of 3. The dense layer has 128 neurons. The dropout regularization technique is used after the dense layer. The summary of the network architecture for arrhythmia is tabulated in Table 2.2. The number of neurons in the input layer is 187. In the first convolution layer, 64 filters, each of size 1 × 3, are used. Therefore the number of learnable parameters including the bias is 256 in the first layer ((1 × 3 × 64) + 64). The same network parameters are used for both arrhythmia and MI up to the penultimate layer. The number of learnable parameters in the dense layer including bias is 7,61,984 ((5952 × 128) + 128). The output layer contains 2 and 5 neurons in the case of MI and arrhythmia, respectively.
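The parameter counts quoted above can be checked with plain arithmetic. The sketch below assumes a single input channel and one bias per filter or unit; the helper names are hypothetical. The dense layer count 761,984 is the value written as 7,61,984 in the text.

```python
def conv1d_params(in_channels, kernel_size, filters):
    """Weights (in_channels * kernel_size per filter) plus one bias per filter."""
    return in_channels * kernel_size * filters + filters


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


conv = conv1d_params(in_channels=1, kernel_size=3, filters=64)
flattened = 93 * 64                    # max-pooling output 93 x 64, flattened
dense = dense_params(flattened, 128)
output_mi = dense_params(128, 2)       # two MI classes
output_arrhythmia = dense_params(128, 5)  # five arrhythmia classes

print(conv, flattened, dense, output_mi, output_arrhythmia)
```

Almost all of the learnable parameters sit in the first dense layer, so the size of the flattened pooling output dominates the model size.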

2.4 Experimental result and analysis

2.4.1 Data set description

2.4.1.1 Arrhythmia

In this chapter, we use the Arrhythmia Database available at Kaggle [16]. It is derived from the Physionet MIT-BIH (Massachusetts Institute of Technology-Beth Israel Hospital) database [16,25]. With this database, arrhythmia can be studied as beat-segmented and preprocessed ECG signals [26]. Each instance in the database is annotated by physicians in accordance with the Association for the Advancement of Medical Instrumentation (AAMI) standard. It is resampled at a sampling frequency of 125 Hz. The database consists of five classes, which are shown in Table 2.3. A sample preprocessed and beat-segmented ECG is shown in Fig. 2.1.

Table 2.3 Category and annotation of arrhythmia according to the AAMI standard [27].
Category | Annotation
N | Normal; Nodal escape; Atrial escape; Left/Right bundle branch block
S | Atrial premature; Nodal premature; Supraventricular premature
V | Ventricular escape; Premature ventricular contraction
F | Fusion of ventricular and normal
Q | Unclassifiable; Paced; Fusion of paced and normal

Figure 2.1 A sample of beat segmented and pre-processed ECG signal.

2.4.1.2 Myocardial infarction

In this chapter, we use the preprocessed and beat-segmented ECG database available at Kaggle. This database is derived from the Physikalisch-Technische Bundesanstalt Diagnosis Database (PTBDB) [16]. It is resampled at a sampling frequency of 125 Hz. It consists of a database collected from 200 subjects. Among the 200 subjects, 52 subjects are diagnosed as healthy and 148 subjects are diagnosed with MI. The beat-segmented and preprocessed database consists of 14,552 instances with two categories. The numbers of train and test samples used for arrhythmia and MI are tabulated in Table 2.4.

Table 2.4 Summary of train and test samples considered for ECG signal classification.
Disease | Class | No. of test samples | No. of training samples
Arrhythmia | 0 | 18,118 | 72,471
Arrhythmia | 1 | 556 | 2224
Arrhythmia | 2 | 1448 | 27,265
Arrhythmia | 3 | 162 | 641
Arrhythmia | 4 | 1608 | 6430
Arrhythmia | Total samples | 21,892 | 1,09,031
Myocardial infarction | 0 | 900 | 3236
Myocardial infarction | 1 | 2012 | 8404
Myocardial infarction | Total samples | 2912 | 11,640

2.4.2 Arrhythmia disease classification using proposed convolutional neural network

The Arrhythmia database has 1,09,446 beat-segmented and preprocessed ECG signals of five different classes, which include the normal category and four different abnormal categories. From each class, 80% of the data are considered for training and 20% for testing. The number of epochs is one of the hyperparameters considered for training the model; it is fixed as 1000 [16]. The epoch versus accuracy and the epoch versus loss are shown in Figs. 2.2 and 2.3, respectively. From the figures, it is observed that as the number of epochs increases, the accuracy increases and the loss decreases for the arrhythmia disease classification using the proposed single-layer CNN architecture. The performance of the model is evaluated using standard metrics: precision, recall, and F1 score. The class-wise results obtained for the arrhythmia classification using the proposed single-layer convolution neural network are tabulated in Table 2.5. It is evident from Table 2.5 that the proposed work has high precision (0.98), recall (0.98), and F1 score (0.98). The existing benchmark architecture using a residual CNN has a precision, recall, and F1 score of 0.94 each [16]. Also, the existing work considered more samples for training the deep learning architecture, whereas the proposed work improves the classification performance using fewer samples to train the proposed deep learning model. A confusion matrix is also used to show the performance of the model. The confusion matrix obtained for the arrhythmia classification using the proposed single convolution layer is shown in Fig. 2.4. From Fig. 2.4, it is observed that the classification performance of classes 0, 2, and 4 is high.


Figure 2.2 Plot showing epoch vs accuracy for arrhythmia dataset.

Figure 2.3 Plot showing epoch vs. loss for arrhythmia dataset.


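Figure 2.4 Confusion matrix obtained for classification of arrhythmia using proposed method.

Table 2.5 Class-wise performance measures for arrhythmia classification using the proposed single-layer CNN architecture.
Evaluation metric | Class | Value
Accuracy (%) | All | 98.04
Precision | Q | 0.98
Precision | F | 0.90
Precision | V | 0.96
Precision | S | 0.81
Precision | N | 0.99
Precision | Total | 0.98
Recall | Q | 1
Recall | F | 0.67
Recall | V | 0.93
Recall | S | 0.77
Recall | N | 0.98
Recall | Total | 0.98
F1 score | Q | 0.99
F1 score | F | 0.77
F1 score | V | 0.94
F1 score | S | 0.79
F1 score | N | 0.99
F1 score | Total | 0.98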
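As a consistency check on Table 2.5, the F1 score can be recomputed from the reported precision and recall using the standard definition F1 = 2PR / (P + R). The sketch below uses only the class-wise values from the table; because those inputs are already rounded to two decimals, the recomputed values can differ from the reported ones in the second decimal place.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per class, copied from Table 2.5.
table_2_5 = {
    "Q": (0.98, 1.00, 0.99),
    "F": (0.90, 0.67, 0.77),
    "V": (0.96, 0.93, 0.94),
    "S": (0.81, 0.77, 0.79),
    "N": (0.99, 0.98, 0.99),
}

for label, (precision, recall, reported) in table_2_5.items():
    recomputed = f1_score(precision, recall)
    print(f"{label}: recomputed {recomputed:.3f}, reported {reported:.2f}")
```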

The false positive rate and the false negative rate should be low for a reliable model. The ROC curve is a graphical representation of the trade-off between the true positive rate and the false positive rate at different thresholds. Fig. 2.5 shows the class-wise ROC curve obtained for the proposed single-layer convolution neural network. The area under the curve (AUC) represents the degree of separability. The higher the AUC, the better the classification model, as obtained for the proposed single-layer CNN architecture for arrhythmia classification (Fig. 2.5).
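How an ROC curve and its AUC are obtained from classifier scores can be sketched for a single class treated as positive against the rest. The labels and scores below are hypothetical, the function names are illustrative, and the sketch assumes all scores are distinct so that each sample moves the threshold by one step.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs from a threshold sweep."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Visit samples from the highest score down; each step lowers the threshold.
    for _score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive class) and distinct classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 3))
```

An AUC of 1.0 means every positive sample is scored above every negative one, while 0.5 corresponds to a ranking no better than chance.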


Figure 2.5 ROC curve for Arrhythmia classification using the proposed CNN architecture.

Figure 2.6 Epoch versus accuracy for myocardial infarction classification.


Table 2.6 Classification performance of arrhythmia classification using SVM (RBF kernel).
Evaluation metric | Value
Accuracy | 0.91
Precision | 0.90
Recall | 0.91
F1 score | 0.91

2.4.3 Arrhythmia classification using support vector machine

We also evaluated the performance of arrhythmia classification using an SVM classifier. The input to the SVM is the beat-segmented and preprocessed data. The SVM finds the best hyperplane to separate the classes. The analysis is carried out using different kernels. Among the different kernels, the performance of arrhythmia classification is highest with the RBF kernel. C is the hyperparameter tuned to reduce misclassification on the training data. The classification performance is evaluated using the standard metrics precision, recall, and F1 score. The classification performance obtained using SVM is tabulated in Table 2.6.
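The RBF kernel named above measures similarity between two beats as a decaying function of their squared Euclidean distance. The sketch below shows only the kernel function, not SVM training or the tuning of C; the value of `gamma` and the short sample vectors are hypothetical, since the chapter does not report the kernel width or list beat values.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical, very short "beats"; real inputs here would have 187 samples.
beat_a = [0.0, 0.2, 1.0, 0.3, 0.1]
beat_b = [0.0, 0.25, 0.9, 0.35, 0.1]
beat_c = [0.5, 0.5, 0.2, 0.9, 0.8]

gamma = 0.5  # assumed kernel width, not taken from the chapter
print(rbf_kernel(beat_a, beat_b, gamma))  # similar beats: value close to 1
print(rbf_kernel(beat_a, beat_c, gamma))  # dissimilar beats: smaller value
```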

2.4.4 Myocardial infarction disease classification using the proposed convolutional neural network

The input to the proposed CNN is the beat-segmented and preprocessed ECG signal for MI classification. The architecture details are given in Table 2.2. The final output layer depends on the number of output classes. The proposed architecture is the same as in the case of arrhythmia classification. The model is retrained with the same hyperparameters that were used for arrhythmia. The epoch versus accuracy and the epoch versus loss are plotted in Figs. 2.6 and 2.7, respectively. It is observed from Figs. 2.6 and 2.7 that the accuracy is stabilized and the loss is minimized after 200 epochs. The confusion matrix for MI classification is shown in Fig. 2.8. From the confusion matrix, it is seen that the normal class and the abnormal class have attained 97% and 100% correct classification, respectively. The classification results of MI using the proposed single-layer CNN are tabulated in Table 2.7. From Table 2.7, it is observed that the proposed method attains a precision, recall, and F1 score of 0.99. The misclassification rate of the proposed method is very low. The ROC curve in Fig. 2.9 shows that the model has high performance.


Figure 2.7 Epoch versus loss for myocardial infarction classification.

Figure 2.8 Confusion matrix for myocardial infarction classification using the proposed single layer CNN.

The performance of the MI classification using SVM is also computed. The results obtained for MI classification are tabulated in Table 2.8. The analysis shows that the precision, recall, and F1 score are each 0.92. These results show that the performance of the deep learning method is higher than that of the machine learning algorithm.


Table 2.7 Performance measures of myocardial infarction classification using the proposed single-layer CNN architecture.
Evaluation metric | Class | Value
Accuracy (%) | All | 98.76
Precision | 0 | 0.99
Precision | 1 | 0.99
Precision | Total | 0.99
Recall | 0 | 0.97
Recall | 1 | 1.00
Recall | Total | 0.99
F1 score | 0 | 0.98
F1 score | 1 | 0.99
F1 score | Total | 0.99

Figure 2.9 ROC for myocardial infarction classification using the proposed single-layer CNN architecture.

2.4.4.1 Comparison of the proposed work against the literature

The proposed work is compared against the literature in Table 2.9. From this table, it is clear that the proposed single-layer CNN architecture improves the classification performance for both arrhythmia and MI.


Table 2.8 Performance measures for the myocardial infarction classification using support vector machines.
Evaluation metric | Results obtained
Accuracy | 0.92
Precision | 0.92
Recall | 0.92
F1 score | 0.92

Table 2.9 Performance comparison of the proposed single-layer CNN model against the state-of-the-art CNN model for arrhythmia and myocardial infarction classification.
Data set | Model | Precision | Recall | F1 score
Arrhythmia | Existing [11] | 0.94 | 0.94 | 0.94
Arrhythmia | Proposed | 0.98 | 0.98 | 0.98
Myocardial infarction | Existing [11] | 0.96 | 0.95 | 0.95
Myocardial infarction | Proposed | 0.99 | 0.99 | 0.99

2.5 Conclusion

In this chapter, we proposed a CNN with a single convolution layer for the classification of arrhythmia and MI. We trained the CNN with the preprocessed and beat-segmented ECG beats of arrhythmia and MI. The single convolution layer CNN learns all the required features in a single layer because the data are beat-segmented and preprocessed. The results show that the performance of arrhythmia and MI classification is enhanced by the proposed method. The method can be used by physicians as an assistive diagnostic tool. The limitation of the present work is that the proposed single-layer CNN architecture is validated on only two types of cardiac disease. The future scope of the present work is to extend the analysis to other cardiac diseases.

References

[1] D.T. Wang, K. Lan, et al., A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42 (8) (2018) 139.
[2] P. Ponikowski, et al., Heart failure: preventing disease and death worldwide, ESC Heart Fail. 1 (1) (2014) 4–25.
[3] S. Mukhopadhyay, S. Biswas, A.B. Roy, N. Dey, Wavelet based QRS complex detection of ECG signal, Int. J. Eng. Res. Appl. 2 (3) (2012) 2248–2622, arXiv preprint arXiv:1209.1563.
[4] J.W. Hurst, Naming of the waves in the ECG, with a brief account of their genesis, Circulation 98 (18) (1998) 1937–1942.
[5] Y. Ozbay, B. Karlik, A recognition of ECG arrhythmias using artificial neural networks, in: Conference Proceedings of the 23rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Istanbul, Turkey, 2001.
[6] N. Dey, A.S. Ashour, S. Borra, Classification in BioApps: Automation of Decision Making, vol. 26, Springer, 2017.
[7] A. Jezzini, M. Ayache, L. Elkhansa, Z. Al Abidin Ibrahim, ECG classification for sleep apnea detection, in: International Conference on Advances in Biomedical Engineering (ICABME), IEEE, Beirut, Lebanon, 2015.
[8] P. Shimpi, S. Shah, M. Shroff, A. Godbole, A machine learning approach for the classification of cardiac arrhythmia, in: International Conference on Computing Methodologies and Communication (ICCMC), Erode, 2017.
[9] N.V. Thakor, Y.S. Zhu, Applications of adaptive filtering to ECG analysis: noise cancellation and arrhythmia detection, Biomed. Eng. 38 (8) (1991) 785–794.
[10] M.J. Shivajirao, L.N. Sanjay, A.G. Ashok, Arrhythmia disease classification using artificial neural network model, in: IEEE International Conference on Computational Intelligence and Computing Research, Coimbatore, India, 2010.
[11] Y.W. Chang, et al., Training and testing low-degree polynomial data mappings via linear SVM, J. Mach. Learn. Res. 11 (2010) 1471–1490.
[12] J.A. Nasiri, et al., ECG arrhythmia classification with support vector machines and genetic algorithm, in: Third UKSim European Symposium on Computer Modeling and Simulation, Athens, 2009.
[13] M. Kropf, D. Hayn, G. Schreier, ECG classification based on time and frequency domain features using random forests, in: Computing in Cardiology (CinC), Rennes, 2017.
[14] T. Soman, O.B. Patrick, Classification of arrhythmia using machine learning techniques, WSEAS Trans. Comput. 4 (6) (2005) 548–552.
[15] A. Dallali, A. Kachouri, M. Samet, Classification of cardiac arrhythmia using wt, hrv, and fuzzy c-means clustering, Signal. Process. Int. J. 5 (3) (2011) 101–109.
[16] M. Abadi, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
[17] U.R. Acharya, et al., A deep convolutional neural network model to classify heartbeats, Comput. Biol. Med. 89 (2017) 389–396.
[18] U.R. Acharya, et al., Deep convolutional neural network for the automated diagnosis of congestive heart failure using ECG signals, Appl. Intell. 89 (2018) 389–396.
[19] W. Zhang, L. Yu, L. Ye, W. Zhuang, F. Ma, ECG signal classification with deep learning for heart disease identification, in: International Conference on Big Data and Artificial Intelligence (BDAI), IEEE, Beijing, China, 2018.
[20] U.B. Baloglu, et al., Classification of myocardial infarction with multi-lead ECG signals and deep CNN, Pattern Recogn. Lett. 122 (2019) 23–30.
[21] G. Sannino, G. De Pietro, A deep learning approach for ECG-based heartbeat classification for arrhythmia detection, Futur. Gener. Comput. Syst. 86 (2018) 446–455.
[22] S.M. Anwar, M. Gul, M. Majid, M. Alnowami, Arrhythmia classification of ECG signals using hybrid features, Comput. Math. Methods Med. 2018 (2018) 1–8.
[23] M.F. Amri, M.I. Rizqyawan, A. Turnip, ECG signal processing using offline-wavelet transform method based on ECG-IoT device, in: 3rd International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), IEEE, Semarang, Indonesia, 2016.
[24] L. Guo, G. Sim, B. Matuszewski, Inter-patient ECG classification with convolutional and recurrent neural networks, arXiv preprint arXiv:1810.04121, 2018.
[25] A.L. Goldberger, et al., PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (23) (2000) 215–220.
[26] physionet MIT-BIH Dataset, Kaggle, [Online]. Available: <https://www.kaggle.com/mondejar/mitbih-database>, 2018.
[27] M. Kachue, F. Shayan, M. Sarrafzadeh, Ecg heartbeat classification: a deep transferable representation, in: IEEE International Conference on Healthcare Informatics (ICHI), New York, NY, USA, 2018.


CHAPTER THREE

Generalization performance of deep autoencoder kernels for identification of abnormalities on electrocardiograms
Gokhan Altan and Yakup Kutlu
Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00004-1
© 2020 Elsevier Inc. All rights reserved.

3.1 Introduction

Deep learning (DL) has become the most popular classification approach in recent years, with feature transferring algorithms and large numbers of hidden layers that provide detailed analysis [1]. DL is not only a simple machine learning algorithm but also a specific type of artificial neural network that incorporates feature learning and feature reduction. The most prominent features that make DL superior to other classifiers are the feature learning stages, feature reduction, and pretraining of the model at the first stage [2]. Feature learning allows the classification parameters to be defined in a particular space away from randomness using unsupervised techniques. Predefined classification parameters support optimization and help designate the relationship between input features when building capable models. Transferred learning parameters and pretrained weights, obtained according to the similarity of the data within a fixed threshold, are passed on to the supervised training models. Predefined parameters enable fitting optimum DL models with many hidden layers and a large number of neurons at each hidden layer [3]. The most popular DL algorithms are adapted to image recognition applications using the advantages of extracting low- and high-level features at the unsupervised stage, whereas the majority focus on initializing the weights to fit the supervised models by transferring unsupervised parameters [4]. DL algorithms used for machine vision are based on advanced image processing techniques at the first stage to extract high- and low-level features. Among these algorithms, generative adversarial networks and convolutional neural networks are the leading methods in the literature. Applying image-based DL algorithms to time series requires the conversion of signals to image plots to assess the pathologies. This conversion results in representing time series using two-dimensional structures, that is, in increasing the dimension of the data to be processed. Therefore, processing the increased dimensionality of data at the feature learning stages stands out as
an extra demand on computation power [5]. In particular, the use of nonstandardized time series, frequency features, resolution features, signal length, and differences in extracting the start- and end-points of the time series further reduce the functionality of these DL algorithms. Despite these disadvantages, some studies achieved high classification performances on short-term signal plots by applying image-based DL algorithms to identify the pathological segment [6,7]. In the case of long-term signals, the plotting size increases considerably; thus, providing a detailed analysis using many hidden layers results in time-consuming analysis and training processes. For this reason, rather than only feeding time-series signals directly into classification models as input, it has been considered that fiducial and nonfiducial feature extraction methods should be adapted to DL algorithms as a pretraining stage. The transformation and statistical features obtained by pretraining on the time series help to identify significant and characteristic features of time-series signals, and they also provide a feature dimensionality reduction to fit the DL model [8,9]. In this context, deep belief networks (DBNs) and deep extreme learning machines (deep ELMs), which are frequently chosen to analyze time-series signals and their transformation features, have become a topical issue in recent developments. The DBN classifier determines the pretrained weights by using unsupervised restricted Boltzmann machines (RBMs). The DBN triggers a layer-wise pretraining, which initializes the weights depending on the probability and energy distribution of the data at adjacent layers [3]. The weights obtained from previous RBMs are used to calculate the weight vector of the nodes at an adjacent layer. Each layer has a connection with the adjacent hidden layers, but there is no connection between nodes at the same layer.
Each node weight is related to the connected nodes at adjacent layers and is used to calculate the output weights and node weight vector for the upper layer [3,10]. Similar to the DBN, the deep ELM model initializes the weights of the connections between layers by using autoencoder algorithms between adjacent layers, with the hidden layer serving as both input and output [11]. Autoencoder models are of great importance for identifying the relationship between layers for a more detailed analysis of big data. The autoencoder has a simple mathematical theory that needs no advanced computation capability even for complicated models, and it enhances and accelerates DL algorithms [12]. Lan et al. reported a survey on the efficiency of DL approaches and machine learning algorithms in healthcare systems. They mentioned the advantages of DL algorithms over common classifier models in recent data analytics [13]. Here, we emphasize the performance in generalization, robustness, and speed of the DBN and of the deep ELM with different autoencoder kernels on raw short-term electrocardiogram (ECG) signals and on energy-time-frequency features of the time series, separately. The aim of the study is to evaluate the performance of autoencoder kernels on classification, feature learning, and generating representations on ECG features with
energy-frequency-time modulations. The contributions of the study are combining the efficiency of feature reduction capabilities of the autoencoder kernels and fiducial features on ECGs, designing supervised learning kernels with Hessenberg decomposition technique in the pretraining of the deep models, and accelerating the training speed using ELM kernels and sparse autoencoder models. The remaining part of the chapter details the autoencoder models, deep autoencoder algorithms, and ELM autoencoder kernels on deep ELM structures in the next sections. The preprocessing and feature extraction on ECG recordings, classification models, experimental setups on deep models, and performance measurements of the proposed coronary disease identification system are evaluated and the achievements discussed. Further, the existing DL algorithms and deep ELM kernel are experimentally tested and compared with the proposed deep ELM kernel on ECG-based features.

3.2 Autoencoder

Autoencoder kernels are unsupervised models that generate different representations of the input data by setting the target values equal to the inputs. The adjustable size of the encoded representations has made the autoencoder an adaptable method at the unsupervised stages of DL algorithms [14]. An autoencoder is an unsupervised neural network that compresses data from multiple dimensions to a preferred dimensionality. It reconstructs the input data using the hidden layer weights calculated by encoding [12,15]. The more similar the representations obtained after encoding and reconstruction are to the input data, the higher the performance of the modeled autoencoder is considered to be. An autoencoder consists of three layers: an input layer, a hidden layer, and an output layer. The input and output layers have an equal number of nodes, because the purpose of the autoencoder is to initialize the hidden layer parameters that will reconstruct the multidimensional input data [15–17]. Fig. 3.1 depicts a simple autoencoder. Encoding is the process between the input layer and the hidden layer. The encoder is usually used to reduce multidimensional data to low-dimensional data (compressed representation), whereas it may also produce different representations of equal or higher dimension (sparse representation). Decoding is the process of constructing the output between the hidden layer and the output layer using the hidden layer output weights [11,12,18]. Decoding aims to reconstruct the input data from the sparse or compressed representations at the preferred dimensionality. When x denotes the input data, $\hat{x}$ is the output data, that is, the reconstruction of the input x by the autoencoder model. Since the autoencoder is usually used for compression, the hidden layer is called a bottleneck. However, if some sort of correlation exists between input features, the autoencoder can learn this structure easily [19,20], and the learned structure is then leveraged when the input is forced through the network's bottleneck and subsequently decoded and reconstructed. Additionally, when an autoencoder is used in DL to transfer feature learning, it aims to perform a nonlinear feature reduction depending on the similarities of the input data $(x_1, x_2, \ldots, x_j)$. The autoencoder forces the system to merge features that are similar within a certain threshold into single nodes in order to decode different representations [11]. If the autoencoder uses a smaller number of nodes at the hidden layer than the input dimension ($L < j$) to reduce the feature dimensionality and to extract significant characteristics, the model is designated as under-complete or as having a compressed representation. Conversely, if the number of nodes in the hidden layer is greater than the size of the input data, the model is designated as over-complete or as having a sparse representation [11]. The capacity, number of nodes, and type of the encoder and decoder must be selected considering the complexity of the distribution of the input data in order to perform successful unsupervised training on the autoencoder architecture [17,20].

Figure 3.1 An autoencoder network.


In addition to compression and feature reduction, autoencoder models are used to denoise input data; they are often preferred for video editing and image generation. However, an autoencoder is sometimes incorrectly described as a standard filter. It is wrong to treat a constructed autoencoder model as a constant filter, because each autoencoder consists of output weights that are optimized depending on the input [14,21]. The autoencoder takes the input data $x$ and maps it to $H$:

$$H = \sigma_1(Wx + b_1) \tag{3.1}$$

where $H$ denotes the hidden layer variables (the latent representation), $\sigma_1$ is the activation function, $W$ is the encoder weight matrix, and $b_1$ is the encoder bias.

$$\hat{x} = \sigma_2(\beta H + b_2) \tag{3.2}$$

where $\hat{x}$ is the output of the autoencoder, $\sigma_2$ is an activation function, $\beta$ is the output weight matrix, and $b_2$ is the decoder bias. The decoder parameters may differ from the corresponding encoder parameters, depending on the model, layer depth, and node size of the autoencoder [22]. Composing the encoder and the decoder gives

$$\hat{x} = \sigma_2(\beta(\sigma_1(Wx + b_1)) + b_2) \tag{3.3}$$

While the autoencoder should be sufficiently sensitive to the input to achieve a successful reconstruction, it should also be insensitive enough to the training data to prevent overfitting, which is one of the most important problems of learning models [15,23]. This sensitivity trade-off requires autoencoder models to retain only the variations in the input data that are needed, excluding redundancies within the input. It is achieved by defining a reconstruction loss function ($L_f$), which encourages the autoencoder to be sensitive to the inputs, and by adding a regularization term ($\phi_{ae}$) that balances the autoencoder between discouraging overfitting and gaining input sensitivity [15].

$$L_f = \lVert x - \hat{x} \rVert^2 + \phi_{ae} \tag{3.4}$$

$$L_f = \lVert x - \sigma_2(\beta(\sigma_1(Wx + b_1)) + b_2) \rVert^2 + \phi_{ae} \tag{3.5}$$
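The encoder, decoder, and loss of Eqs. (3.1)–(3.5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the chapter: the sigmoid activations, the random weights, and in particular the choice of an L2 weight penalty for $\phi_{ae}$ (controlled by the hypothetical parameter `weight_decay`) are assumptions, since the text does not fix the form of the regularization term.

```python
import numpy as np

def sigmoid(z):
    """Unipolar sigmoid activation, used here for both sigma_1 and sigma_2."""
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W, b1, beta, b2):
    """Encode x to H (Eq. 3.1) and decode H to x_hat (Eq. 3.2)."""
    H = sigmoid(W @ x + b1)          # latent representation
    x_hat = sigmoid(beta @ H + b2)   # reconstruction of the input
    return H, x_hat

def reconstruction_loss(x, x_hat, beta, W, weight_decay=1e-3):
    """Squared reconstruction error plus a regularization term (Eq. 3.4).

    Assumption: phi_ae is an L2 penalty on the weights.
    """
    phi_ae = weight_decay * (np.sum(W ** 2) + np.sum(beta ** 2))
    return np.sum((x - x_hat) ** 2) + phi_ae

rng = np.random.default_rng(0)
j, L = 8, 3                                   # L < j: under-complete (compressed) model
x = rng.random(j)
W, b1 = rng.normal(size=(L, j)), np.zeros(L)
beta, b2 = rng.normal(size=(j, L)), np.zeros(j)

H, x_hat = autoencoder_forward(x, W, b1, beta, b2)
loss = reconstruction_loss(x, x_hat, beta, W)
```

Choosing `L` larger than `j` in the same sketch would give the over-complete (sparse) case described earlier.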

Autoencoder kernels have moved to the forefront of generative modeling in recent years owing to the theoretical connections between hidden variables. An autoencoder can be considered a special type of feed-forward network that can generate representations using linear and nonlinear kernels. Algorithms such as conjugate gradient descent and contrastive divergence can be used with back-propagation to minimize the training error by weight optimization. In light of these advantages, the output matrix of the autoencoder has provided machine learning algorithms with a flexible modeling capacity in recent years by defining neuron weights for multilayer classifiers [24,25].


3.3 Deep autoencoder

In theory, a deep autoencoder is an unsupervised method organized with many hidden layers for both the encoding and decoding stages. Increasing the number of hidden layers enables the model to generate a different representation at each level and to minimize the sensitivity associated with the input data. Using many hidden layers performs variational and probabilistic feature reduction and extracts more dominant features [16,21]. In particular, the ability to stack autoencoders has enhanced deep autoencoder approaches by transferring the parameters to the joint layers. A deep autoencoder consists of multiple encoder and decoder phases that perform a sequential transfer of encoding parameters followed by a stack of decoding parameters (see Fig. 3.2). The most significant feature of the deep autoencoder is that it enhances the assessment capabilities of a system by generating multiple representations of the input data, with different dimensions, at each layer; these can characterize the input data precisely for learning. Each additional hidden layer tends to learn even higher-order features [22]. Although increasing the number and size of hidden layers for detailed analysis and big data solutions has enabled systems to reach meaningful results, modeling DL structures for data analysis requires long training times and high computation capacity because of the increase in the number of optimizable parameters [12]. Therefore, recent studies have focused on reducing the training time of DL algorithms.

Figure 3.2 Deep autoencoder model. Diagram labels: input layer, hidden layers, output layers.


3.3.1 Extreme learning machine autoencoder

ELM is a fast and robust machine learning algorithm. It was introduced for the generalized single-layer feed-forward neural network (SLFN) in 2006–2008 [26–28]. ELM theory holds that randomly determined input weights can serve learning models without tuning for any distribution. In ordinary implementations, the model requires randomly defined hidden nodes and input weights without any optimization (see Fig. 3.3). Therefore, only the output weights of the model need to be calculated during the training of the ELM, which limits the use of back-propagation algorithms in ELM theory [29]. However, ELM learning approaches vary in the matrix inversion kernels and decomposition techniques they use. Conventional ELM theory has at its core the Moore–Penrose generalized matrix inverse [27]. $a_{ij}$ is the input weight, a randomly assigned parameter, between the $i$th node at the input layer and the $j$th node at the single hidden layer. $\beta_{ij}$ is the output weight, estimated using the ELM kernel during training, between the $i$th node at the hidden layer and the $j$th node at the output layer. The output weights connect the hidden nodes to the output layer nodes. $b_i$ is the bias of the $i$th node

Figure 3.3 ELM model.


at the hidden layer, which is a threshold value to avoid zero convergence. The SLFN with $L$ hidden nodes is formulated as:

$$f_L(x) = \sum_{i=1}^{L} G_i(x, W_i, b_i) \cdot \beta_i, \quad W_i \in \mathbb{R}^d, \quad b_i, \beta_i \in \mathbb{R} \tag{3.6}$$

where $G_i(\cdot)$ is the activation function of the $i$th node at the hidden layer.

$$G_i(x, W_i, b_i) = g(W_i x + b_i) \tag{3.7}$$

ELM theory proves that randomly selected parameters provide fast training and a universal approximation capability, using the least mean square algorithm to calculate the output weights [26–28]. ELM tries to minimize the training error together with the norm of the output weights:

$$\text{Minimize: } \lVert \beta \rVert_{v}^{\sigma_1} + \lambda \lVert H\beta - T \rVert_{u}^{\sigma_2} \tag{3.8}$$

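The parameters $\sigma_1$, $\sigma_2$, $u$, and $v$ were selected as suggested in Huang et al. [29] to obtain more robust learning and better generalization performance. $H$ denotes the hidden layer output matrix and $T$ denotes the training output matrix:

$$H = \begin{bmatrix} h_1(x_1) & h_2(x_1) & \cdots & h_{L-1}(x_1) & h_L(x_1) \\ h_1(x_2) & h_2(x_2) & \cdots & h_{L-1}(x_2) & h_L(x_2) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ h_1(x_{N-1}) & h_2(x_{N-1}) & \cdots & h_{L-1}(x_{N-1}) & h_L(x_{N-1}) \\ h_1(x_N) & h_2(x_N) & \cdots & h_{L-1}(x_N) & h_L(x_N) \end{bmatrix} \tag{3.9}$$

$$T = \begin{bmatrix} t_1^T \\ t_2^T \\ \vdots \\ t_{m-1}^T \\ t_m^T \end{bmatrix} \tag{3.10}$$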

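The output weight matrix of the ELM, $\beta$, is solved using the Moore–Penrose generalized inverse ($H^{\dagger}$):

$$\beta = H^{\dagger} T \tag{3.11}$$

$H^T H$ or $HH^T$ must be nonsingular to solve for the output weights $\beta$. Following ridge regression theory, a positive value $1/\lambda$ is therefore added to the diagonal of $H^T H$ or $HH^T$.

$$H^{\dagger} = (H^T H)^{-1} H^T \tag{3.12}$$

$$\beta = H^T \left( \frac{I}{\lambda} + HH^T \right)^{-1} T \tag{3.13}$$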


The output function $f(x)$ of the ELM is formulated as follows:

$$f(x) = h(x)\beta = h(x) H^T \left( \frac{I}{\lambda} + HH^T \right)^{-1} T \tag{3.14}$$
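The training procedure of Eqs. (3.6)–(3.14) reduces to one random projection and one regularized linear solve. The sketch below is a minimal illustration under stated assumptions, not the authors' code: the function names `elm_train` and `elm_predict`, the uniform initialization range, the sigmoid activation, the value of `lam`, and the toy two-cluster data are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, lam=1e3, seed=0):
    """Train a single-hidden-layer ELM; rows of X are samples.

    Input weights W and biases b are drawn at random and never tuned;
    only the output weights beta are computed, following Eq. (3.13).
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)                          # N x L hidden layer output matrix
    N = H.shape[0]
    # beta = H^T (I/lambda + H H^T)^{-1} T, via a linear solve instead of an explicit inverse
    beta = H.T @ np.linalg.solve(np.eye(N) / lam + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Output function f(x) = h(x) beta of Eq. (3.14)."""
    return sigmoid(X @ W + b) @ beta

# Toy data (assumed for illustration): two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(2.0, 0.5, size=(20, 2))])
labels = np.repeat([0, 1], 20)
T = np.eye(2)[labels]                               # one-hot training output matrix

W, b, beta = elm_train(X, T, n_hidden=20)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
train_accuracy = np.mean(pred == labels)
```

Because no iterative optimization is involved, the cost of training is dominated by the single solve, which is the source of the speed advantage discussed in this section.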

3.3.2 Deep extreme learning machine autoencoder

The ELM algorithm, which stands out for its generalization capability and fast training speed, has been integrated into the unsupervised feature learning stage of DL as adaptive autoencoder-based kernels to enhance the efficiency of DL classifiers [30]. The detailed analysis of big data requires advanced hardware specifications to train a model. In particular, iterative optimization and long training times are major deficiencies of conventional classification algorithms that back-propagate the classification parameters using gradient descent [28]. Back-propagation algorithms remain slow and inadequate because they have too many parameters, including randomly assigned weights, biases, node parameters, iterative learning rates, activation functions, and more, to optimize by repetition in the training stages. Unlike these conventional machine learning algorithms, ELM theory applied to the autoencoder offers accelerated training speed and generalization capability. The ELM autoencoder generates multilayer deep models using ELM kernels, with a different representation at each layer [11]. Fig. 3.4 depicts the structure of the deep ELM. It uses the input data $X$ to generate the weights between $X$ ($H^0$) and $H^1$. The learned $H^1$ weights are used to calculate the output weights of the next hidden layer ($H^2$). In general, the $H^p$ weights are used for the $(p+1)$th hidden layer. The ELM autoencoder learns the hidden layer parameters at each step, repeatedly. At the last layer, ridge regression generates the output weights of the multilayer ELM model [12,30]. The ELM autoencoder theory varies according to the number of input neurons and hidden layer neurons of the model.
The feature mapping conditions are $W^T W = I$, $b^T b = 1$ for compressed and equal dimension representations, and $WW^T = I$, $bb^T = 1$ for sparse representation, on condition that the hidden layer parameters form a random orthogonal matrix. The ELM autoencoder learning process intends to solve $\beta^T \beta = I$. For compressed and sparse representations, the ELM autoencoder learns by minimizing

$$\lVert \beta \rVert_2^2 + \lambda \lVert H\beta - X \rVert_2^2 \tag{3.15}$$

and for equal dimension representation by minimizing

$$\lVert H\beta - X \rVert_2^2 \tag{3.16}$$
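The layer-wise procedure described above, with orthogonal random weights, the ridge solution of Eq. (3.15), and the transfer of $\beta^T$ to the next layer, can be sketched as follows. This is an illustrative sketch under assumptions: the helper names (`random_orthogonal`, `elm_autoencoder`, `deep_elm_features`), the sigmoid activation, the value of `lam`, and the random demonstration data are hypothetical, and the equal dimension case of Eq. (3.16) is not treated separately.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def random_orthogonal(rows, cols, rng):
    """Random matrix with orthonormal columns (rows >= cols) or orthonormal rows (rows < cols)."""
    A = rng.normal(size=(rows, cols))
    if rows >= cols:
        Q, _ = np.linalg.qr(A)        # Q is rows x cols with Q^T Q = I
        return Q
    Q, _ = np.linalg.qr(A.T)          # Q is cols x rows with Q^T Q = I
    return Q.T                        # rows x cols with W W^T = I

def elm_autoencoder(X, n_hidden, lam, rng):
    """Learn output weights beta that reconstruct X from a random hidden layer."""
    d = X.shape[1]
    W = random_orthogonal(d, n_hidden, rng)
    b = rng.normal(size=n_hidden)
    b /= np.linalg.norm(b)            # unit-norm bias, b^T b = 1
    H = sigmoid(X @ W + b)
    # Minimizer of ||beta||^2 + lam * ||H beta - X||^2
    return np.linalg.solve(np.eye(n_hidden) / lam + H.T @ H, H.T @ X)

def deep_elm_features(X, layer_sizes, lam=1e3, seed=0):
    """Stack ELM autoencoders; each layer reuses beta^T as its weights."""
    rng = np.random.default_rng(seed)
    rep, betas = X, []
    for n_hidden in layer_sizes:
        beta = elm_autoencoder(rep, n_hidden, lam, rng)
        rep = sigmoid(rep @ beta.T)   # representation passed to the next layer
        betas.append(beta)
    return rep, betas

X = np.random.default_rng(2).random((50, 8))
features, betas = deep_elm_features(X, layer_sizes=[12, 4])   # sparse layer, then compressed layer
```

The final `features` matrix would then be passed to a supervised ELM classifier, as the text describes for the last layer.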

Figure 3.4 Deep ELM structure and feature transferring. Diagram labels: ELM autoencoders for the hidden layers $H_1$, $H_2$, ..., $H_m$; transferred weights $\beta_1^T$, $\beta_2^T$, ..., $\beta_m^T$; ELM classifier with targets $T$.

The ELM autoencoder learns the output weights using randomly defined input weights through layer-wise feature representations [18]. The deep ELM autoencoder transfers $\beta^T$ to the model to assign the node parameters of each subsequent layer. $\beta_i^T$ is the weight matrix transferred by the ELM autoencoder between the $i$th and $(i+1)$th layers of the deep ELM classifier. At the last layer, the deep ELM model feeds the learned parameters to a supervised ELM classifier [12]. The ELM autoencoder varies depending on the matrix inversion technique and the dimensional difference between the input and the hidden layer. Inversion techniques may solve the learning of the autoencoder with different capabilities and speeds, depending on the variance dispersion of the input data [31]. Hessenberg decomposition-based ELM (HessELM) is one of the most robust and efficient kernels for classification [22]. HessELM is based on inverting a real symmetric matrix $X$. The solution of HessELM yields a tridiagonal matrix $H$ [32]. The solved $H$ matrix is the pseudoinverse of the output weights for the ELM autoencoder. A HessELM kernel operates on a square matrix.

$$H^{+} = H^T (H^T H)^{-1} \tag{3.17}$$


$$H^T H = Q U Q^{*} \tag{3.18}$$

where $Q$ is a unitary matrix and $U$ is the upper Hessenberg matrix,

$$H^{+} = H^T (Q U Q^{*})^{-1} \tag{3.19}$$

$$H^{+} = H^T Q U^{-1} Q^{*} \tag{3.20}$$
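A pseudoinverse based on the Hessenberg factorization of $H^T H$ can be sketched with `scipy.linalg.hessenberg`, which returns $U$ and $Q$ such that $H^T H = Q U Q^{*}$. The sketch is an assumption-laden illustration and not the HessELM implementation of [22]: the function name `hess_pinv` is hypothetical, the input is assumed to be real with full column rank, and the factorized inverse is applied on the left of $H^T$, which is the standard left pseudoinverse form of Eq. (3.12) for a non-square $H$.

```python
import numpy as np
from scipy.linalg import hessenberg

def hess_pinv(H):
    """Pseudoinverse of a real, full-column-rank H via Hessenberg factorization of H^T H."""
    A = H.T @ H                              # real symmetric L x L matrix
    U, Q = hessenberg(A, calc_q=True)        # A = Q U Q^T, with Q orthogonal
    # (Q U Q^T)^{-1} = Q U^{-1} Q^T because Q is orthogonal (Q^{-1} = Q^T)
    A_inv = Q @ np.linalg.solve(U, Q.T)
    return A_inv @ H.T, U

rng = np.random.default_rng(3)
H = rng.normal(size=(20, 5))
H_pinv, U = hess_pinv(H)
```

For a symmetric input such as $H^T H$, the upper Hessenberg factor `U` is tridiagonal up to rounding error, which agrees with the tridiagonal solution mentioned in the text.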

3.4 Deep analysis of coronary artery disease

Coronary artery disease (CAD) is a common cardiac disease that occurs when the arteries narrow because of cholesterol-containing plaque. The major blood vessels then have a limited capability to supply the heart with blood and oxygen, resulting in a cardiac abnormality [33]. This disease disrupts the rhythmic systolic activity of the heart [34]. CAD is a precursor of heart attack and congestive heart failure in adults. Diagnosing CAD is therefore an important step in keeping advanced cardiac conditions under control. Many clinical procedures are involved, including physical examinations, blood tests, ECG, echocardiogram, coronary angiography, and more [33,34]. In the literature, machine learning algorithms have been used to classify subjects with and without CAD. Polat and Gunes analyzed clinical features including age, sex, history, cardiac risk factors, heart rate, and more with a rule-based fuzzy classifier [35]. Many studies focused on clinical parameters and optimization algorithms with linear and nonlinear classifiers [36–39]. Akay et al. used physical examination metrics and frequency analysis of heart sounds to detect CAD [40]. Abdolmanafi et al. analyzed optical coherence tomography images of arteries to assess obstructions in the vessel using DL applications [41]. ECG is a standard diagnostic tool that records the electrical activity of the heart. It has many forms depending on the location of the electrodes during recording. The electrical changes plot a waveform that contains P, Q, R, S, T, and U waves at particular intervals. The waveforms in the ECG are directly related to the cardiac state of the heart. Hence, ECG is the most common diagnostic signal for prognosis, diagnosis, and monitoring of cardiac diseases [42]. Segmental, morphologic, and interval characteristics of the ECG waves are of principal importance in identifying subjects with cardiac and pulmonary abnormalities [43]. Dey et al. proposed a wireless sensor network-based application to monitor patients' health using ECG. They used a low-power technology for home control and smart-home services with a fuzzy logic-based diagnosis model [44]. Whereas many studies focused on clinical


parameters, most of them have analyzed ECG using fiducial digital signal processing techniques. Acharya et al. extracted heart rate variability (HRV) features and nonlinear features from the ECG [45]. Some studies focused on morphological features between P waves and on ST wave interval features [46,47]. Yıldırım et al. reviewed the efficiency of entropy features of the ECG in distinguishing patients with CAD from those without CAD [7]. In this study, short-term ECG recordings were analyzed using fiducial techniques and filtered raw ECG signals to identify CAD. The S and T waves of the ECG are the most significant characteristics representing the abnormalities that arise from CAD, so we chose the long-term ST database (ST-DB) to assess the efficiency of the ELM autoencoder kernels and DBN in CAD analysis. The ST-DB comprises 85 long-term ECG recordings (25 ECGs with non-CAD, 60 ECGs with CAD) from 80 subjects [48,49]. Each ECG was digitized at 250 Hz with 12-bit resolution. ECGs with CAD exhibit a variety of ischemic cases with dominant ST episodes. The ECG recordings vary in duration between 21 and 24 hours. Advanced signal processing techniques enable both time-domain and frequency-domain evaluation of time series. Because of the number of transformations involved, multistage transform techniques can result in time-consuming decomposition processes for long-term signals, and noisy data and limited hardware capacity can even lead to processes that never finish. Some cardiac diseases, such as arrhythmia, show instantaneous pathologies and nonperiodic, unsettled changes in morphology that require long-term ECG. Most cardiac diseases, however, including CAD, congestive heart failure, ischemia, and more, may be diagnosed using periodic abnormalities in short-term ECG waveforms. Considering that CAD is sufficiently evident in short-term ECGs through ST morphology, each long-term ECG was segmented into 10-second short-term ECGs using a moving window technique. In this way, we not only avoid time-consuming analysis processes but also increase the number of samples with CAD and non-CAD by 100×. The moving windows started at a manually selected P wave. Each short-term ECG was extracted uniquely so that segments share no common intervals. It was observed that 98% of the segmented short-term ECGs contain 10 PQRST complexes. The block diagram of the proposed CAD classification models is depicted in Fig. 3.5. Short-term ECGs and energy-time-frequency features from the ECGs were fed into the DL classifiers separately. Fiducial techniques can extract characteristic information from the time series. Different modulations and subbands preserve significant variations in various frequency ranges. We applied the Hilbert–Huang transform (HHT) to the short-term ECGs to extract instantaneous frequency data. HHT is an empirical two-step method composed of empirical mode decomposition (EMD) followed by a Hilbert transform [50]. HHT provides an adaptive and robust decomposition for nonlinear and nonstationary time-series signals, although its theory is incomplete and its stopping criterion is empirical [51,52].
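The moving-window segmentation described above, with non-overlapping 10-second windows at 250 Hz, amounts to a reshape of the recording. The following is a minimal sketch: the function name `segment_ecg` is hypothetical, the `start` argument stands in for the index of the manually selected P wave, and the synthetic signal is used only to show the shapes.

```python
import numpy as np

def segment_ecg(signal, fs=250, window_s=10, start=0):
    """Split a long ECG into non-overlapping windows of window_s seconds.

    start is the sample index of the first window (for example, a manually
    selected P wave); trailing samples that do not fill a window are dropped.
    """
    signal = np.asarray(signal)
    win = int(fs * window_s)                       # 2500 samples per segment
    n_windows = (len(signal) - start) // win
    return signal[start:start + n_windows * win].reshape(n_windows, win)

# Synthetic 65-second recording used only for illustration.
long_ecg = np.arange(250 * 65)
segments = segment_ecg(long_ecg)                   # shape: (6, 2500)
```

Each row of `segments` is one short-term ECG, and no sample appears in more than one row.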


Figure 3.5 Block diagram of the proposed CAD classification system.

EMD, which is the essential part of HHT, sifts intrinsic mode functions (IMFs) from any complicated input data. An IMF is a modulation in which the number of zero-crossings and the number of extrema are equal or differ at most by one, and in which the envelopes defined by the local maxima and the local minima are symmetric, so that their mean is zero [53]. A random segment in the EMD process is depicted in Fig. 3.6 with its envelopes and the original ECG signal. The number of IMFs may vary depending on the signal. Ensemble EMD (EEMD) addresses the mode-mixing problem by adding white noise to the signal over repeated trials and taking the mean of the IMFs obtained from conventional EMD at each stage of the sifting process [54]. Each IMF is a different frequency modulation of the time series.


Figure 3.6 A random segment at EEMD process.

$$X(t) = \sum_{j=1}^{n} \mathrm{IMF}_j + r_n \tag{3.21}$$

where $r_n$ is the residual signal, $X$ is the input signal, and $n$ is the number of sifted IMFs. The EEMD process extracts IMFs from each short-term ECG. Fig. 3.7 depicts the original signal and the sifted IMFs. In HHT, after the IMFs are sifted, the Hilbert transform is applied to each IMF to extract energy-time-frequency features. The Hilbert transform characterizes the instantaneous frequency of the IMFs by shifting the phase angle, which corresponds to a convolution with $1/(\pi t)$. It produces new modulations that have the same amplitude spectral density as the handled IMF [51]. The analytic representation obtained by applying the Hilbert transform to $\mathrm{IMF}_j$ is given below, where $a_i(t)$ is the amplitude and $\omega_i(t)$ is the instantaneous frequency.

$$\mathrm{IMF}_j = \Re\left\{ \sum_{i=1}^{n} a_i(t)\, e^{\,j \int \omega_i(t)\,dt} \right\} \tag{3.22}$$

HHT is finalized by applying the Hilbert transform to each IMF. Hilbert spectral analysis is a method that can potentially be used to assess the dispersion of the analysis. Here, each HHT-derived frequency modulation was analyzed as a base signal but was not used directly as the input data. Each short-term ECG had 10 IMFs. Statistical features, namely the minimum, maximum, skewness, median, mean, kurtosis, standard deviation, mode, and energy, were calculated from each IMF extracted from the short-term ECGs with CAD and non-CAD. The statistical features from each IMF, the statistical features from all IMFs, and the 10-second short-term ECGs were composed as three separate feature sets. They comprise 9 features, 81 features (9 IMFs × 9 features), and 2500 features (10 seconds × 250 Hz) for each IMF-based feature set, the combination of all IMF-based statistical features, and the raw ECG features, respectively. Thus, in this


Figure 3.7 ECG signal and extracted IMFs by EEMD.

way, we can gauge the accuracy of the DL classifiers on each IMF modulation and the contribution of the statistical features to identifying CAD with fiducial techniques on ECG. The DL classifiers have several training parameters to be iterated, including the number of neurons, transfer function, learning rate, number of hidden layers, back-propagation algorithm, training kernel, and more. It is essential to establish the optimum DL model for the problem. We fixed the learning rate at 0.001 and used the unipolar sigmoid function as the transfer function because of its efficiency and the ease of computing its derivative when updating the DL weights. The CAD identification performances were evaluated with DL models with three, four, and five hidden layers, in which the number of neurons per layer varied over a range of 50–350 in increments of 10 neurons. The DBN with RBM, the deep ELM, and the deep ELM with the HessELM autoencoder kernel were tested to classify short-term ECGs with CAD and non-CAD.
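The nine statistical features computed per IMF, and their concatenation into the 81-dimensional all-IMFs feature set, can be sketched as below. Several details are assumptions made for illustration, because the chapter does not specify them: the mode of a continuous signal is taken over values rounded to `decimals` places, energy is taken as the sum of squared samples, skewness and kurtosis use population moments, and the random arrays merely stand in for IMFs that would come from an EEMD implementation.

```python
import numpy as np

def imf_statistics(imf, decimals=2):
    """Nine statistical features of one IMF, in the order listed in the text.

    Order: minimum, maximum, skewness, median, mean, kurtosis,
    standard deviation, mode, energy. A constant IMF (zero standard
    deviation) is not handled by this sketch.
    """
    imf = np.asarray(imf, dtype=float)
    mean, std = imf.mean(), imf.std()
    centered = imf - mean
    skewness = np.mean(centered ** 3) / std ** 3
    kurtosis = np.mean(centered ** 4) / std ** 4
    # Assumption: mode of the values after rounding, since raw samples rarely repeat.
    values, counts = np.unique(np.round(imf, decimals), return_counts=True)
    mode = values[np.argmax(counts)]
    energy = np.sum(imf ** 2)
    return np.array([imf.min(), imf.max(), skewness, np.median(imf),
                     mean, kurtosis, std, mode, energy])

def all_imf_features(imfs):
    """Concatenate the per-IMF statistics into a single feature vector."""
    return np.concatenate([imf_statistics(imf) for imf in imfs])

# Placeholder IMFs: 9 modulations of a 10-second segment sampled at 250 Hz.
imfs = np.random.default_rng(4).normal(size=(9, 2500))
feature_vector = all_imf_features(imfs)            # 9 IMFs x 9 features = 81 values
```

Using `imf_statistics` on a single row gives the 9-feature set evaluated per IMF, while `all_imf_features` gives the combined set.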


The proposed models were evaluated using 10-fold cross-validation to avoid overfitting. In this method, the feature set is randomly divided into 10 uniform folds, each containing the same number of short-term ECGs with CAD and non-CAD. Nine folds are combined to train the DL model, and the remaining fold is used to test the trained DL model. This process is repeated until all folds have been tested [55]. Thus, we ensure that each short-term ECG feature is used in both training and testing, independently. The testing results are averaged to calculate the classification performance of the DL model. We calculated statistical test metrics including accuracy, specificity, precision, and sensitivity to evaluate the efficiency of the proposed DL models on the different data sets, since accuracy alone is not enough to establish system performance. The highest identification performances are presented in the results. High-dimensional feature sets may contain redundant features that do not contribute to identifying CAD; moreover, machine learning algorithms suffer from the curse of dimensionality on such feature sets. Thus, high-dimensional features can cause overfitting. We applied a sequential forward feature selection (SFFS) algorithm to determine the features that contribute to identifying CAD at a low computation cost [56]. SFFS starts with an empty feature set. At each step, it adds the most informative remaining feature to the chosen feature set to obtain a suboptimal feature set. The performance of a DL algorithm is evaluated by taking the training time into consideration, in addition to the classification accuracy and other statistical test metrics. Although detailed models with many hidden layers and a large number of neurons help to achieve high classification accuracies, researchers lose time while training such DL models.
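The fold construction and the four test metrics described above can be sketched as follows. This is a minimal illustration, not the evaluation code of the chapter: the function names are hypothetical, CAD is assumed to be encoded as label 1 and non-CAD as label 0, and the metrics are the usual confusion-matrix definitions (no guard is included for empty classes).

```python
import numpy as np

def stratified_folds(y, k=10, seed=0):
    """Randomly split sample indices into k folds with equal class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    return [np.array(f) for f in folds]

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and precision for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),     # true positive rate
        "specificity": tn / (tn + fp),     # true negative rate
        "precision": tp / (tp + fp),
    }

y = np.array([1] * 60 + [0] * 40)          # illustrative labels only
folds = stratified_folds(y, k=10)
example = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

In a full evaluation loop, each fold in turn would serve as the test set while the other nine are merged for training, and the per-fold metrics would be averaged.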
Recent enhancements to DL algorithms focus on integrating simple and robust mathematical theories into the training and pretraining stages to reduce the training time. Therefore, we present the training times of the experimental DL models in addition to the aforementioned statistical test metrics. A 10-second short-term ECG has 2500 data points (10 seconds × 250 Hz). The short-term ECG data set is a high-dimensional feature set in which every segment has a standardized starting point at a P wave. As presented in Table 3.1, when the short-term ECG feature set is fed into the DBN classifier, the proposed DBN models achieve accuracy rates of 34.27%, 36.20%, and

Table 3.1 CAD identification performances (%) using short-term ECG on DBN.

| Models | Number of nodes | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|---|
| 3 hidden layers | (80-60-90) | 34.26 | 32.73 | 19.02 | 55.86 | 248 |
| 3 hidden layers | SFFS | 42.15 | 42.30 | 23.19 | 63.56 | 205 |
| 4 hidden layers | (100-90-180-50) | 36.20 | 32.80 | 21.57 | 58.59 | 323 |
| 4 hidden layers | SFFS | 41.61 | 35.13 | 26.86 | 66.31 | 321 |
| 5 hidden layers | (100-120-80-90-110) | 45.19 | 37.97 | 29.57 | 70.86 | 532 |
| 5 hidden layers | SFFS | 56.88 | 51.60 | 37.45 | 80.27 | 384 |


Table 3.2 CAD identification performances (%) using short-term ECG on deep ELM with Moore–Penrose kernel.

| Models | Number of nodes | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|---|
| 3 hidden layers | (140-100-90) | 66.92 | 62.43 | 46.28 | 87.04 | 17 |
| 3 hidden layers | SFFS | 75.56 | 77.32 | 56.72 | 86.63 | 15 |
| 4 hidden layers | (160-120-90-90) | 67.75 | 64.27 | 47.02 | 86.59 | 19 |
| 4 hidden layers | SFFS | 78.86 | 79.32 | 61.04 | 89.54 | 19 |
| 5 hidden layers | (90-110-160-80-110) | 64.75 | 60.73 | 44.12 | 85.06 | 23 |
| 5 hidden layers | SFFS | 81.01 | 82.48 | 64.83 | 89.79 | 21 |

Table 3.3 CAD identification performances (%) using short-term ECG on deep ELM with HessELM kernel.

| Models | Number of nodes | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|---|
| 3 hidden layers | (140-90-160) | 52.52 | 52.30 | 31.66 | 72.77 | 17 |
| 3 hidden layers | SFFS | 62.48 | 59.48 | 41.74 | 82.48 | 16 |
| 4 hidden layers | (210-180-90-200) | 72.02 | 68.32 | 51.55 | 89.58 | 17 |
| 4 hidden layers | SFFS | 81.44 | 78.52 | 63.17 | 94.22 | 17 |
| 5 hidden layers | (90-220-170-120-80) | 60.66 | 57.72 | 40.02 | 81.10 | 25 |
| 5 hidden layers | SFFS | 65.08 | 63.82 | 43.96 | 82.77 | 23 |

45.19% for three, four, and five hidden layers, respectively. The highest classification performance was an accuracy rate of 56.88%, obtained using SFFS with a five-hidden-layer DBN model. Table 3.2 presents the classification performances of the deep ELM models with a Moore–Penrose kernel. The proposed deep ELM with a Moore–Penrose kernel achieved better classification performance than DBN on the short-term ECG feature set, with rates of 81.01%, 82.48%, and 89.79% for accuracy, sensitivity, and precision, respectively. Both DBN and deep ELM with a Moore–Penrose kernel distinguish CAD from non-CAD best with five hidden layers and SFFS. Deep ELM with a HessELM kernel achieved the highest classification performance, with an accuracy rate of 81.44% for four hidden layers (see Table 3.3). Considering the training time of the proposed models as a performance criterion, the deep ELM algorithms take the lead over DBN: the deep ELM kernels are about 15 times faster than DBN. IMF-based features were then fed into the DL algorithms, and each IMF was evaluated separately. Analyzing IMF-based features makes it possible to examine different frequency modulations in terms of their energy-time-frequency characteristics. As presented in Table 3.4, the DBN classifier reached classification performance rates of 87.59%, 85.38%, 72.59%, and 96.64% for accuracy, sensitivity, specificity, and precision, respectively, using IMF4 features. IMF9 is the least informative feature set for DBN in the identification of CAD using HHT. Deep ELM with a Moore–Penrose kernel has achieved more


Table 3.4 CAD identification performances (%) using IMF-based data sets on DBN.

| Feature set | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|
| IMF1 | 68.88 | 72.82 | 47.67 | 81.16 | 137 |
| IMF2 | 82.08 | 84.08 | 66.92 | 89.88 | 143 |
| IMF3 | 86.67 | 86.98 | 73.34 | 93.68 | 135 |
| IMF4 | 87.59 | 85.38 | 72.59 | 96.64 | 148 |
| IMF5 | 74.47 | 78.93 | 55.77 | 83.94 | 121 |
| IMF6 | 76.34 | 74.60 | 56.91 | 90.19 | 137 |
| IMF7 | 78.25 | 75.88 | 59.18 | 91.89 | 143 |
| IMF8 | 65.84 | 66.72 | 44.37 | 81.53 | 156 |
| IMF9 | 48.12 | 49.27 | 27.14 | 68.39 | 168 |
| All IMFs | 66.92 | 65.58 | 45.91 | 84.05 | 191 |

Table 3.5 CAD identification performances (%) using IMF-based data sets on deep ELM with a Moore–Penrose kernel.

| Feature set | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|
| IMF1 | 83.65 | 82.93 | 67.57 | 93.15 | 11 |
| IMF2 | 91.35 | 88.78 | 78.37 | 98.85 | 10 |
| IMF3 | 89.39 | 88.48 | 76.81 | 96.18 | 11 |
| IMF4 | 92.04 | 93.98 | 85.82 | 94.69 | 12 |
| IMF5 | 90.47 | 88.52 | 77.54 | 97.77 | 9 |
| IMF6 | 87.20 | 87.28 | 74.03 | 94.16 | 8 |
| IMF7 | 86.36 | 85.00 | 71.35 | 95.17 | 9 |
| IMF8 | 84.44 | 81.55 | 67.35 | 95.77 | 9 |
| IMF9 | 79.27 | 78.87 | 61.27 | 90.55 | 11 |
| All IMFs | 94.86 | 94.87 | 88.50 | 97.78 | 16 |

accurate classification performance with an accuracy rate of 92.04% using IMF4 features (see Table 3.5). Deep ELM with a HessELM kernel reached the highest classification performance among the individual IMFs, with rates of 93.76%, 93.98%, 86.59%, and 97.09% for accuracy, sensitivity, specificity, and precision, respectively (see Table 3.6). IMF2 is the most informative feature set for the deep ELM with a HessELM kernel, whereas IMF4 is the most informative feature set for both DBN and deep ELM with a Moore–Penrose kernel. Considering the training time of the proposed models as a performance criterion, the deep ELM kernels are about 17 times faster than DBN. Fig. 3.8 depicts the accuracy rates of the IMF-based feature sets for the DBN and deep ELM classifiers. The IMF-based feature sets were combined into an all-IMFs feature set to analyze all statistical features from the frequency modulations simultaneously. Low- and high-frequency modulations may both have characteristic significance for the ECG waveform. Therefore, the statistical features from each IMF are used as additional features to take advantage of the instantaneous energy-time-frequency distribution. As presented in Table 3.7, the DBN classifier with four hidden layers has achieved the highest


Table 3.6 CAD identification performance (%) using IMF-based data sets on deep ELM with HessELM kernel.

| Feature set | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|
| IMF1 | 85.80 | 85.03 | 70.93 | 94.29 | 8 |
| IMF2 | 93.76 | 93.98 | 86.59 | 97.09 | 8 |
| IMF3 | 89.51 | 88.68 | 77.11 | 96.15 | 7 |
| IMF4 | 92.05 | 91.27 | 81.75 | 97.30 | 10 |
| IMF5 | 90.01 | 88.45 | 77.18 | 97.14 | 9 |
| IMF6 | 89.20 | 85.73 | 74.01 | 98.81 | 11 |
| IMF7 | 90.02 | 89.07 | 77.87 | 96.53 | 7 |
| IMF8 | 91.72 | 90.37 | 80.42 | 97.73 | 8 |
| IMF9 | 89.65 | 88.07 | 76.54 | 96.99 | 6 |
| All IMFs | 96.64 | 95.85 | 90.82 | 99.36 | 13 |

Figure 3.8 Classification accuracies on IMF-based feature sets for DBN and deep ELM kernels.

classification performance rates of 66.92%, 65.58%, 45.91%, and 84.05% for accuracy, sensitivity, specificity, and precision, respectively, among the experimented range of models. Using SFFS on the same DBN model improved the CAD identification performance to 90.12%, 89.52%, 78.44%, and 96.22% for accuracy, sensitivity, specificity, and precision, respectively. This result reveals the effect of the curse of dimensionality on DBN classifiers. The deep ELM with a Moore–Penrose kernel achieved remarkable identification performance on the statistical features from all IMFs. It separated CAD and non-CAD with performance rates of 94.86%, 94.87%, 88.50%, and 97.78% for accuracy, sensitivity, specificity, and precision, respectively (see Table 3.8). The deep ELM models with three and five hidden layers also reached effective classification performances. The most effective DL algorithm on the composed IMF-based


Table 3.7 CAD identification performances (%) depending on the model of DBN.

| Model | Number of nodes | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|---|
| 3 hidden layers | (330-170-90) | 62.95 | 57.87 | 42.64 | 84.83 | 279 |
| 3 hidden layers | (200-230-80) | 62.24 | 58.70 | 41.64 | 82.79 | 304 |
| 3 hidden layers | (180-240-220) | 61.21 | 54.45 | 41.46 | 85.28 | 288 |
| 3 hidden layers | SFFS (330-170-90) | 86.36 | 85.45 | 71.72 | 94.72 | 244 |
| 4 hidden layers | (60-140-320-100) | 66.92 | 65.58 | 45.91 | 84.05 | 367 |
| 4 hidden layers | (120-210-150-50) | 66.22 | 61.52 | 45.63 | 86.79 | 389 |
| 4 hidden layers | (220-100-120-160) | 65.01 | 61.32 | 44.31 | 84.93 | 362 |
| 4 hidden layers | SFFS (60-140-320-100) | 90.12 | 89.52 | 78.44 | 96.22 | 344 |
| 5 hidden layers | (60-100-230-70-350) | 63.33 | 58.37 | 42.96 | 84.98 | 589 |
| 5 hidden layers | (140-220-160-40-200) | 62.38 | 57.15 | 42.15 | 84.54 | 608 |
| 5 hidden layers | (70-100-260-180-80) | 58.89 | 54.28 | 38.94 | 81.26 | 623 |
| 5 hidden layers | SFFS (60-100-230-70-350) | 78.25 | 74.55 | 58.79 | 93.28 | 504 |

Table 3.8 CAD identification performances (%) depending on the model of deep ELM with Moore–Penrose kernel.

| Model | Number of nodes | Accuracy | Sensitivity | Specificity | Precision | Time (s) |
|---|---|---|---|---|---|---|
| 3 hidden layers | (100-170-210) | 90.24 | 89.62 | 78.64 | 96.29 | 10 |
| 3 hidden layers | (80-120-200) | 89.12 | 88.80 | 76.98 | 95.47 | 11 |
| 3 hidden layers | (230-160-80) | 88.26 | 88.35 | 75.90 | 94.66 | 10 |
| 3 hidden layers | SFFS (100-170-210) | 91.47 | 91.97 | 82.40 | 95.78 | 8 |
| 4 hidden layers | (140-230-80-160) | 94.86 | 94.87 | 88.50 | 97.78 | 13 |
| 4 hidden layers | (70-120-160-240) | 92.56 | 94.98 | 87.81 | 94.51 | 13 |
| 4 hidden layers | (260-200-80-170) | 91.14 | 92.88 | 83.58 | 94.47 | 14 |
| 4 hidden layers | SFFS (140-230-80-160) | 95.02 | 96.10 | 90.81 | 96.83 | 11 |
| 5 hidden layers | (230-160-210-110-70) | 91.84 | 93.68 | 85.22 | 94.69 | 15 |
| 5 hidden layers | (200-160-80-210-160) | 90.02 | 90.72 | 79.86 | 94.93 | 16 |
| 5 hidden layers | (90-110-140-160-100) | 89.71 | 90.53 | 79.43 | 94.65 | 12 |
| 5 hidden layers | SFFS (230-160-210-110-70) | 92.71 | 93.87 | 85.93 | 95.72 | 10 |

statistical features is the deep ELM with the HessELM kernel, with performance rates of 96.64%, 95.85%, 90.82%, and 99.36% for accuracy, sensitivity, specificity, and precision, respectively (see Table 3.9). Using SFFS increased the accuracy to 95.02% and 96.93% for the Moore–Penrose and HessELM kernels, respectively. These results indicate that the deep ELM models are less affected by feature size. The deep ELM models are therefore able to generate different representations through compression and sparsity, and the deep ELM autoencoder kernels have overcome the curse of dimensionality.
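As a rough illustration of the SFFS step referred to in these results, the following Python sketch shows the generic sequential floating forward selection loop: a forward step adds the feature that most improves a score, and a conditional backward step removes a feature whenever doing so beats the best subset already found at the smaller size. The `toy_score` function, the feature names, and the integer merits are hypothetical. In a wrapper setting the score would typically be a classifier's validation accuracy, which is an assumption here rather than a detail stated in this section. The sketch also assumes that `k` does not exceed the number of features.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Subset = List[str]


def sffs(features: Sequence[str],
         score: Callable[[Subset], float],
         k: int) -> Tuple[float, Subset]:
    """Generic SFFS: return (score, subset) of the best size-k subset found."""
    selected: Subset = []
    best: Dict[int, Tuple[float, Subset]] = {}  # best subset seen per size

    while len(selected) < k:
        # Forward step: add the single feature that maximizes the score.
        candidates = [f for f in features if f not in selected]
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))

        # Floating step: drop features while that strictly improves
        # the best subset recorded for the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_reduced = score(reduced)
            if s_reduced > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_reduced, list(reduced))
            else:
                break

    return best[k]


# Hypothetical toy score: individual merits minus a redundancy penalty.
MERIT = {"a": 50, "b": 45, "c": 45, "d": 10}
REDUNDANT = {frozenset(("a", "b")): 30, frozenset(("a", "c")): 30}


def toy_score(subset: Subset) -> int:
    total = sum(MERIT[f] for f in subset)
    for pair, penalty in REDUNDANT.items():
        if pair <= set(subset):
            total -= penalty
    return total


best_score, best_subset = sffs(list(MERIT), toy_score, k=3)
print(best_score, sorted(best_subset))
```

On this toy problem the floating step discards feature `a`, which the forward step picked first, once `b` and `c` together turn out to score higher without it. A purely greedy forward selection could not make that correction.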


Generalization performance of deep autoencoder kernels for identification of abnormalities on electrocardiograms

Table 3.9 CAD identification performances (%) depending on the model of deep ELM with HessELM kernel.

Model            Number of nodes            Accuracy  Sensitivity  Selectivity  Precision  Time (s)
3 hidden layers  (160-180-80)               91.28     90.63        80.51        96.81      9
                 (200-210-90)               90.45     90.35        79.66        95.88      10
                 (80-130-210)               90.11     90.20        79.26        95.53      9
                 SFFS (160-180-80)          92.18     91.72        82.43        97.04      9
4 hidden layers  (80-120-210-100)           93.54     93.12        85.13        97.62      8
                 (210-100-70-80)            91.94     91.45        81.94        96.96      11
                 (180-140-90-80)            91.18     90.62        80.42        96.67      13
                 SFFS (80-120-210-100)      95.02     93.32        86.07        99.61      8
5 hidden layers  (180-140-110-90-90)        96.64     95.85        90.82        99.36      12
                 (210-180-160-190-170)      95.69     95.38        89.69        98.47      9
                 (90-100-220-140-80)        92.02     90.25        80.45        98.31      11
                 SFFS (180-140-110-90-90)   96.93     96.03        91.23        99.60      10

Bold values are the highest achievements in accuracy for the experimented models
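For readers who want to reproduce the four rates reported in Tables 3.7 to 3.9, the sketch below computes them from the counts of a binary confusion matrix using the standard definitions. It assumes that the "Selectivity" column of the tables is the quantity the text calls specificity. The counts in the example are hypothetical and are not taken from the chapter's experiments.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Return the four rates (as percentages) from confusion-matrix counts."""
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": 100.0 * tp / (tp + fn),   # true positive rate
        "specificity": 100.0 * tn / (tn + fp),   # true negative rate
        "precision": 100.0 * tp / (tp + fp),     # positive predictive value
    }


# Hypothetical counts (CAD taken as the positive class); illustration only.
metrics = binary_metrics(tp=90, fn=10, tn=40, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With imbalanced classes, as in this hypothetical example, accuracy and sensitivity can stay high while specificity lags, which is the pattern visible in the DBN rows of Table 3.7.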

3.5 Conclusion

In this chapter, the efficiency and robustness of deep ELM and DBN classifiers were compared on short-term ECG features from subjects with CAD and non-CAD. We explored popular DL algorithms, including the DBN and the deep ELM with Moore–Penrose and HessELM kernels, in time-series analysis; in particular, we examined how ELM autoencoder kernels shorten the training time without impairing the generalization capability and classification performance of DL. This study demonstrates that DL algorithms are effective not only in computer vision but also on features obtained from time-series signals. There is existing research on deep ELM autoencoder kernels [11,12,18,22,24,30,31]. However, these deep autoencoder models rarely show how time-series signals can be analyzed using energy-time-frequency features and the raw signal separately. We therefore compared the training time and the statistical abnormality identification performance on ECG for a HessELM-based ELM autoencoder [22], a conventional ELM autoencoder, and a DBN [1]. As Table 3.10 shows, various feature extraction methods and classification algorithms have been used to identify CAD, so it is difficult to make a complete comparison of the classifiers. Lee et al. and Dua et al. separated subjects with CAD and non-CAD using HRV features, which are common diagnostics for cardiac diseases. Giri et al. applied the discrete wavelet transform to the ECG and utilized HRV measurements as additional features. They separated subjects with CAD and non-CAD with an accuracy rate of 90% using Gaussian mixture models with genetic


Table 3.10 Comparison of the related works.

Related works             Features                      Classifier                        Accuracy  Sensitivity  Selectivity
Lee et al. [57]           HRV measurements              Support vector machines           90.00     -            -
Dua et al. [58]           HRV measurements              Multilayer perceptron             89.50     -            -
Giri et al. [59]          Discrete wavelet transform    Gaussian mixture model            96.80     100.00       93.70
Arafat et al. [60]        ST measurements on ECG        Fuzzy clustering                  86.00     -            -
Alizadehsani et al. [46]  Q and ST measurements on ECG  Support vector machines           94.08     -            -
This study                HHT on ECG                    DBN                               90.12     89.52        78.44
                                                        Deep ELM (Moore–Penrose kernel)   95.02     96.10        90.81
                                                        Deep ELM (HessELM kernel)         96.93     96.03        91.23

Bold values are the highest achievements in accuracy for the experimented models

algorithms [59]. Arafat et al. analyzed morphological ST measurements on ECG and differentiated ECGs with CAD with an accuracy rate of 86% using a fuzzy clustering technique [60]. Alizadehsani et al. proposed that Q waveform features are significant when used as additional features to the morphological ST measurements in the diagnosis of CAD. They reached a classification accuracy rate of 94.08% using support vector machines [46]. The proposed DL models on HHT features achieved high classification performance. The deep ELM with the HessELM kernel achieved the highest CAD identification performance, with rates of 96.93%, 96.03%, and 91.23% for accuracy, sensitivity, and specificity, respectively. The training time for the proposed deep ELM model with five hidden layers is 10 seconds, which is a remarkable result considering the number of classification parameters. Because low feature dimensionality increases the sensitivity of DL models to the input data, compression encoding with a bottleneck model is insufficient to prevent overfitting and leads to inefficient generalization. On the other hand, although the deep ELM autoencoder can increase the feature dimensionality using a sparse representation, this can become a notable disadvantage during training, as it is for other machine learning algorithms. One of the biggest advantages of the deep ELM autoencoder kernels is that they need no epochs or iterations during training. This property forms the unsupervised stage of the deep ELM and allows the output weights to be determined quickly by simple solutions, without optimization or back-propagation. That is why the deep ELM is so fast, even for extended DL models. The ELM autoencoder kernels are adaptable methods


to predefine the classification parameters from the input data, including time series, images, and more, for detailed analysis. High generalization capacity, robustness, and fast training speed make the ELM autoencoder well suited to recent and future DL algorithms. The limitations of the study are the quantity of data and the range of deep classifier model structures that were tested. Only a limited number of ECG recordings with CAD are available online. To prove the actual efficiency of the proposed model, the system needs to be validated on many more ECG recordings. Because of the computational capability of the systems used, the tested models are limited in the number of neurons and hidden layers. We selected three, four, and five hidden layers for the DL algorithms in view of training time and modeling diversity. Enhancing the deep models with more hidden layers and more neurons in each layer would provide a more detailed analysis of the patterns. The future scope of this research is to integrate the generalization capabilities of the deep ELM models into healthcare systems to detect cardiac diseases using short-term ECG recordings.
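The epoch-free training described in this conclusion can be summarized with the standard ELM formulation. The notation below is an assumption introduced for this sketch and is not taken from the chapter's own equations: X is the input matrix, W and b are the randomly assigned hidden-layer weights and biases, g is the activation function, H is the hidden-layer output matrix, T is the target matrix, and β is the output weight matrix.

```latex
\begin{aligned}
H &= g\left(XW + \mathbf{1}b^{\top}\right), \\
\beta &= H^{\dagger} T
  \quad \text{(minimum-norm least-squares solution of } H\beta = T\text{)}, \\
H^{\dagger} &= \left(H^{\top}H\right)^{-1}H^{\top}
  \quad \text{when } H \text{ has full column rank.}
\end{aligned}
```

Because W and b are fixed after random assignment, β follows from a single linear solve, with no iterative weight updates. In an ELM autoencoder the target is the input itself (T = X). The HessELM variant of Ref. [22] is not reproduced here.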

References

[1] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, Adv. Neural Inf. Process. Syst. 19 (1) (2007) 153–160.
[2] G. Altan, Y. Kutlu, A.Ö. Pekmezci, S. Nural, Deep learning with 3D-second order difference plot on respiratory sounds, Biomed. Signal Process. Control. (2018). Available from: https://doi.org/10.1016/j.bspc.2018.05.014.
[3] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554. Available from: https://doi.org/10.1162/neco.2006.18.7.1527.
[4] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012). Available from: https://doi.org/10.1016/j.protcy.2014.09.007.
[5] H. Lee, R. Grosse, R. Ranganath, A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: Proceedings of the 26th Annual International Conference on Machine Learning ICML 09, 2008, 2009, pp. 1–8. https://doi.org/10.1145/1553374.1553453.
[6] R. Donida Labati, E. Muñoz, V. Piuri, R. Sassi, F. Scotti, Deep-ECG: convolutional neural networks for ECG biometric recognition, Pattern Recogn. Lett. (2018). Available from: https://doi.org/10.1016/j.patrec.2018.03.028.
[7] Ö. Yıldırım, P. Pławiak, R.S. Tan, U.R. Acharya, Arrhythmia detection using deep convolutional neural network with long duration ECG signals, Comput. Biol. Med. (2018). Available from: https://doi.org/10.1016/j.compbiomed.2018.09.009.
[8] G. Altan, Y. Kutlu, A.Ö. Pekmezci, A. Yayık, Diagnosis of chronic obstructive pulmonary disease using deep extreme learning machines with LU autoencoder kernel, in: 7th International Conference on Advanced Technologies (ICAT’18), Antalya, 2018, pp. 618–622.
[9] G. Altan, Y. Kutlu, N. Allahverdi, A multistage deep belief networks application on arrhythmia classification, Int. J. Intell. Syst. Appl. Eng. 4 (Special Issue 1) (2016) 222–228. Available from: https://doi.org/10.18201/IJISAE.270367.
[10] N. Lopes, B. Ribeiro, Towards adaptive learning with improved convergence of deep belief networks on graphics processing units, Pattern Recogn. 47 (1) (2014) 114–127. Available from: https://doi.org/10.1016/j.patcog.2013.06.029.


[11] J. Tang, C. Deng, G.-B. Huang, Extreme learning machine for multilayer perceptron, IEEE Trans. Neural Netw. Learn. Syst. 27 (4) (2016) 809–821. Available from: https://doi.org/10.1109/TNNLS.2015.2424995.
[12] M.D. Tissera, M.D. McDonnell, Deep extreme learning machines: Supervised autoencoding architecture for classification, Neurocomputing 174 (Part A) (2016) 42–49. Available from: https://doi.org/10.1016/j.neucom.2015.03.110.
[13] K. Lan, D.T. Wang, S. Fong, L.S. Liu, K.K.L. Wong, N. Dey, A survey of data mining and deep learning in bioinformatics, J. Med. Syst. (2018). Available from: https://doi.org/10.1007/s10916-018-1003-9.
[14] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: 25th International Conference on Machine Learning, 2008, pp. 1096–1103. https://doi.org/10.1145/1390156.1390294.
[15] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science. (2006). Available from: https://doi.org/10.1126/science.1127647.
[16] X. Cheng, H. Liu, X. Xu, F. Sun, Denoising deep extreme learning machine for sparse representation, Memetic Comput. (2017). Available from: https://doi.org/10.1007/s12293-016-0185-2.
[17] K. Sun, J. Zhang, C. Zhang, J. Hu, Generalized extreme learning machine autoencoder and a new deep neural network, Neurocomputing. (2017). Available from: https://doi.org/10.1016/j.neucom.2016.12.027.
[18] Z. Yin, J. Zhang, Task-generic mental fatigue recognition based on neurophysiological signals and dynamical deep extreme learning machine, Neurocomputing 283 (2018) 266–281. Available from: https://doi.org/10.1016/j.neucom.2017.12.062.
[19] Y. Gu, Y. Chen, J. Liu, X. Jiang, Semi-supervised deep extreme learning machine for Wi-Fi based localization, Neurocomputing. (2015). Available from: https://doi.org/10.1016/j.neucom.2015.04.011.
[20] W. Yu, F. Zhuang, Q. He, Z. Shi, Learning deep representations via extreme learning machines, Neurocomputing. (2015). Available from: https://doi.org/10.1016/j.neucom.2014.03.077.
[21] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, J. Mach. Learn. Res. 27 (2012) 37–49.
[22] G. Altan, Y. Kutlu, Hessenberg Elm autoencoder kernel for deep learning, J. Eng. Technol. Appl. Sci. 3 (2) (2018) 141–151. Available from: https://doi.org/10.30931/jetas.450252.
[23] W. Wang, Y. Huang, Y. Wang, L. Wang, Generalized autoencoder: a neural network framework for dimensionality reduction, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2014. https://doi.org/10.1109/CVPRW.2014.79.
[24] G. Altan, Y. Kutlu, Generative autoencoder kernels on deep learning for brain activity analysis, Nat. Eng. Sci. 3 (3) (2018) 311–322. Available from: https://doi.org/10.28978/nesciences.468978.
[25] F. Tian, B. Gao, Q. Cui, E. Chen, T. Liu, Learning deep representations for graph clustering, in: Proc. of AAAI, 2015, pp. 1293–1299.
[26] E. Cambria, G.-B. Huang, L.L.C. Kasun, H. Zhou, C.M. Vong, J. Lin, et al., Representational learning with ELMs for big data, IEEE Intell. Syst. (2013). Available from: https://doi.org/10.1109/MIS.2013.140.
[27] G.B. Huang, L. Chen, C.K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. (2006). Available from: https://doi.org/10.1109/TNN.2006.875977.
[28] G.-B. Huang, Q. Zhu, C. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, IEEE Int. J. Conf. Neural Netw. 2 (2004) 985–990. Available from: https://doi.org/10.1109/IJCNN.2004.1380068.
[29] G.B. Huang, An insight into extreme learning machines: random neurons, random features and kernels, Cognit. Comput. (2014). Available from: https://doi.org/10.1007/s12559-014-9255-2.
[30] L.L.C. Kasun, H.M. Zhou, G.B. Huang, C.M. Vong, Representational Learning with ELMs for Big Data, IEEE Intell. Syst. 28 (6) (2013) 31–34.
[31] G. Song, Q. Dai, A novel double deep ELMs ensemble system for time series forecasting, Knowl.-Based Syst. 134 (2017) 31–49. Available from: https://doi.org/10.1016/j.knosys.2017.07.014.


[32] G.H. Golub, C.F. Van Loan, The Hessenberg and Real Schur Forms, 7.4 in Matrix Computations, third ed., Johns Hopkins University, Baltimore, 1996.
[33] S.D. Cagle, N. Cooperstein, Coronary artery disease, Prim. Care 45 (1) (2018) 45–61. Available from: https://doi.org/10.1016/j.pop.2017.10.001.
[34] I. Kriszbacher, M. Koppán, J. Bódis, Inflammation, atherosclerosis, and coronary artery disease, N. Engl. J. Med. 352 (16) (2005) 1685–1695.
[35] K. Polat, S. Güneş, A hybrid approach to medical decision support systems: combining feature selection, fuzzy weighted pre-processing and AIRS, Comput. Methods Prog. Biomed. (2007). Available from: https://doi.org/10.1016/j.cmpb.2007.07.013.
[36] I. Babaoglu, O. Findik, E. Ülker, A comparison of feature selection models utilizing binary particle swarm optimization and genetic algorithm in determining coronary artery disease using support vector machine, Expert Syst. Appl. (2010). Available from: https://doi.org/10.1016/j.eswa.2009.09.064.
[37] Y.N. Devi, S. Anto, An evolutionary-fuzzy expert system for the diagnosis of coronary artery disease, Int. J. Adv. Res. Comput. Eng. Technol. 3 (4) (2014) 1478–1484.
[38] N. Ghadiri Hedeshi, M. Saniee Abadeh, Coronary artery disease detection using a fuzzy-boosting PSO approach, Comput. Intell. Neurosci. (2014). Available from: https://doi.org/10.1155/2014/783734.
[39] M.G. Tsipouras, T.P. Exarchos, D.I. Fotiadis, A.P. Kotsia, K.V. Vakalis, K.K. Naka, et al., Automated diagnosis of coronary artery disease based on data mining and fuzzy modeling, IEEE Trans. Inf. Technol. Biomed. (2008). Available from: https://doi.org/10.1109/TITB.2007.907985.
[40] M. Akay, W. Welkowitz, J.L. Semmlow, J. Kostis, Application of the ARMA method to acoustic detection of coronary artery disease, Med. & Biol. Eng. Comput. (1991). Available from: https://doi.org/10.1007/BF02441656.
[41] A. Abdolmanafi, L. Duong, N. Dahdah, I.R. Adib, F. Cheriet, Characterization of coronary artery pathological formations from OCT imaging using deep learning, Biomed. Opt. Express (2018). Available from: https://doi.org/10.1364/BOE.9.004936.
[42] Y. Özbay, G. Tezel, A new method for classification of ECG arrhythmias using neural network with adaptive activation function, Digit. Signal Process. (2010). Available from: https://doi.org/10.1016/j.dsp.2009.10.016.
[43] G. Altan, Y. Kutlu, N. Allahverdi, A new approach to early diagnosis of congestive heart failure disease by using Hilbert–Huang transform, Comput. Methods Prog. Biomed. 137 (2016) 23–34. Available from: https://doi.org/10.1016/J.CMPB.2016.09.003.
[44] N. Dey, A.S. Ashour, F. Shi, S.J. Fong, R.S. Sherratt, Developing residential wireless sensor networks for ECG healthcare monitoring, IEEE Trans. Consum. Electron. (2017). Available from: https://doi.org/10.1109/TCE.2017.015063.
[45] U.R. Acharya, O. Faust, V. Sree, G. Swapna, R.J. Martis, N.A. Kadri, et al., Linear and nonlinear analysis of normal and CAD-affected heart rate signals, Comput. Methods Prog. Biomed. (2014). Available from: https://doi.org/10.1016/j.cmpb.2013.08.017.
[46] R. Alizadehsani, J. Habibi, M.J. Hosseini, H. Mashayekhi, R. Boghrati, A. Ghandeharioun, et al., A data mining approach for diagnosis of coronary artery disease, Comput. Methods Prog. Biomed. (2013). Available from: https://doi.org/10.1016/j.cmpb.2013.03.004.
[47] M.G. Poddar, A.C. Birajdar, J. Virmani, Kriti, Automated classification of hypertension and coronary artery disease patients by PNN, KNN, and SVM classifiers using HRV analysis, in: Machine Learning in Bio-Signal Analysis and Diagnostic Imaging, 2019. https://doi.org/10.1016/b978-0-12-816086-2.00005-9.
[48] A.L. Goldberger, L.A.N. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, et al., PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (2000) 215–220. Available from: https://doi.org/10.1161/01.CIR.101.23.e215.
[49] F. Jager, A. Taddei, G.B. Moody, M. Emdin, G. Antolič, R. Dorn, et al., Long-term ST database: a reference for the development and evaluation of automated ischaemia detectors and for the study of the dynamics of myocardial ischaemia, Med. Biol. Eng. Comput. (2003). Available from: https://doi.org/10.1007/BF02344885.


[50] N.E. Huang, Z. Shen, S.R. Long, M.C. Wu, H.H. Shih, Q. Zheng, et al., The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, Proc. R. Soc. A. Math. Phys. Eng. Sci. 454 (1971) (1998) 903–995. Available from: https://doi.org/10.1098/rspa.1998.0193.
[51] N.E. Huang, Introduction to the Hilbert–Huang transform, Transform 5 (2005) 1–26. Available from: https://doi.org/10.1142/9789812703347_0001.
[52] N.E. Huang, Z. Wu, A review on Hilbert-Huang transform: method and its applications, October 46 (2007) (2008) 1–23. Available from: https://doi.org/10.1029/2007RG000228.
[53] S.R. Long, N.E. Huang, C.C. Tung, M.L. Wu, R.Q. Lin, E. Mollo-Christensen, et al., The Hilbert techniques: an alternate approach for non-steady time series analysis, IEEE Geosci. Remote. Sens. Soc. Lett. 3 (1995) 6–11.
[54] N.E. Huang, Z. Wu, Ensemble empirical mode decomposition: a noise-assisted data analysis method, Adv. Adapt. Data Anal. (2009). Available from: https://doi.org/10.1142/S1793536909000047.
[55] T.-T. Wong, Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation, Pattern Recogn. 48 (9) (2015) 2839–2846. Available from: https://doi.org/10.1016/j.patcog.2015.03.009.
[56] J.S. Kirar, R.K. Agrawal, Relevant feature selection from a combination of spectral-temporal and spatial features for classification of motor imagery EEG, J. Med. Syst. (2018). Available from: https://doi.org/10.1007/s10916-018-0931-8.
[57] H.G. Lee, K.Y. Noh, K.H. Ryu, Mining biosignal data: coronary artery disease diagnosis using linear and nonlinear features of HRV, in: Emerging Technologies in Knowledge Discovery and Data Mining, 2007. https://doi.org/10.1007/978-3-540-77018-3_23.
[58] S. Dua, X. Du, S.V. Sree, T.A. Vi, Novel classification of coronary artery disease using heart rate variability analysis, J. Mech. Med. Biol. (2012). Available from: https://doi.org/10.1142/s0219519412400179.
[59] D. Giri, U. Rajendra Acharya, R.J. Martis, S. Vinitha Sree, T.C. Lim, T. Ahamed, et al., Automated diagnosis of Coronary Artery Disease affected patients using LDA, PCA, ICA and discrete wavelet transform, Knowl.-Based Syst. (2013). Available from: https://doi.org/10.1016/j.knosys.2012.08.011.
[60] S. Arafat, M. Dohrmann, M. Skubic, Classification of coronary artery disease stress ECGs using uncertainty modeling, in: ICSC Congress on Computational Intelligence Methods and Applications, 2006, pp. 1–4. https://doi.org/10.1109/cima.2005.1662362.

CHAPTER FOUR

Deep learning for early diagnosis of Alzheimer’s disease: a contribution and a brief review

Iago Richard Rodrigues da Silva1, Gabriela dos Santos Lucas e Silva2, Rodrigo Gomes de Souza3, Maíra Araújo de Santana2, Washington Wagner Azevedo da Silva2, Manoel Eusébio de Lima3, Ricardo Emmanuel de Souza2, Roberta Fagundes1 and Wellington Pinheiro dos Santos2

1 Polytechnic School of Pernambuco, University of Pernambuco, UPE, Recife, Brazil
2 Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil
3 Center for Informatics, Federal University of Pernambuco, UFPE, Recife, Brazil

4.1 Introduction

Alzheimer’s disease (AD) is an irreversible, chronic neurodegenerative disease that results in loss of mental function due to the deterioration of brain tissue. According to the World Health Organization, AD is the most common cause of dementia, and its main risk factor is age; most cases occur in people over 65 years old. The disease presents as a progressive loss of behavioral and intellectual abilities; decline in memory, language, and perception are some of the problems [1]. Researchers continue to study the pathology of AD because it is not yet fully understood. Neurochemical changes are among the defining characteristics of the disease: the presence of neurofibrillary tangles and the accumulation of amyloid plaques characterize AD. However, these changes can be misleading, since they are also found in healthy brains, only in smaller quantities, and it is not yet known whether they are a cause or a consequence of the disease. Neuroanatomical changes are also associated with AD: atrophy of cortical regions of the encephalon and of the hippocampal region, accompanied by ventricular enlargement, is very common in patients diagnosed with the disease. Although medicine is not yet able to slow, lessen, or interrupt the degenerative process that leads to the death of cells in AD patients, some therapeutic interventions and treatments can alleviate the symptoms of the disease and give patients a better quality of life. Although the pathological diagnosis of AD is performed only by autopsy, accurate clinical diagnosis is essential, especially in the initial stage, since, in

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00005-3

© 2020 Elsevier Inc. All rights reserved.



addition to allowing the early initiation of available treatments, it also helps researchers looking for new treatments by reducing the time and cost of clinical trials. Early diagnosis of AD may also help patients maintain their quality of life by adopting and strengthening healthy habits and by avoiding exposure to situations that may pose a risk or harm the treatment. Because they can track the degree of neurodegeneration caused by AD, clinical imaging exams have been widely used as data sources in studies that seek to automate the diagnosis. Two modalities stand out in this role: structural magnetic resonance imaging (sMRI), which is better suited to observing alterations in cerebral structure, and positron emission tomography (PET), which is more appropriate for detecting cerebral metabolic changes. Magnetic resonance imaging (MRI) is essential for the diagnosis of AD [2–8]. Specialists in neurology make the diagnosis by analyzing image or signal [9] examinations. Currently, specialists use clinical applications that assist them in making decisions. Applications of this type frequently use computational intelligence [2–8] and can determine with high accuracy whether the patient is healthy or has the disease. Several computational intelligence techniques are currently used for the classification of AD. Among these techniques [10,11], we highlight deep learning as an approach that has been used extensively in AD diagnosis in recent years [12–14]. Machine learning techniques have been applied successfully to the task of automatically distinguishing Alzheimer’s patients from healthy individuals. However, due to the silent nature of the disease, in which the degenerative process begins many years before the first symptoms, the classification problem had to be remodeled in a more sophisticated way.
In this new model, it has also become essential to identify groups of patients who present symptoms of mild cognitive impairment (MCI), or who have not yet shown symptoms of cognitive impairment but already show biomarkers compatible with the disease, a condition known as preclinical Alzheimer’s. This encouraged the development of new studies that use classification models with three or four classes. The increased impact of the disease, especially in developed countries where the average age of the population is higher, has led to the establishment of global initiatives for the development of biomarkers, such as ADNI (Alzheimer’s Disease Neuroimaging Initiative) and AIBL (Australian Imaging, Biomarker and Lifestyle Flagship Study of Aging). These initiatives made available a large amount of longitudinal clinical and demographic data that machine learning studies could use to develop more accurate multiclass models for early diagnosis. There have also been several advances in machine learning. A considerable number of them are related to the emergence of a new type of model known as a deep learning network, which offered, along with other key features,


the ability to manipulate large sets of data. Together, these innovations have given rise to several studies that aimed to classify AD using deep learning. The main objective of this chapter is therefore to identify the main works in order to raise and, where possible, answer questions such as: which deep learning network architectures are currently used for the diagnosis of AD? Has the use of such networks brought a gain in accuracy, and what are their limitations? In this chapter, we present a very brief review of the state of the art and propose a model for binary AD diagnosis based on deep feature extraction for classification using MRI. The goal is to distinguish patients who have AD from those who do not with high precision. The model uses features generated by a CNN together with a second technique that classifies these features. First, the CNN is pretrained to obtain the best features for classification. Then, supervised machine learning algorithms are applied to these features for the classification task. The use of other algorithms aims to maximize the accuracy of the model.
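The two-stage idea described here (convolutional features first, a separate supervised classifier second) can be sketched in a few lines of plain Python. Everything in the sketch is a stand-in chosen for illustration: the two fixed 2x2 filters replace the filters a pretrained CNN would learn, the 1-nearest-neighbor rule replaces whichever supervised algorithms the chapter evaluates, and the 4x4 toy images and their labels are made up and are not MRI data.

```python
from typing import List, Sequence

Image = List[List[float]]


def conv2d_valid(image: Image, kernel: Image) -> Image:
    """Valid-mode 2D cross-correlation, the core operation of a conv layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def extract_features(image: Image, kernels: Sequence[Image]) -> List[float]:
    """One feature per filter: ReLU followed by global average pooling."""
    features = []
    for kernel in kernels:
        fmap = conv2d_valid(image, kernel)
        activated = [max(0.0, v) for row in fmap for v in row]
        features.append(sum(activated) / len(activated))
    return features


def predict_1nn(train_x: List[List[float]], train_y: List[str],
                x: List[float]) -> str:
    """Label of the training feature vector closest to x (squared Euclidean)."""
    def dist(a: List[float], b: List[float]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    nearest = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[nearest]


# Hypothetical fixed filters standing in for filters a pretrained CNN learns.
KERNELS = [
    [[1.0, -1.0], [1.0, -1.0]],   # responds to vertical edges
    [[1.0, 1.0], [-1.0, -1.0]],   # responds to horizontal edges
]

# Toy 4x4 "images" with made-up labels; not MRI data.
vertical = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
horizontal = [[1.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]
query = [[1.0, 1.0, 1.0, 0.0] for _ in range(4)]

train_x = [extract_features(img, KERNELS) for img in (vertical, horizontal)]
train_y = ["vertical", "horizontal"]
prediction = predict_1nn(train_x, train_y, extract_features(query, KERNELS))
print(prediction)
```

The design point is the separation of concerns: once the feature extractor is fixed, any supervised classifier can be swapped in on top of the same feature vectors, and the choice of classifier can be tuned independently of the network.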

4.2 Literature review

The use of machine learning has significantly improved image-based diagnosis in several cases, not only in AD diagnosis [2–8] but also in breast cancer [15–26] and multiple sclerosis diagnosis [27], as well as in several other applications. For early AD diagnosis using anatomical MRI analysis, both binary and three-class approaches can be used. Here we investigate the binary approach.

4.2.1 Alzheimer’s disease binary classification

The classification of AD is a frequent target of researchers in the literature. In general, the methods for this purpose consist of three main steps: image acquisition, feature extraction, and pattern recognition. In this section, we present some state-of-the-art works that use this approach. The work in Ref. [28] proposes an approach for extracting features from MRI images. The method, called hybrid forward sequential selection (HFS), identifies regions of the brain that are correlated with the disease. In addition, the statistical technique of principal component analysis (PCA) is used to reduce the number of features. The work reports an accuracy of 0.912 and a sensitivity and specificity of 0.886 and 0.923, respectively. Another work, Ref. [29], uses data obtained from the MRI voxels in the feature extraction step. The voxels used are named volumes of interest (VOIs) and are selected from the gray matter (GM) segmentation. The generated features are put into a


genetic algorithm that ranks them by relevance and correlation. Finally, a support vector machine (SVM) classifier is trained on the ranked data. The work reports an accuracy of 0.9301 and a sensitivity and specificity of 0.8913 and 0.9680, respectively. The work in Ref. [30] proposes an approach that extracts features from multimodal data. The data are extracted from MRI and fluorodeoxyglucose positron emission tomography (FDG-PET) images. The authors also use cerebrospinal fluid (CSF) data and genetic data from patients. All this information is combined to train a random forest algorithm. The work reports an accuracy of 0.862 and a sensitivity and specificity of 0.851 and 0.861, respectively.

4.2.2 Alzheimer’s disease binary classification using a deep learning approach

Approaches based on deep learning are currently among the most widely used, and the same holds for AD classification. Deep networks stand out in this context because of their high capacity for generalization. Their use in classification follows a few main strands: CNNs, deep multilayer perceptrons (MLPs), and autoencoders [31]. While the MLP stands out in classification, CNNs and autoencoders stand out in feature extraction. The use of these techniques is not limited to MRI images; they are also applied to PET scan images [32] and electroencephalography signals [33], where they have achieved good results. The MLP approach follows the same guidelines as the methods mentioned above. In this network, the inputs are usually the feature arrays produced by feature extraction. The network itself has a complex structure containing many layers with many neurons [34]. Training proceeds over epochs, during which the network learns and updates the weights of all neurons. In Ref. [12], the authors use a deep MLP for classification. The features were generated through spectral matching. The reported accuracy is 0.84, and the sensitivity and specificity are 0.73 and 0.89, respectively. Another type of deep network is the autoencoder. Autoencoders work by reproducing their input at their output. The purpose of this type of network is to learn encodings of the features [35]. The approach in Ref. [36] detects AD using autoencoders to extract information from the data. In the classification step, softmax regression is used, which relies on the softmax activation function for multiclass classification [37]. The model presented an accuracy, sensitivity, and specificity of 0.8776, 0.8857, and 0.8722, respectively.
The same authors applied the same neural network structure in another work [38]. The difference is the use of MRI images along with PET. Accuracy and sensitivity rose to 0.8792 and 0.9366, respectively, while specificity remained at 0.8722. Finally, another type of deep network used for AD classification is the CNN. A CNN learns the filters of its convolutional layers in order to extract features (more details

Deep learning for early diagnosis of Alzheimer’s disease: a contribution and a brief review

in Section 4.3.2). The work Ref. [13] uses an approach to AD classification using CNN. First, all images in the database corresponding to the upper brain are selected. These images are then used as input to the network. The learning is performed, and classification is done through the sigmoid activation function. The work reached an accuracy of 0.8485, while sensitivity and specificity equal 0.8593 and 0.8314, respectively. A similar approach was proposed in Ref. [14], where a CNN was also used. The difference is the data input and the robustness of the network. CNN data entry was previously selected landmarks in the database, while the network structure had more layers. In turn, the data classification was performed through the softmax activation function. About the metrics mentioned above, the model reached 0.9275, 0.9348, and 0.9130, respectively.
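The two output activations mentioned in this review, sigmoid for the binary case and softmax for the multiclass case, can be sketched in a few lines. This is a minimal illustration in plain Python, not the code of any cited work; the function names are chosen here for the example.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    # Two branches so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    """Map a list of class scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For two classes the two functions agree: `softmax([z, 0.0])[0]` equals `sigmoid(z)`, which is why a single sigmoid output is enough for an AD versus HC decision.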

4.3 Methods In this section, we discuss the methods used in this work: the data set, data preprocessing, the proposed CNN, how we obtain the features from the CNN, data preparation for the classifiers, and classification. The pipeline of these methods is illustrated in Fig. 4.1. The following subsections explain each stage of the pipeline.

4.3.1 Data acquisition and preprocessing We used all data from the MIRIAD (Minimal Interval Resonance Imaging in Alzheimer’s Disease) data set [39] from University College London (UCL). This data set contains MRI of 69 persons divided into two groups: the first with 23 healthy patients (healthy controls, or HC), and the second with 46 patients with AD. The data set therefore provides two classes for binary classification, and in this work we investigate the application of the proposed model in this context. The data set is longitudinal: the patients underwent image acquisition in seven visits over 52 weeks. In total, 39 patients completed the visits, and among these, 22 patients were sent for an additional exam. The data set contains 708 completed exams in total. Table 4.1 shows the demographic data of the data set.

Figure 4.1 Block diagram describing the general methodology.

Iago Richard Rodrigues da Silva et al.

Table 4.1 Demographic data of the MIRIAD data set.
Class | Male/female | Age (mean ± std)
NC | 12/11 | 69.7 ± 7.2
AD | 19/27 | 69.4 ± 7.1

Figure 4.2 Procedure for acquisition of magnetic resonance slices.

Each examination is a file containing images of the patient’s brain as a three-dimensional volume. First, we converted these files into two-dimensional images (slices). The choice of slices is essential for pattern recognition in medical images: if the slices are selected incorrectly, the learning of the algorithms may be impaired. Therefore, for each examination we selected 30 slices in the axial plane, located above the eyes. Fig. 4.2 illustrates this procedure. After acquiring the slices, we normalized the images to values between 0 and 1 to suit the sigmoid units present in the CNN, which completes the preprocessing step.
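The normalization step can be illustrated as follows. The chapter only states that intensities are mapped to the range 0 to 1; the per-slice min-max scaling used here, and the function name, are assumptions made for this sketch.

```python
def min_max_normalize(slice_2d):
    """Scale a 2D slice (list of rows of intensities) to the range [0, 1].

    Assumption: per-slice min-max scaling. The source does not say which
    normalization scheme was actually used.
    """
    flat = [value for row in slice_2d for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant slice carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in slice_2d]
    scale = hi - lo
    return [[(value - lo) / scale for value in row] for row in slice_2d]
```

For example, `min_max_normalize([[0, 50], [100, 200]])` maps the darkest pixel to 0.0 and the brightest to 1.0, with the others in between.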

4.3.2 Convolutional neural network training and feature extraction With a CNN, it is not necessary to perform image segmentation, so we did not segment the images beforehand. Other image classification methods need this step to remove possible noise and to binarize the image [40]. A CNN works by extracting the most discriminative features present in an image, guided by the class labels. In our model, the slices obtained are fed directly to the CNN, and feature extraction is then performed on all the slices. Fig. 4.3 shows the CNN architecture used in this work. We define three convolutional layers, each applying 3 × 3 filters. These layers are responsible for extracting the features present in an image [41]. At each layer, we double the number of filters: 32 in the first layer, 64 in the second, and 128 in the third. With each successive layer, the number of features becomes lower, causing loss of information. Because of this, only three convolutional layers were defined. During network training, the filters are automatically adjusted to activate on the most relevant features. The activation function used in these layers is the rectified linear unit (ReLU).

Figure 4.3 CNN architecture used in this work.

Pooling layers follow the convolutional layers [42]. In pooling, a single value replaces the values belonging to a region of the feature map, which decreases the number of attributes passed to the next convolutional layer. In the pooling layers, we use the max() function with a 2 × 2 window to keep the highest value of each region. After each pooling layer, we add a dropout layer. Dropout layers ignore a certain fraction of the neurons in a network [43] during the training phase. The goal of adding these layers is to reduce the risk of overfitting in the CNN, forcing the deep network to learn more robust features. The parameter used in the three dropout layers was 0.20; this is the rate of neurons to be ignored.

In the fully connected layer, we use only one neuron, followed by the sigmoid activation function. Because this is a binary classification problem, this activation function is well suited to it. The input to the fully connected layer is generated from the last convolutional layer, which has 128 filters: the data are reshaped by a flatten operation into a single array with 512 elements. The model is pretrained for n epochs to learn the best features.

Figure 4.4 The process of classification of extracted features. First, the features of the images are collected by CNN. Then the features are put into classifiers.

Upon completion of CNN training, the flattened data from each image are placed in a multidimensional array. These data are used for the learning step with another computational intelligence algorithm.
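The relation between the three convolution and pooling stages and the 512-element flattened array can be checked with simple shape arithmetic. The chapter gives the filter size (3 × 3), the pooling window (2 × 2), the filter counts (32, 64, 128) and the flattened length (512), but not the input size, stride, or padding. The sketch below therefore assumes stride-1 convolutions and treats the input size and padding mode as hypothetical parameters.

```python
def conv_out(size, kernel=3, padding="valid"):
    """Spatial size after a stride-1 convolution (assumed stride)."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window


def trace_shapes(input_size, filters=(32, 64, 128), padding="valid"):
    """Return (height, width, channels) after each conv + pool block."""
    shapes = []
    size = input_size
    for n_filters in filters:
        size = pool_out(conv_out(size, padding=padding))
        shapes.append((size, size, n_filters))
    return shapes


def flatten_length(shape):
    """Number of elements produced by flattening a (h, w, c) feature map."""
    height, width, channels = shape
    return height * width * channels
```

Under these assumptions, a 2 × 2 × 128 final feature map is what yields 512 elements. A hypothetical 30 × 30 input without padding, or a 16 × 16 input with "same" padding, would both produce it; neither size is stated in the source.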

4.3.3 Training and classification with other algorithms We apply this step to maximize the accuracy and the other metrics of the model. To this end, we selected three classification algorithms used in AD classification: random forest [44], SVM [29], and k-nearest neighbors (k-NN) [45]. Fig. 4.4 illustrates this step.
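As an illustration of this second stage, the simplest of the three classifiers, k-NN, can be written directly over feature vectors such as those produced by the CNN. This is a minimal sketch in plain Python, not the implementation used in the chapter; the distance measure (squared Euclidean) and the tie-breaking rule are assumptions.

```python
from collections import Counter


def knn_predict(train_features, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest
    training vectors, using squared Euclidean distance (an assumption)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(vector, query)), label)
        for vector, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    # On a tied vote, most_common keeps the label that was counted first,
    # i.e. the label of the nearest neighbor among the tied classes.
    return votes.most_common(1)[0][0]
```

A toy call with two-dimensional vectors, where real inputs would be the 512-element feature arrays: `knn_predict([[0.0, 0.0], [1.0, 1.0]], ["HC", "AD"], [0.9, 1.0], k=1)` returns `"AD"`.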

4.4 Experiments and results The first subsection presents the configuration of the experiments. The following subsection presents the results obtained.

4.4.1 Experimental settings We validate our method on the Alzheimer’s versus healthy controls (AD vs HC) classification task. We partitioned the MIRIAD database for CNN pretraining according to the holdout method: 80% of the data is used for training, and the remaining 20% is used to test the model. Finally, the network is pretrained for 50 epochs to generate the feature array.
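An 80/20 holdout partition of this kind can be sketched as below. The chapter does not say whether the split was made per slice, per exam, or per patient, nor which random seed was used; the helper name, the seed, and the use of patient identifiers in the example are assumptions.

```python
import random


def holdout_split(items, train_fraction=0.8, seed=0):
    """Shuffle `items` and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage: split the 69 subjects by identifier, so that all
# slices from one subject end up on the same side of the split.
train_ids, test_ids = holdout_split(range(69))
```

Splitting by subject rather than by slice is a common precaution with longitudinal data, because slices from the same person are highly correlated.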

Deep learning for early diagnosis of Alzheimer’s disease: a contribution and a brief review

Table 4.2 The principal parameters used in the learning algorithms.
Algorithm | Parameter | Values used
Random forest | Iterations | 50, 100, and 150
SVM linear kernel | - | -
SVM RBF kernel | Gamma (γ) | 0.04, 0.07, and 0.10
k-NN | K | 1, 4, and 7
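To make the role of the gamma parameter in Table 4.2 concrete, the two SVM kernels can be written out as below. This sketch assumes the usual convention K(x, y) = exp(-γ‖x − y‖²) for the RBF kernel; the chapter does not state which SVM implementation or kernel convention was used.

```python
import math


def linear_kernel(x, y):
    """Plain dot product between two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2), under the assumed convention."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

With this convention, a larger gamma makes the similarity decay faster with distance, so the three values in Table 4.2 (0.04, 0.07, and 0.10) correspond to progressively more local decision boundaries.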

The feature array is then used for a new learning stage with the aforementioned algorithms. Table 4.2 shows the parameters used in each algorithm. This array is validated by cross-validation with K = 10 folds, so 10 executions are performed to evaluate the model. We adopt metrics to evaluate the prediction results. These depend on the numbers of correctly classified positive (TP) and negative (TN) samples, as well as the numbers of false positives (FP) and false negatives (FN). The metrics chosen to evaluate our model were accuracy (ACC), sensitivity (SEN), specificity (SPE), and the area under the receiver operating characteristic (ROC) curve (AUC). Eqs. (4.1) to (4.3) give the formulas for the first three metrics:

\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.1} \]

\[ \mathrm{SEN} = \frac{TP}{TP + FN} \tag{4.2} \]

\[ \mathrm{SPE} = \frac{TN}{TN + FP} \tag{4.3} \]
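Eqs. (4.1) to (4.3) translate directly into code. The function below is a minimal sketch, and the confusion-matrix counts in the usage line are made-up numbers for illustration, not results from the chapter.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),  # Eq. (4.1)
        "SEN": tp / (tp + fn),                   # Eq. (4.2)
        "SPE": tn / (tn + fp),                   # Eq. (4.3)
    }


# Hypothetical counts: 100 AD samples (90 detected) and 50 HC samples
# (40 correctly rejected).
metrics = binary_metrics(tp=90, tn=40, fp=10, fn=10)
```

With these illustrative counts, SEN is 0.9, SPE is 0.8, and ACC is 130/150, which lies between the two class-wise rates.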

4.4.2 Classification results After applying the proposed method with the experimental configurations defined above, we obtained very close results in several metrics. Table 4.3 shows the results obtained. Analyzing the results, we observe some contrasts. The first occurs with the random forest algorithm. It achieved a high ACC of 0.8832 and an AUC of 0.9680, using 150 iterations. There is a slight discrepancy between the two metrics because the hit rates of the two classes are unbalanced. We reach this conclusion from the values obtained for SEN and SPE; with 100 iterations, for instance, SEN equals 0.8585 and SPE equals 0.9438. That is, there is a higher hit rate for true negatives (patients who do not have AD). This makes random forest the algorithm with the highest AUC. We also observe that the number of iterations has little influence on the metrics analyzed: the values differ by less than 0.02 across the three settings.
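The link between ACC and the class-wise hit rates can be made explicit. The short derivation below uses only the definitions in Eqs. (4.1) to (4.3); the symbol π, the fraction of positive (AD) samples, is introduced here for the derivation and does not appear in the chapter.

```latex
\begin{aligned}
\pi &= \frac{TP + FN}{TP + TN + FP + FN}, \\
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN}
  = \pi \,\frac{TP}{TP + FN} + (1 - \pi)\,\frac{TN}{TN + FP}
  = \pi\,\mathrm{SEN} + (1 - \pi)\,\mathrm{SPE}.
\end{aligned}
```

For a single confusion matrix, ACC is therefore a weighted average of SEN and SPE that leans toward the hit rate of the larger class. Values averaged over cross-validation folds need not satisfy the identity exactly.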

Table 4.3 Results obtained with the experiments of AD versus HC classification.
Algorithm | ACC | AUC | SEN | SPE
Random forest, 50 iterations | 0.8700 | 0.9570 | 0.8547 | 0.9400
Random forest, 100 iterations | 0.8800 | 0.9650 | 0.8585 | 0.9438
Random forest, 150 iterations | 0.8832 | 0.9680 | 0.8657 | 0.9567
SVM, linear | 0.9508 | 0.9460 | 0.9636 | 0.9264
SVM, RBF, γ = 0.04 | 0.9552 | 0.9552 | 0.9448 | 0.9500
SVM, RBF, γ = 0.07 | 0.9583 | 0.9583 | 0.9495 | 0.9539
SVM, RBF, γ = 0.10 | 0.9607 | 0.9569 | 0.9607 | 0.9531
k-NN, K = 1 | 0.8512 | 0.8750 | 0.9650 | 0.7142
k-NN, K = 4 | 0.8745 | 0.9440 | 0.9623 | 0.7672
k-NN, K = 7 | 0.8118 | 0.9520 | 0.9836 | 0.6599

The second contrast is with the SVM algorithm. The radial basis function (RBF) kernel obtained superior results in some metrics compared to the linear kernel. The linear kernel obtained a high ACC among the algorithms under analysis, equal to 0.9508, while the RBF kernel with γ = 0.10 obtained the highest ACC, equal to 0.9607. Still, the other two algorithms obtained better results in other metrics. The AUC of 0.9569 obtained with the RBF kernel is about 0.01 lower than the AUC obtained with random forest. However, despite also being exceeded in SPE, with a value equal to 0.9531, it obtained a SEN equal to 0.9607 with its best parameter choice. This shows that, although its AUC is smaller than that of random forest, the SVM has a high capacity for generalization. There is a greater balance in the classification with both kernels and all parameter values, even though the classes are unbalanced. We come to this conclusion through the high rate of correct classification in both classes. Finally, the k-NN algorithm has the best SEN, equal to 0.9836 using K = 7. We used three different K values to evaluate our model, and in all of them the SEN metric is high. This indicates that the algorithm had a high success rate in the positive class. This positively influenced the AUC metric, which reached 0.9520. The SPE obtained is much lower than those obtained with random forest and with SVM, using any parameters. This has a direct impact on the ACC, which is also the lowest of the three algorithms, at 0.8118 using K = 7. The RBF kernel in SVM obtained the best classification results. It provided not only a high ACC but also a high hit ratio in both classes, whereas the other two algorithms always showed a disproportion between the hit rates of the classes.
With this, we can affirm that the CNN generated features that a hyperplane can correctly separate. The configuration of the gap chosen provided good separation between the two classes. This is relevant for medical diagnosis, as false diagnoses can affect patients’ quality of life. In spite of their considerably high accuracy, one cannot rely on k-NN and random forest in our method, due to their low hit rate in one of the two classes. Thus, we can also say that SVM with an RBF kernel is the most suitable for application in our approach. The state-of-the-art result for classification and diagnosis of AD for the binary class using either deep learning or non-deep learning is around 0.84. For comparison, we selected works of this type that used similar experimental configuration parameters. Table 4.4 compares the approaches, configuration parameters, and results obtained in each work cited. Our approach solves this classification problem with high accuracy, reaching an ACC equal to 0.9607. This indicates that it is competitive with previously published works. After analyzing the methods that do not use deep learning, we can make some observations. Regardless of the feature extractor used, accuracy was always higher using the SVM. This also occurs in our work, where this algorithm has superior results in the ACC metric. Among the works using SVM, Ref. [29] obtained the best result for SPE, equal to 0.9680. This demonstrates that the SVM is one of the algorithms best suited to the problem of AD classification. Despite the mathematical robustness of these feature extractors, none of these works outperformed the ACC of our model. Thus, we can verify the superiority of methods that use a CNN to extract features over the other methods presented. In the analysis of the methods that use deep learning, we can also make some observations. As a feature extractor for this research problem, the CNN provides better results than autoencoders in all metrics analyzed. But this does not detract from the autoencoder approaches, which presented ACC greater than 0.87.
Also, we found that the hybrid model (CNN + SVM) gives a better result than that obtained using only a CNN. This is the case for Refs. [13] and [14], which used sigmoid and softmax in the classification layer and obtained ACC results smaller than ours. From this, we conclude that a more robust classification algorithm tends to yield more accurate results in a hybrid approach. Our approach provides a balance in the learning of both classes (AD and HC). For this reason, it obtained a higher ACC than the analyzed works, whether they use a deep approach or not. Our approach presented a lower AUC than [14] and a lower SPE than [29]. However, our SEN surpassed the others, and we obtained a very close SPE. This reflects the balanced number of correct classifications in the two classes, even though the data set is unbalanced. The main contribution of our work is the fact that the CNN generates optimal and precise features for the research problem of this work. Also, these features boost accuracy if used in a hybrid approach with another, more robust classifier. This is supported by the high hit rates obtained with all the algorithms tested. Moreover, the SVM algorithm was fundamental for balancing the number of correct classifications in both classes. The SVM algorithm offers robustness and ease of learning for AD classification, which also increases accuracy. The proposed architecture is not new in the literature [46]; however, it has not yet been applied to MRI for the diagnosis of AD. The contribution is made explicit by the results obtained in all metrics analyzed.

Table 4.4 Comparison of Alzheimer’s disease binary classification (AD vs HC) with published works.
Refs. | Data set | Feature extraction | Classification algorithm | Data partitioning | ACC | AUC | SEN | SPE
Liu et al. (2014) [36] | ADNI | Stacked autoencoders in gray matter | Softmax | Cross-validation (10 folds) | 0.8776 | - | 0.8857 | 0.8722
Liu et al. (2015) [38] | ADNI | Stacked autoencoders (MRI + PET) | Softmax | - | 0.8792 | - | 0.9366 | 0.8861
Shakeri et al. (2016) [12] | ADNI | Spectral matching | Deep MLP | - | 0.8400 | - | 0.7300 | 0.8900
Han and Zhao (2016) [28] | ADNI | HFS | SVM | Cross-validation (10 folds) | 0.9120 | - | 0.8860 | 0.9230
Beheshti et al. (2017) [29] | ADNI | Voxel values | SVM | Cross-validation (10 folds) | 0.9301 | 0.9351 | 0.8913 | 0.9680
Tong et al. (2017) [30] | ADNI | Graphs | SVM | Holdout (75% train / 25% test) | 0.8620 | 0.9300 | 0.8510 | 0.8610
Silva et al. (2018) [13] | MIRIAD | CNN | Sigmoid CNN | Holdout (80% train / 20% test) | 0.8485 | - | 0.8593 | 0.8314
Liu et al. (2018) [14] | MIRIAD | CNN (landmarked points) | Softmax CNN | ADNI-1 as train and ADNI-2 as test | 0.9275 | 0.9716 | 0.9348 | 0.9130
Our work | MIRIAD | CNN | SVM | Cross-validation (10 folds) | 0.9607 | 0.9569 | 0.9607 | 0.9530

4.5 Conclusion In this work, we presented a brief review of the state of the art in computer-based early diagnosis methods built on machine learning, with a special focus on deep learning. We also presented our own architecture for AD diagnosis, evolved from [13]: a deep architecture for early diagnosis of AD based on implicit feature extraction for the classification of MRI. Our model aims to classify AD patients against a group of patients without the disease. To validate our proposal, we used the MIRIAD AD image database, selecting 30 slices per exam from the upper brain, above the eyes, for learning. The method combines feature extraction through a CNN with classification by supervised algorithms. We performed experiments with random forest, SVM, and k-NN algorithms, obtaining the following results for accuracy: 0.8800, 0.9508, and 0.8512, respectively. The experimental results point to the feasibility of our proposal, which achieves a reasonable generalization capacity even though the database is unbalanced. The model can generate adequate features for this problem. Additionally, these features boost accuracy if used in a hybrid approach with another, more robust classifier. Compared to previous state-of-the-art results, our approach is competitive: it is superior in two of the four metrics evaluated. We conclude that our model is efficient for the diagnosis of Alzheimer’s disease considering two classes (AD vs HC) for binary classification. However, more experiments should be performed using AD MRI databases with three classes, including the MCI stage.

Acknowledgment We are grateful to the Brazilian research agencies Facepe, CAPES, and CNPq for the partial support of this research.

References
[1] E.R. Kandel, J.H. Schwartz, T.M. Jessell, M.B.T. Jessell, S. Siegelbaum, et al., Principles of Neural Science, vol. 4, McGraw-Hill, New York, 2000.
[2] W.P. dos Santos, R.E. de Souza, P.B. dos Santos Filho, Evaluation of Alzheimer’s disease by analysis of MR images using multilayer perceptrons and Kohonen SOM classifiers as an alternative to the ADC maps, 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2007, pp. 2118-2121.

[3] W.P. dos Santos, F.M. de Assis, R.E. de Souza, P.B. Mendes, H.S. de Souza Monteiro, H.D. Alves, A dialectical method to classify Alzheimer’s magnetic resonance images, Evolutionary Computation, IntechOpen, 2009.
[4] W.P. dos Santos, F.M. de Assis, R.E. de Souza, P.B. dos Santos Filho, Evaluation of Alzheimer’s disease by analysis of MR images using objective dialectical classifiers as an alternative to ADC maps, 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2008, pp. 5506-5509.
[5] W. P. d Santos, F. Assis, R. Souza, P.B. Santos Filho, F.L. Neto, Dialectical multispectral classification of diffusion-weighted magnetic resonance images as an alternative to apparent diffusion coefficients maps to perform anatomical analysis, Comput. Med. Imaging Graph. 33 (6) (2009) 442-460.
[6] W.P. dos Santos, F.M. de Assis, R.E. de Souza, P.B. dos Santos Filho, Dialectical classification of MR images for the evaluation of Alzheimer’s disease, in: G.R. Naik (Ed.), Recent Advances in Biomedical Engineering, IntechOpen, 2009.
[7] W.P. dos Santos, R.E. de Souza, P.B. Santos Filho, F.B.L. Neto, F.M. de Assis, A dialectical approach for classification of DW-MR Alzheimer’s images, 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1728-1735.
[8] W.P. dos Santos, F.M. de Assis, R.E. de Souza, P.B. Mendes, H.S. Monteiro, H.D. Alves, Fuzzy-based dialectical non-supervised image classification and clustering, Int. J. Hybrid Intell. Syst. 7 (2) (2010) 115-124.
[9] N. Mammone, L. Bonanno, S.D. Salvo, S. Marino, P. Bramanti, A. Bramanti, et al., Permutation disalignment index as an indirect, EEG-based, measure of brain connectivity in MCI and AD patients, Int. J. Neural Syst. 27 (05) (2017) 1750020.
[10] N. Dey, A.S. Ashour, S. Borra, Classification in BioApps: Automation of Decision Making, vol. 26, Springer, 2017.
[11] K. Lan, D.-t. Wang, S. Fong, L.-s. Liu, K.K. Wong, et al., A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42 (8) (2018) 139.
[12] M. Shakeri, H. Lombaert, S. Tripathi, S. Kadoury, ADNI, deep spectral-based shape features for Alzheimer’s disease classification, International Workshop on Spectral and Shape Analysis in Medical Imaging, Springer, 2016, pp. 15-24.
[13] I.R.R. Silva, R.G. de Souza, G.S. Silva, C.S. de Oliveira, L.H. Cavalcanti, R.S. Bezerra, et al., Utilização de Redes Convolucionais Para Classificação e Diagnóstico da Doença de Alzheimer, in: II Simpósio de Inovação em Engenharia Biomédica, 2018, pp. 73-76.
[14] M. Liu, J. Zhang, E. Adeli, D. Shen, Landmark-based deep multi-instance learning for brain disease diagnosis, Med. Image Anal. 43 (2018) 157-168.
[15] F.R. Cordeiro, W.P. Santos, A.G. Silva-Filho, A semi-supervised fuzzy GrowCut algorithm to segment and classify regions of interest of mammographic images, Expert Syst. Appl. 65 (2016) 116-126.
[16] W.W. Azevedo, S.M. Lima, I.M. Fernandes, A.D. Rocha, F.R. Cordeiro, A.G. da Silva-Filho, et al., Fuzzy morphological extreme learning machines to detect and classify masses in mammograms, 2015 IEEE International Conference on Fuzzy Systems (Fuzz-IEEE), IEEE, 2015, pp. 1-8.
[17] F.R. Cordeiro, W.P. dos Santos, A.G. Silva-Filho, Segmentation of mammography by applying GrowCut for mass detection, Stud. Health Technol. Inform. 192 (2013) 87-91.
[18] F.R. Cordeiro, W.P. dos Santos, A.G. Silva-Filho, An adaptive semi-supervised Fuzzy GrowCut algorithm to segment masses of regions of interest of mammographic images, Appl. Soft Comput. 46 (2016) 613-628.
[19] F.R. Cordeiro, S.M. Lima, A.G. Silva-Filho, W.P. dos Santos, Segmentation of mammography by applying extreme learning machine in tumor detection, International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2012, pp. 92-100.
[20] A.A. Mascaro, C.A. Mello, W.P. dos Santos, G.D. Cavalcanti, Mammographic images segmentation using texture descriptors, 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2009, p. 3653.

[21] S.M. de Lima, A.G. da Silva-Filho, W.P. dos Santos, A methodology for classification of lesions in mammographies using Zernike Moments, ELM and SVM Neural Networks in a multi-kernel approach, 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2014, pp. 988-991.
[22] T. Cruz, T. Cruz, W. Santos, Detection and classification of lesions in mammographies using neural networks and morphological wavelets, IEEE Lat. Am. Trans. 16 (3) (2018) 926-932.
[23] F.R. Cordeiro, K.F.P. Bezerra, W.P. dos Santos, Random walker with fuzzy initialization applied to segment masses in mammography images, 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2017, pp. 156-161.
[24] M. Santana, J. Pereira, N. Lima, F. Sousa, R. de Lima, W. dos Santos, Classificação de lesões em imagens frontais de termografia de mama a partir de sistema inteligente de suporte ao diagnóstico, in: Anais do I Simpósio de Inovação em Engenharia Biomédica-SABIO 2017, 2017, p. 16.
[25] I. Fernandes, W. dos Santos, Classificação de Mamografias Utilizando Extração de Atributos de Textura e Redes Neurais Artificiais, Congresso Brasileiro de Engenharia Biomédica, vol. 8, CBEB 2014, 2014.
[26] M. A. d Santana, J.M.S. Pereira, F. L. d Silva, N. M. d Lima, F. N. d Sousa, G. M. S. d Arruda, et al., Breast cancer diagnosis based on mammary thermography and extreme learning machines, Res. Biomed. Eng. 34 (2018) 45-53. <http://www.scielo.br/scielo.php?script=sci_arttext&pid=S2446-47402018000100045&nrm=iso>.
[27] O. Commowick, A. Istace, M. Kain, B. Laurent, F. Leray, M. Simon, et al., Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure, Sci. Rep. 8 (1) (2018) 13650.
[28] Y. Han, X.-M. Zhao, A hybrid sequential feature selection approach for the diagnosis of Alzheimer’s Disease, Neural Networks (IJCNN), 2016 International Joint Conference on, IEEE, 2016, pp. 1216-1220.
[29] I. Beheshti, H. Demirel, H. Matsuda, ADNI, classification of Alzheimer’s disease and prediction of mild cognitive impairment-to-Alzheimer’s conversion from structural magnetic resonance imaging using feature ranking and a genetic algorithm, Comput. Biol. Med. 83 (2017) 109-119.
[30] T. Tong, K. Gray, Q. Gao, L. Chen, D. Rueckert, ADNI, multi-modal classification of Alzheimer’s disease using nonlinear graph fusion, Pattern Recogn. 63 (2017) 171-181.
[31] L. Deng, D. Yu, et al., Deep learning: methods and applications, Found. Trends Signal Process. 7 (3-4) (2014) 197-387.
[32] Y. Ding, J.H. Sohn, M.G. Kawczynski, H. Trivedi, R. Harnish, N.W. Jenkins, et al., A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain, Radiology 290 (2) (2018) 456-464.
[33] F.C. Morabito, M. Campolo, N. Mammone, M. Versaci, S. Franceschetti, F. Tagliavini, et al., Deep learning representation from electroencephalography of early-stage Creutzfeldt-Jakob disease and features for differentiation from rapidly progressive dementia, Int. J. Neural Syst. 27 (02) (2017) 1650039.
[34] G.F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: Advances in Neural Information Processing Systems, 2014, pp. 2924-2932.
[35] L. Deng, M.L. Seltzer, D. Yu, A. Acero, A.-R. Mohamed, G. Hinton, Binary coding of speech spectrograms using a deep auto-encoder, in: Eleventh Annual Conference of the International Speech Communication Association, 2010.
[36] S. Liu, S. Liu, W. Cai, S. Pujol, R. Kikinis, D. Feng, Early diagnosis of Alzheimer’s disease with deep learning, Biomedical Imaging (ISBI), 2014 IEEE 11th International Symposium on, IEEE, 2014, pp. 1015-1018.
[37] D. Heckerman, C. Meek, Models and selection criteria for regression and classification, Proceedings of the Thirteenth conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc, 1997, pp. 223-228.

[38] S. Liu, S. Liu, W. Cai, H. Che, S. Pujol, R. Kikinis, et al., Multimodal neuroimaging feature learning for multiclass diagnosis of Alzheimer’s disease, IEEE Trans. Biomed. Eng. 62 (4) (2015) 1132-1140.
[39] I.B. Malone, D. Cash, G.R. Ridgway, D.G. MacManus, S. Ourselin, N.C. Fox, et al., MIRIAD: public release of a multiple time point Alzheimer’s MR imaging dataset, NeuroImage 70 (2013) 33-36.
[40] I.R.R. Silva, R.A.A. Fagundes, T.S.M.C. de Farias, Techniques for automatic liver segmentation in medical images of abdomen, IEEE Lat. Am. Trans. 16 (6) (2018) 1801-1808. Available from: https://doi.org/10.1109/TLA.2018.8444402.
[41] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278-2324.
[42] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[43] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929-1958.
[44] K. Oppedal, T. Eftestøl, K. Engan, M.K. Beyer, D. Aarsland, Classifying dementia using local binary patterns from different regions in magnetic resonance images, J. Biomed. Imaging 2015 (2015) 5.
[45] S.H. Nozadi, S. Kadoury, ADNI, classification of Alzheimer’s and MCI patients from semantically parcelled PET images: a comparison between AV45 and FDG-PET, Int. J. Biomed. Imaging 2018 (2018).
[46] X.-X. Niu, C.Y. Suen, A novel hybrid CNN-SVM classifier for recognizing handwritten digits, Pattern Recogn. 45 (4) (2012) 1318-1325.

CHAPTER FIVE

Musculoskeletal radiographs classification using deep learning
N. Harini, B. Ramji, S. Sriram, V. Sowmya and K.P. Soman
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India

5.1 Introduction In recent years, deep learning (DL) has become very popular and is used to solve research problems in a wide range of fields such as biomedicine [1,2], cybersecurity [3], and autonomous vehicles [4]. DL models are replacing traditional machine learning (ML) models because they can automatically extract useful features from the input, whereas traditional ML models require manual feature engineering. In the biomedical field, the emergence of DL has overcome many issues faced by traditional ML-based approaches [5]. One of the most popular DL architectures is the convolutional neural network (CNN). The CNN is very popular in computer vision because it uses convolution filters to extract location-invariant features. Since CNN models work well with images, various CNN models and variants are used in biomedical applications such as microscopic image classification [6], x-ray reconstruction [7], mammogram detection [8], liver lesion classification [9], brain MRI scan segmentation [10], and plantar pressure image clustering [11]. Our study uses radiographic images for abnormality detection; feature extraction from such images is performed well by trained radiologists. With the increase in population, computer-aided diagnosis is much needed. The key requirement for any such diagnostic procedure is performance on par with trained radiologists. The purpose of this study is to provide assistance in abnormality detection in radiographic images, which may be useful to radiologists. A radiographic image is produced when x-rays penetrating an object are captured on photographic film. The object density and the geometry of the source and object influence the intensity of exposure on the film [12]. The material properties, thickness, density, and absorption characteristics of the object influence the information content of the image [12]. The image intensifier’s speed [13] and the x-ray focal spot size [14] influence the resolution of the image.

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00006-5

© 2020 Elsevier Inc. All rights reserved.


N. Harini et al.

In computer vision, the use of a pretrained network and further fine-tuning of the network is referred to as transfer learning [15]. A pretrained network is one that was trained on a benchmark data set and obtained successful results. In the literature, most CNN models are pretrained on the ImageNet data set from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [16]. All the pretrained models expect three-channel (RGB) images, and each pretrained network has a default input image size. In a CNN, the first layers are general feature extractors, so they can be transferred to a new task, while the last layers learn the features specific to that task [17]. Training an entire CNN model demands a lot of computational power and a large data set. The modification should therefore be made at the task-specific feature-extraction phase of the pretrained network to suit the application of our study. In our study, networks pretrained on natural images are fine-tuned to medical radiographic images by adding a few layers at the end of the pretrained models. Only the added layers were trained on the radiograph images, while the layers of the pretrained network were frozen, as shown in Fig. 5.1.

Our objective is to study the effectiveness of various CNN-based architectures for detecting abnormalities in radiograph images. We obtained radiograph images of the upper extremities (shoulder, wrist, finger) from the musculoskeletal radiographs (MURA) [18] data set, which contains finger, wrist, elbow, forearm, hand, humerus, and shoulder radiograph images. We studied the performance of five well-known ImageNet models (Xception [19], Inception [20], VGG-19 [21], DenseNet-169 [22], and MobileNet [23]). All the models that we studied perform binary classification. The main contributions of our study are as follows:
• We analyze the effectiveness of five CNN-based pretrained architectures for detecting abnormalities in upper extremity radiograph images including shoulder, wrist, and finger.
• We compare the models on metrics such as accuracy, precision, recall, and F1-score.

Figure 5.1 Pretrained CNN architecture diagram (legend: frozen, trained, prediction).

Musculoskeletal radiographs classification using deep learning

• We discuss the pros and cons of using pretrained CNN architectures on radiographic images of different body regions.
• We examine the challenges that the data set poses for applying CNN-based architectures to abnormality detection in radiographic images.

The rest of this chapter is organized as follows: Section 5.2 presents the literature survey on various biomedical applications. Section 5.3 describes the MURA data set and the challenges it poses. Section 5.4 presents the proposed methodologies. Section 5.5 defines the statistical measures. Section 5.6 provides experimental results and discussion. Finally, Section 5.7 concludes the chapter.

5.2 Related works
Manual detection of abnormalities in radiographic images is a time-consuming task that requires medical expertise. Therefore, various ML- and DL-based solutions have been proposed over the last decade. ML-based x-ray classifiers are proposed in Refs. [24] and [25]. In Ref. [24], a three-level feature extraction is performed on x-ray images and principal component analysis (PCA) is used to reduce the feature dimensionality. The extracted features are fed into ML classifiers such as the support vector machine (SVM) and k-nearest neighbor (k-NN). The experimental results showed that SVM performed better than k-NN. A random forest-based radiograph classifier is proposed in Ref. [25] and its performance is compared with SVM. The proposed model achieved a precision of 93.10% while the SVM method achieved 89.07%. The drawback of ML-based approaches is that they need manual feature engineering, which is a tedious task, especially in biomedical applications. Therefore, several DL-based solutions have been proposed, as they can extract features automatically.

CNN-based architectures are used in many biomedical applications where x-ray analysis is involved. In Ref. [26], CNN-based fast detection of intervertebral disks is proposed, where the model is trained using 81 lateral lumbar x-ray images. The model achieves a precision of 0.905 with an average computation time of 3 seconds per image, which greatly outperformed traditional methods in terms of accuracy and efficiency. A CNN-based robust finger joint detection method for radiographs is proposed in Ref. [27]. The proposed system achieves 98.02% average detection accuracy on 130 test data sets containing over 1950 joints. A CNN-based wrist fracture detection system is proposed in Ref. [28]. The data set in this study consists of wrist radiograph images from the MURA data set [18] and other radiographs of posteroanterior and lateral views. The images are segmented to create patches, and those patches are fed into a CNN for fracture detection. The patch preparation is carried out by a global


search initially with random forest regression voting (RFRV) followed by a local search performed by a sequence of random forest constrained local models (RFCLMs). The resulting area under the ROC curve (AUC) is 96%.

Many well-known CNN architectures, including Inception v3, Xception, and DenseNet-169, are used in various biomedical applications. In Ref. [29], abnormality detection on chest x-ray images from a Stanford normal radiology diagnostic data set is performed using a CNN-based Inception V3 model and GoogLeNet, which is also based on the inception module. This approach has shown promise in creating a classification network to assist radiologists and physicians. Beyond radiographic abnormality detection, Inception v3 is also used for PET scan diagnosis [30]. The model is trained and tested with 90% and 10% of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data set, respectively, to predict the diagnosis of Alzheimer’s disease, achieving 0.98 AUC, 82% specificity, and 100% sensitivity. In both Refs. [29] and [30], the Inception v3 model shows good performance, and it is also used in our work. CNN-based ensemble models are used for ankle fracture detection in Ref. [31]. The proposed models are trained on a small data set to classify ankle radiographic images as normal or abnormal. The architectures are implemented as single-view and three-view ensemble models. Among the five architectures in the ensemble, Xception and Xception with dropout and auxiliary tower yielded better results than the other three models (Inception v3, ResNet, and ResNet with dropout and auxiliary tower). A large data set for abnormality detection in musculoskeletal radiographs is proposed in Ref. [18], where the authors also classified radiograph images using a multiview 169-layer DenseNet model. The proposed model achieves a Cohen’s kappa statistic of 0.389. In our work, we have used the finger, wrist, and shoulder data from this data set and analyzed the effectiveness of several CNN-based models. DL-based radiographic abnormality detection for wrist, hand, and ankle data is proposed in Ref. [32]. A few models, such as VGG-19, are trained to classify radiographs into four categories: laterality, fracture, body part, and exam view. All models achieve at least 90% accuracy for all categories except fracture.

The transfer learning approach works when the base and target tasks share common features [17]. Several transfer learning-based solutions that use CNN-based pretrained architectures are proposed for biomedical problems. A transfer learning-based child pneumonia detection model, which trains two pretrained models on chest x-ray images, is proposed in Ref. [33]. Inception v3 and MobileNet architectures are trained on 80% of the data to classify radiograph images as bacterial pneumonia, viral pneumonia, or no pneumonia. In this work, MobileNet performed better than Inception v3, with 81.4% accuracy on the remaining 20% of the data. In Ref. [34], a chest x-ray abnormality detection and localization framework is proposed in which several pretrained models such as VGG-19, AlexNet, and ResNet-50 are trained. The experimental analysis shows that the best performance is achieved by the VGG-19 model, which is also used in our work.


5.3 Data set description and challenges
5.3.1 Description of the data set
The MURA data set [18] is the largest publicly available musculoskeletal radiographic image data set. It contains multiview images of the finger, hand, wrist, forearm, elbow, humerus, and shoulder from the upper extremity region. It consists of 40,561 radiograph images in total from 14,656 studies of 12,173 patients, and every study has one or more radiographic images that are manually labeled by radiologists. This data set was collected and released by the Stanford ML group as part of the Bone X-Ray DL Competition [18]. The group has released the train and valid sets, while the test set is kept hidden. The train and valid sets contain 36,812 and 3749 images, respectively, and the total number of train and valid images in each study type is shown in Fig. 5.2. The MURA data set has two classes (normal and abnormal), and it contains 23,606 normal and 16,401 abnormal radiographs in total. The train set contains 21,939 normal and 14,873 abnormal radiograph samples, whereas the valid set contains 1667 normal and 1528 abnormal radiograph samples. Figs. 5.3 and 5.4 show the total number of normal and abnormal images in the train and valid sets for each study type. In this data set, there are 1912 elbow studies, 2110 finger studies, 2185 hand studies, 727 humerus studies, 1010 forearm studies, 3015 shoulder studies, and 3697 wrist studies. While analyzing the data set, we observed that each study has one or more

Figure 5.2 Total train and valid images in all study types.


Figure 5.3 Total normal and abnormal images in train set for all study types.

Figure 5.4 Total normal and abnormal images in valid set for all study types.

images and most of the studies in the data set have three images. In our work, we have used only finger, wrist, and shoulder radiographs as they are the least imbalanced when compared to other study types. Since the test set is hidden, we have used the valid data set for testing the performance of our models. Table 5.1 shows the statistics of the data used in our work.


Table 5.1 Total radiographs in finger, wrist, and shoulder data set.

Study type                   Train                 Test                  Total images in each study type
                             Normal    Abnormal    Normal    Abnormal
Finger                       3138      1968        214       247         5567
Wrist                        5769      3987        364       295         10,415
Shoulder                     4211      4168        285       278         8942
Total images in each class   13,118    10,123      863       820         Total: 24,924

5.3.2 Challenges faced
We observed several challenges while working on this analysis, the most significant of which came from the data set. The original MURA data set contains radiograph images of different views, but annotations of those views were not available in the open challenge data set released by the Stanford ML group. This may significantly degrade the performance of our models. Further, the data set is highly imbalanced, which is evident from Figs. 5.3 and 5.4. The complexities of the open challenge data set are the following:
1. The data set contains images that are obfuscated and rotated.
2. It includes images of varying depths; in other words, the data set consists of both grayscale and RGB images.
3. Some of the images contain multiple views and noisy information.
4. A few of the images have a large background of varying size.

5.4 Proposed methodologies
5.4.1 Data preprocessing
Finger, wrist, and shoulder images are first resized to the default input size of each architecture and then normalized. Further, to improve the performance of our models, image augmentation is applied. Image augmentation is the process of creating synthetic images by applying processing techniques such as rotation and flips. Rotation (30 degrees), vertical flip, and horizontal flip are the three kinds of processing that we used for image augmentation.
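The normalization and flip operations described above can be illustrated on a tiny image. The following is a minimal pure-Python sketch, not the authors' implementation: it assumes 8-bit pixel values scaled to [0, 1] as the normalization (the chapter does not state the scheme), the function names are hypothetical, and the 30-degree rotation is omitted because it needs interpolation, for which an image library would normally be used.

```python
def normalize(image, max_value=255.0):
    """Scale pixel values to the range [0, 1] (assumed normalization scheme)."""
    return [[pixel / max_value for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom by reversing the row order."""
    return image[::-1]


def augment(image):
    """Return the original image plus its two flipped copies."""
    return [image, horizontal_flip(image), vertical_flip(image)]


# A 2 x 3 grayscale "radiograph" with 8-bit intensities (made-up values).
image = [[0, 51, 102],
         [153, 204, 255]]

normalized = normalize(image)
augmented = augment(normalized)
print(len(augmented))  # one original and two flipped copies
```

Each flip keeps the label of the source image, so augmentation of this kind enlarges the training set without new annotation.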

5.4.2 Inception
Inception V3 [20] allows the depth of the model to be expanded without an increase in computational cost by using factorized convolutions and aggressive regularization. Inception V3 mitigates the vanishing gradient problem through batch normalization, which was introduced in Inception V2. In addition, the factorized convolutions give better results with dimensionality reduction. The inception module performs convolution and pooling operations in parallel, and repeating this block makes the network effective by letting it select the combination of convolution and pooling suited to the task [35]. The architecture has convolution and pooling layers with factorization into asymmetric convolutions combined with high-dimensional representations, and regularization is implemented by auxiliary classifiers, as represented in Fig. 5.5.

Figure 5.5 Inception V3 architecture diagram.


The Inception V3 network has 23,851,784 parameters, which is comparatively modest in terms of computation. The network used in our study is the pretrained Inception v3 network, which was trained on the ImageNet data set and whose weights were loaded unchanged. The pretrained network ends in a fully connected layer of 1000 neurons in order to classify the 1000 classes of the ImageNet data set, as shown in Fig. 5.5. On top of this, four fully connected layers were added: three layers of 512, 256, and 64 neurons, and a final dense layer of two neurons with softmax activation to classify the MURA data set as normal or abnormal. As the pretrained network is used, the images are resized to the default input size of 299 × 299 × 3. This implementation of the network has 24,512,202 parameters, of which 660,418 are trainable.
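The trainable parameter count quoted above can be reproduced with simple arithmetic. The sketch below assumes that each added fully connected layer has one weight per input-output pair plus one bias per neuron, and that the added layers are attached to the 1000-neuron ImageNet output layer; the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def head_params(n_inputs, layer_sizes):
    """Total parameters of a stack of fully connected layers."""
    total = 0
    for n_units in layer_sizes:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units
    return total


# The added layers sit on top of the 1000-way ImageNet output layer.
ADDED_LAYERS = [512, 256, 64, 2]
trainable = head_params(1000, ADDED_LAYERS)

# Parameter count of the pretrained Inception V3 network, as stated in the text.
PRETRAINED_INCEPTION_V3 = 23_851_784
total = PRETRAINED_INCEPTION_V3 + trainable

print(trainable)  # 660418
print(total)      # 24512202
```

Because the same four layers are appended to every pretrained network in this study, the trainable count of 660,418 is the same for each architecture; only the frozen part changes.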

5.4.3 Xception
The Xception architecture [19] is built on depthwise separable convolutions, which consist of a channel-wise spatial convolution followed by a point-wise convolution that projects the channels onto a new channel space. The main advantage claimed for this implementation of depthwise separable convolutions is the absence of nonlinearities. The 36 convolutional layers of the Xception network form the feature extractor, and they are structured as 14 modules, which are grouped into the entry flow, middle flow, and exit flow of the architecture as shown in Fig. 5.6. The entry flow consists of four modules: a convolution layer module and three residual connection modules. Each residual connection module has two separable convolutions and is connected residually through a point-wise convolution. The middle flow has eight modules, each of which is an identity residual connection around three separable convolution layers. Finally, the exit flow consists of a residual module followed by a pooling layer and the final fully connected layers. The final fully connected layers for classification are preceded by a global average pooling layer. Here, the pretrained network has a final fully connected layer of a thousand neurons to classify the ImageNet data. The final dense layer of the pretrained architecture is followed by dense layers of 512, 256, and 64 neurons with ReLU activation, and only these dense layers are trained on the train data set. The network uses 23,570,898 parameters in total, including trainable and nontrainable parameters. To classify the MURA data set as normal or abnormal, a dense layer of two neurons with softmax activation is used. The default input image size for the pretrained network is 299 × 299 × 3.
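To make the two stages of a depthwise separable convolution concrete, the following is a minimal pure-Python sketch under simplifying assumptions: stride 1, no padding, no bias, and no activation. The function names and the toy data are hypothetical, and the code only illustrates the operation, not the Xception network itself.

```python
def depthwise_conv(image, kernels):
    """Convolve each channel with its own k x k kernel.

    image:   list of channels, each a list of rows (C x H x W)
    kernels: one k x k kernel per channel (C x k x k)
    """
    output = []
    for channel, kernel in zip(image, kernels):
        k = len(kernel)
        height, width = len(channel), len(channel[0])
        output.append([
            [sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(width - k + 1)]
            for i in range(height - k + 1)
        ])
    return output


def pointwise_conv(feature_maps, weights):
    """1 x 1 convolution: mix channels at every pixel.

    weights: one row of per-input-channel weights per output channel.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [[sum(w * fm[i][j] for w, fm in zip(row, feature_maps))
          for j in range(width)]
         for i in range(height)]
        for row in weights
    ]


# Toy input: two 3 x 3 channels, one 2 x 2 kernel per channel.
image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
         [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
kernels = [[[1, 1], [1, 1]],
           [[1, 1], [1, 1]]]
weights = [[1, 1],    # output channel 0 = channel 0 + channel 1
           [1, -1]]   # output channel 1 = channel 0 - channel 1

depthwise_out = depthwise_conv(image, kernels)
output = pointwise_conv(depthwise_out, weights)
print(output)
```

The spatial filtering and the channel mixing are handled by separate, smaller sets of weights, which is where the savings over a standard convolution come from.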

5.4.4 VGG-19
The VGG model was developed by the Visual Geometry Group at the University of Oxford, and the pretrained version used here was trained on the ImageNet data set. VGG-19 has 16 convolutional layers, five max pooling layers with a pool size of 2 × 2, and three fully connected layers.


Figure 5.6 Xception model flow diagram.

Figure 5.7 VGG-19 architecture diagram.

Each convolution layer has a filter kernel size of 3 × 3, but the depth increases through the network as shown in Fig. 5.7. The first two fully connected layers have 4096 neurons each and the last fully connected layer has 1000 neurons. The default input size for the pretrained VGG-19 is 224 × 224 × 3 [21]. Though the network has a large number of learnable parameters (143,667,240), it has the ability to learn deep features. In our study, the pretrained VGG-19 network is fine-tuned by adding three more fully connected layers with 512, 256, and 64 neurons, respectively. To classify the images as normal or abnormal, a final classification dense layer with softmax activation has been added. This implementation has 144,327,658 parameters, of which 660,418 are trainable.
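The parameter count of the pretrained VGG-19 network follows from the layer description above. The sketch below is a worked check rather than the authors' code; the per-block channel widths (64, 128, 256, 512, 512) are taken from the published VGG-19 configuration and are not stated in this chapter, and padding is assumed to preserve the spatial size inside each block.

```python
# Output channels of the 16 convolutional layers, grouped into five blocks;
# each block is followed by a 2 x 2 max pooling layer. The widths come from
# the published VGG-19 configuration (an assumption, not from this chapter).
CONV_BLOCKS = [[64, 64], [128, 128], [256] * 4, [512] * 4, [512] * 4]
FC_LAYERS = [4096, 4096, 1000]
INPUT_SIZE, INPUT_CHANNELS, KERNEL = 224, 3, 3


def count_vgg19_params():
    """Count weights and biases of all convolutional and dense layers."""
    total = 0
    channels, size = INPUT_CHANNELS, INPUT_SIZE
    for block in CONV_BLOCKS:
        for out_channels in block:
            # kernel weights plus one bias per output channel
            total += KERNEL * KERNEL * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # 2 x 2 max pooling halves the spatial size
    features = size * size * channels  # flattened input to the first FC layer
    for units in FC_LAYERS:
        total += features * units + units
        features = units
    return total


n_conv_layers = sum(len(block) for block in CONV_BLOCKS)
print(n_conv_layers)         # 16
print(count_vgg19_params())  # 143667240
```

Most of the parameters sit in the first fully connected layer, which explains why VGG-19 is by far the largest of the five networks studied here.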

5.4.5 DenseNet
According to the literature, the efficiency of a deep CNN depends on closer connections between the intermediate layers [22]. In DenseNet this is achieved by passing the information from the previous layer to the next layer by stacking up (concatenating) the feature maps of both layers. This approach reduces the number of parameters, encourages the reuse of features, strengthens feature propagation, and alleviates the vanishing gradient issue during backpropagation. The DenseNet architecture consists of convolutional layers, pooling layers, dense blocks, transition layers, and a fully connected layer as shown in Fig. 5.8. Each layer in a dense block consists of a 1 × 1 convolution, which reduces the number of feature maps and thereby improves computational efficiency, followed by a 3 × 3 convolution. In between the dense blocks, transition layers are used to reduce the number of feature maps. The DenseNet model has different versions such as DenseNet-121, DenseNet-169, DenseNet-201, and DenseNet-264. In our study, we fine-tuned the pretrained DenseNet-169 architecture by training only the three added fully connected layers, as in the previous methods. The final dense layer of two neurons is used for the abnormality classification of musculoskeletal images. The total number of parameters is 14,968,298, including both the trainable and nontrainable parameters.

Figure 5.8 DenseNet-169 architecture diagram.
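The effect of stacking feature maps inside dense blocks and shrinking them in transition layers can be traced with a few lines of arithmetic. The configuration below (64 initial feature maps, growth rate 32, blocks of 6, 12, 32, and 32 layers, transitions that halve the feature maps) is assumed from the published DenseNet-169 design and is not given in this chapter; the function name is hypothetical.

```python
# Assumed DenseNet-169 configuration from the published architecture.
INITIAL_FEATURES = 64
GROWTH_RATE = 32
BLOCK_LAYERS = [6, 12, 32, 32]
COMPRESSION = 0.5


def feature_maps_per_block():
    """Return (input, output) feature-map counts for each dense block."""
    history = []
    features = INITIAL_FEATURES
    for index, n_layers in enumerate(BLOCK_LAYERS):
        block_input = features
        # Every layer concatenates GROWTH_RATE new maps onto its input.
        features += n_layers * GROWTH_RATE
        history.append((block_input, features))
        if index < len(BLOCK_LAYERS) - 1:
            # The transition layer between blocks reduces the feature maps.
            features = int(features * COMPRESSION)
    return history


for block_input, block_output in feature_maps_per_block():
    print(block_input, "->", block_output)

# Two convolutions (1 x 1 and 3 x 3) per dense layer, plus the initial
# convolution, three transition convolutions, and the final dense layer.
depth = 2 * sum(BLOCK_LAYERS) + 1 + 3 + 1
print(depth)  # 169
```

Under these assumptions the last dense block outputs 1664 feature maps, and the layer count works out to the 169 in the model's name.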

5.4.6 MobileNet
There is a need for efficient DL models that can be executed on mobile phones and other small devices, which demand less computation. MobileNets are designed to overcome such bottlenecks in the DL era [23]. MobileNet is built from depthwise separable convolutional layers, which split the feature maps of the previous layer into separate channels, convolve each channel with its own filter kernel, and then stack the results up as the output of the layer. This reduces the number of parameters and the amount of computation, which makes the network suitable for mobile devices. The MobileNet architecture, as shown in Fig. 5.9, has one standard convolution layer followed by 13 depthwise separable convolutional layers and one fully connected layer. Each convolution layer is followed by batch normalization and a ReLU activation function. Further fine-tuning is implemented as in the previous methodologies. The total number of parameters of the model is 4,914,282, which is fewer than in the other models implemented in our study.
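The saving from a depthwise separable layer can be quantified by comparing it with a standard convolution of the same shape. The sketch below is illustrative only: biases are ignored, the layer dimensions are hypothetical, and the function names are not from any library.

```python
def standard_conv_cost(kernel, c_in, c_out, out_size):
    """Parameters and multiply-adds of a standard convolution (bias ignored)."""
    params = kernel * kernel * c_in * c_out
    return params, params * out_size * out_size


def separable_conv_cost(kernel, c_in, c_out, out_size):
    """Parameters and multiply-adds of a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixing channels
    params = depthwise + pointwise
    return params, params * out_size * out_size


# Hypothetical layer: 3 x 3 kernel, 256 -> 256 channels, 14 x 14 output map.
std_params, std_ops = standard_conv_cost(3, 256, 256, 14)
sep_params, sep_ops = separable_conv_cost(3, 256, 256, 14)

print(std_params, sep_params)       # 589824 67840
print(round(sep_ops / std_ops, 4))  # 0.115, i.e. 1/c_out + 1/kernel**2
```

For 3 × 3 kernels the ratio approaches 1/9 as the number of output channels grows, so such a layer needs roughly eight to nine times less computation than its standard counterpart.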

Figure 5.9 MobileNet architecture diagram.


5.5 Statistical indicators
To measure the performance of the proposed models for abnormality detection in radiograph images, we relied on the confusion matrix, which summarizes the performance of a classification model. The confusion matrix contains the following terms:
• True positive (TP): the number of radiograph images that are correctly predicted as normal.
• True negative (TN): the number of radiograph images that are correctly predicted as abnormal.
• False positive (FP): the number of radiograph images that are actually abnormal but incorrectly predicted as normal.
• False negative (FN): the number of radiograph images that are actually normal but incorrectly predicted as abnormal.
Using the above terms, we estimated the accuracy, precision, recall, and F1-score. These are defined as follows:

Accuracy: Represents the number of correct predictions (TP and TN) over all predictions.

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \]

Precision: Denotes the number of correct positive predictions divided by the number of all positive predictions.

\[ \mathrm{Precision} = \frac{TP}{FP + TP} \]

Recall: Denotes the number of correct positive predictions over the number of all relevant samples.

\[ \mathrm{Recall} = \frac{TP}{FN + TP} \]

F1-score: Denotes the harmonic mean of recall and precision.

\[ \mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \]
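The four measures can be computed directly from the confusion-matrix counts. The sketch below uses a hypothetical helper name and, as a worked example, the VGG-19 finger counts reported in Section 5.6.1 (214 normal test images of which 57 are misclassified, and 247 abnormal test images of which 182 are misclassified), with normal taken as the positive class as in the definitions above. It does not guard against division by zero, which would be needed if a model never predicted the positive class.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (fp + tp)
    recall = tp / (fn + tp)
    f1_score = 2 * (recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
    }


# VGG-19 on the finger test images, with "normal" as the positive class.
tp = 214 - 57    # normal images correctly predicted as normal
fn = 57          # normal images predicted as abnormal
fp = 182         # abnormal images predicted as normal
tn = 247 - 182   # abnormal images correctly predicted as abnormal

metrics = classification_metrics(tp, tn, fp, fn)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Rounded, these values are an accuracy of 0.4816, a precision of 0.46, a recall of 0.73, and an F1-score of 0.57, which agree with the VGG-19 normal-class entries reported for the finger test images.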

5.6 Experimental results and discussions
The models are implemented with the Keras library on radiographic images of three different body regions: finger, wrist, and shoulder. The Keras library provides all the CNN-based pretrained networks as model functions for easy implementation. Each body region is trained individually with the proposed methods, and the results are compared across the implemented architectures. The results for each region of the body are given in Tables 5.3, 5.5, and 5.7.

5.6.1 Finger radiographic image classification
Table 5.3 lists the precision, recall, F1-score, and accuracy for the normal and abnormal classes of the finger radiograph images for each implemented architecture. The finger data comprise 5567 images in total, consisting of 5106 train images and 461 test images. Of the 5106 train images, 3138 are normal and 1968 are abnormal. The training accuracies for Inception v3, Xception, VGG-19, DenseNet-169, and MobileNet are given in Table 5.2. From the table it is evident that the Xception network gives the highest training accuracy and lowest training loss, but when the model is tested with the test data, its accuracy is not as good as that of other architectures. The highest accuracy on the test data in Table 5.3 is obtained by DenseNet-169. The precision of the normal class, precision of the abnormal class, recall of the normal class, and F1-score of the normal class are also the highest in the DenseNet-169 results. As a majority of the performance measures are better for DenseNet-169, we might conclude that DenseNet-169 gives better performance. However, the recall of the abnormal class for DenseNet-169 is very low, which means that most of the available abnormal images are classified as normal; this can be clearly seen in the confusion matrix of DenseNet-169 shown in Fig. 5.10. In biomedical applications, both false positives and false negatives contribute significantly to poor system performance. The highest recall of the abnormal class is found in the results obtained by VGG-19, so when all the measures are considered, this model is less biased toward either class than the other architectures. The performance of the VGG-19 model can be analyzed from the confusion matrix obtained by the classification of test data. The FN is the lowest in the results obtained by the VGG-19 model and the FP is not very high. The total number of test images for the finger is 461, of which 214 belong to the normal class and 247 belong to the abnormal class. From the confusion matrix of the results obtained by the VGG-19 model, 57 of the 214 normal images are misclassified as abnormal and 182 of the 247 abnormal images are misclassified as normal. In the resulting analysis of the models implemented on the finger radiographic images, the performance of VGG-19 with transfer learning is comparatively better.

Table 5.2 Training accuracies of five models on finger images.

Architectures   Training loss   Training accuracy
Inception v3    0.5502          0.7175
Xception        0.5429          0.7286
VGG-19          0.5448          0.6888
DenseNet-169    0.5442          0.7218
MobileNet       0.5526          0.7098

Table 5.3 Comparison of performance metrics of five models on finger test images.

Architectures   Accuracy   Class      Precision   Recall   F1-score
Inception V3    0.4729     Normal     0.46        0.81     0.59
                           Abnormal   0.52        0.18     0.27
Xception        0.4642     Normal     0.46        0.94     0.62
                           Abnormal   0.50        0.05     0.09
VGG-19          0.4816     Normal     0.46        0.73     0.57
                           Abnormal   0.53        0.26     0.35
DenseNet-169    0.4967     Normal     0.48        0.96     0.64
                           Abnormal   0.73        0.10     0.17
MobileNet       0.4403     Normal     0.45        0.89     0.60
                           Abnormal   0.35        0.05     0.09

Figure 5.10 Confusion matrix for (A) Inception v3 (B) Xception (C) VGG-19 (D) DenseNet-169 (E) MobileNet on finger test images.

5.6.2 Wrist radiographic image classification
Of the radiographic images of the three body regions, the wrist has the largest number, with 9756 train images, of which 5769 belong to the normal class and 3987 belong to the abnormal class. From Table 5.4, though the training accuracy of the Xception model is higher than that of the other models, its accuracy on test images is lower than that of several other models. Every model is tested on 659 images, of which 364 are normal and 295 are abnormal. From Table 5.5, one can infer that DenseNet-169 achieves the highest accuracy on test images. Though the precision scores of the normal and abnormal classes are high in the results obtained by the DenseNet-169 model, it performs poorly when classifying images that belong to the abnormal class. The poor performance on the abnormal class is clear from the recall score for the abnormal class in Table 5.5. The model that performs well with a low FN is MobileNet. The performance of the MobileNet model is supported by the recall and the F1-score of the abnormal class and also by the confusion matrix shown in Fig. 5.11. The confusion matrix of the MobileNet model has 234 images as FNs and 24 images as false positives, which shows that the model performs comparatively better with both normal and abnormal classes.

Table 5.4 Training accuracies of five models on wrist images.

Architectures   Training loss   Training accuracy
Inception V3    0.5427          0.7328
Xception        0.5229          0.7442
VGG-19          0.6358          0.6406
DenseNet        0.5394          0.7339
MobileNet       0.5543          0.7228

Table 5.5 Comparison of performance metrics of five models on wrist test images.

Architectures   Accuracy   Class      Precision   Recall   F1-score
Inception V3    0.5493     Normal     0.56        0.91     0.69
                           Abnormal   0.48        0.11     0.17
Xception        0.5478     Normal     0.55        0.94     0.70
                           Abnormal   0.46        0.06     0.11
VGG-19          0.5159     Normal     0.54        0.77     0.64
                           Abnormal   0.42        0.21     0.28
DenseNet-169    0.5630     Normal     0.57        0.89     0.69
                           Abnormal   0.54        0.16     0.25
MobileNet       0.5524     Normal     0.57        0.80     0.66
                           Abnormal   0.50        0.25     0.33

Figure 5.11 Confusion matrix for (A) Inception v3 (B) Xception (C) VGG-19 (D) DenseNet-169 (E) MobileNet on wrist test images.

5.6.3 Shoulder radiographic image classification
It can be observed from Table 5.6 that the Xception and DenseNet-169 models performed better in training than the other models. The Xception model achieves the highest training accuracy and the second lowest training loss, while DenseNet-169 achieves the lowest training loss and the second highest training accuracy. However, when the performance of the models is evaluated on test data, it can be observed from Table 5.7 that DenseNet-169 achieves better accuracy than Xception and all other models. Even though it achieves better precision and recall with respect to the normal class, its recall is poor for the abnormal class. Based on recall, it can be observed that the performance of all models except VGG-19 is somewhat biased toward the normal class. Even though VGG-19 has poor accuracy when compared with the other models, it achieves better recall with respect to the abnormal class. The confusion matrix is shown in Fig. 5.12. From the confusion matrix, one can infer that the lowest FN (classifying an abnormal radiograph as a normal one) is achieved by VGG-19 and the lowest FP is achieved by DenseNet-169. Therefore, we can deduce that both VGG-19 and DenseNet-169 perform comparatively better than all other models.

Table 5.6 Training accuracies of five models on shoulder images.

Architectures   Training loss   Training accuracy
Inception v3    0.6097          0.6760
Xception        0.5967          0.6907
VGG-19          0.6714          0.5849
DenseNet-169    0.5927          0.6874
MobileNet       0.6012          0.6800

Table 5.7 Comparison of performance metrics of five models on shoulder test images.

Architectures   Accuracy   Class      Precision   Recall   F1-score
Inception V3    0.4956     Normal     0.50        0.84     0.63
                           Abnormal   0.47        0.15     0.22
Xception        0.5115     Normal     0.51        0.94     0.66
                           Abnormal   0.54        0.08     0.13
VGG-19          0.4742     Normal     0.47        0.36     0.41
                           Abnormal   0.47        0.59     0.53
DenseNet-169    0.5275     Normal     0.52        0.94     0.67
                           Abnormal   0.64        0.10     0.17
MobileNet       0.5204     Normal     0.52        0.89     0.65
                           Abnormal   0.56        0.14     0.22

Figure 5.12 Confusion matrix for (A) Inception v3 (B) Xception (C) VGG-19 (D) DenseNet-169 (E) MobileNet on shoulder test images.

5.7 Conclusion
In this chapter, the effectiveness of different CNN-based pretrained architectures for detecting abnormalities in upper extremity radiograph images of various body parts is analyzed. The performance analysis was based on the standard metrics of precision, recall, and F1-score. The main contribution of the chapter is an analysis of abnormality detection for different bones in radiographic images, discussed in terms of the pros and cons of using pretrained CNN architectures. The analysis shows that, of the five CNN architectures tested, VGG-19 is less biased than the others in finger and shoulder radiograph classification, and so the network is more reliable for the initial diagnosis of abnormality. One of the main challenges of the data set is that annotations for the different views of the data were missing, and providing them requires a domain expert. Hence, future work can aim at improving the performance of these models using different views of the data annotated by radiology experts. The performance might also be improved by incorporating resampling techniques to deal with data imbalance and segmentation techniques to train the models on noise-free data. The experimental analysis presented in this chapter can also be applied to other radiographic images available for the hand, forearm, elbow, etc. The present work can also be extended with the implementation of other advanced DL algorithms such as generative adversarial networks and capsule neural networks. It can be further extended to abnormality categorization with expert medical advice.

References [1] M.A. Anupama, V. Sowmya, K.P. Soman, Breast cancer classification using capsule network with preprocessed histology images, in: 2019 International Conference on Communication and Signal Processing (ICCSP), IEEE, April 2019, pp. 0143 0147. [2] N. Dey, A.S. Ashour, S. Borra (Eds.), Classification in BioApps: Automation of Decision Making, vol. 26, Springer, 2017, pp. 323 350. [3] S. Akarsh, S. Sriram, P. Poornachandran, V.K. Menon, K.P. Soman, Deep learning framework for domain generation algorithms prediction using long short-term memory, in: 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), IEEE, March 2019, pp. 666 671. [4] N. Deepika, V.S. Variyar, Obstacle classification and detection for vision based navigation for autonomous driving, in: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, September 2017, pp. 2092 2097. [5] K. Lan, D.T. Wang, S. Fong, L.S. Liu, K.K. Wong, N. Dey, A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42 (8) (2018) 139. [6] Y. Wang, Y. Chen, N. Yang, L. Zheng, N. Dey, A.S. Ashour, et al., Classification of mice hepatic granuloma microscopic images based on a deep convolutional neural network, Appl. Soft Comput. 74 (2019) 40 50. [7] E. Kang, J. Min, J.C. Ye, A deep convolutional neural network using directional wavelets for low dose Xray CT reconstruction, Med. Phys. 44 (10) (2017) e360 e375. [8] W.B. Sampaio, E.M. Diniz, A.C. Silva, A.C. De Paiva, M. Gattass, Detection of masses in mammogram images using CNN, geostatistic functions and SVM, Comput. Biol. Med. 41 (8) (2011) 653 664. [9] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification, Neurocomputing 321 (2018) 321 331. [10] F. Milletari, S.A. Ahmadi, C. Kroll, A. Plate, V. Rozanski, J. 
Maiostre, et al., Hough-CNN: deep learning for segmentation of deep brain regions in MRI and ultrasound, Comput. Vis. Image Underst. 164 (2017) 92 102. [11] Z. Li, N. Dey, A.S. Ashour, L. Cao, Y. Wang, D. Wang, et al., Convolutional neural network based clustering and manifold learning method for diabetic plantar pressure imaging dataset, J. Med. Imaging Health Inform. 7 (3) (2017) 639 652. [12] E.L. Hall, R.P. Kruger, S.J. Dwyer, D.L. Hall, R.W. Mclaren, G.S. Lodwick, A survey of preprocessing and feature extraction techniques for radiographic images, IEEE Trans. Commun. 100 (9) (1971) 1032 1044. [13] R.H. Morgan, An analysis of the physical factors controlling the diagnostic quality of roentgen images; unsharpness, Am. J. Roentgenol. Radium Ther. 62 (6) (1949) 870.


N. Harini et al.

[14] K. Doi, Optical transfer functions of the focal spot of X-ray tubes, Am. J. Roentgenol. Radium Therapy, Nucl. Med. 94 (1965) 712 718. [15] H.C. Shin, H.R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, et al., Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning, IEEE Trans. Med. Imaging 35 (5) (2016) 1285 1298. [16] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: 2012 Advances in neural information processing systems, NIPS, 2012, pp. 1097 1105. [17] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? in: 2014 Advances in neural information processing systems, NIPS, 2014, pp. 3320 3328. [18] P. Rajpurkar, J. Irvin, A. Bagul, D. Ding, T. Duan, H. Mehta, et al., Mura: large dataset for abnormality detection in musculoskeletal radiographs, in: arXiv preprint arXiv:1712.06957v4, 2017. [19] F. Chollet, Xception: deep learning with depthwise separable convolutions, in: 2017 Proceedings of the IEEE conference on computer vision and pattern recognition, IEEE, 2017, pp. 1251 1258. [20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: 2016 Proceedings of the IEEE conference on computer vision and pattern recognition, IEEE, 2016, pp. 2818 2826. [21] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: arXiv preprint arXiv:1409.1556, 2014. [22] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: 2017 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2017, pp. 4700 4708. [23] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., MobileNets: efficient convolutional neural networks for mobile vision applications, in: arXiv preprint arXiv:1704.04861, 2017. [24] A. Mueen, S. Baba, R. 
Zainuddin, Multilevel feature extraction and X-ray image classification, J. Appl. Sci. 7 (8) (2007) 1224 1229. [25] B.C. Ko, S.H. Kim, J.Y. Nam, X-ray image classification using random forests with local waveletbased CS-local binary patterns, J. Digit. Imaging 24 (6) (2011) 1141 1151. [26] R. Sa, W. Owens, R. Wiegand, M. Studin, D. Capoferri, K. Barooha, et al., Intervertebral disc detection in X-ray images using faster R-CNN, in: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, July 2017, pp. 564 567. [27] S. Lee, M. Choi, H.S. Choi, M.S. Park, S. Yoon, FingerNet: deep learning-based robust finger joint detection from radiographs, in: 2015 IEEE Biomedical Circuits and Systems Conference (BioCAS), IEEE, October 2015, pp. 1 4. [28] R. Ebsim, J. Naqvi, T.F. Cootes, Automatic detection of wrist fractures from posteroanterior and lateral radiographs: a deep learning-based approach, in: 2018 International Workshop on Computational Methods and Clinical Applications in Musculoskeletal Imaging, Springer, Cham, September 2018, pp. 114 125. [29] C. Tataru, D. Yi, A. Shenoyas, A. Ma, Deep learning for abnormality detection in chest X-ray images, 2017. [30] Y. Ding, J.H. Sohn, M.G. Kawczynski, H. Trivedi, R. Harnish, N.W. Jenkins, et al., A deep learning model to predict a diagnosis of alzheimer disease by using 18F-FDG PET of the brain, Radiology 290 (2) (2018) 456 464. [31] G. Kitamura, C.Y. Chung, B.E. Moore, Ankle fracture detection utilizing a convolutional neural network ensemble implemented with a small sample, de novo training, and multiview incorporation, J. Digit. Imaging 32 (4) (2019) 1 6. [32] J. Olczak, N. Fahlberg, A. Maki, A.S. Razavian, A. Jilert, A. Stark, et al., Artificial intelligence for analyzing orthopedic trauma radiographs: deep learning algorithms—are they on par with humans for diagnosing fractures? Acta Orthop. 88 (6) (2017) 581 586. 
[33] Thakker, D., Shah, V., Rele, J., Shah, V., & Khanapuri, J. Diagnosing child pneumonia using transfer learning. Int. J. Interdiscip. Res. Innov. 7(2), (pp: 100 104). [34] M.T. Islam, M.A. Aowal, A.T. Minhaz, K. Ashraf, Abnormality detection and localization in chest x-rays using deep convolutional neural networks, in: arXiv preprint arXiv:1705.09850, 2017. [35] A. Maier, C. Syben, T. Lasser, C. Riess, A gentle introduction to deep learning in medical image processing, Z. Med. Phys. 29 (2) (2019) 86 101.

CHAPTER SIX

Deep-wavelet neural networks for breast cancer early diagnosis using mammary termographies

Valter Augusto de Freitas Barbosa1, Maíra Araújo de Santana1, Maria Karoline S. Andrade1, Rita de Cássia Fernandes de Lima2 and Wellington Pinheiro dos Santos1

1 Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil
2 Department of Mechanical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil

6.1 Introduction

According to the World Health Organization (WHO), breast cancer is the most common type of cancer among women, affecting 2.1 million women each year worldwide [1]. Breast cancer also accounts for the largest number of cancer-related deaths among women. In 2018, an estimated 627,000 women died from breast cancer, which represents 15% of all deaths of women due to cancer [1]. In Brazil, the National Cancer Institute José Alencar Gomes da Silva (INCA) estimated 59,700 new cases of the disease in 2018 [2].

Risk factors that can trigger breast cancer can be classified into three categories: environmental and behavioral factors; factors of reproductive and hormonal history; and genetic and hereditary factors. Table 6.1 shows some risk factors for breast cancer, according to this classification. Yet having any of these risk factors does not mean that a woman will necessarily develop the disease [3]. Some habits, such as engaging in physical activity, maintaining adequate body weight, avoiding alcoholic beverages, and breastfeeding, can decrease breast cancer incidence by 30% [3]. However, the best current strategy to decrease disease morbidity and mortality is early detection [4]; in other words, finding the tumor in its initial stage increases the patient’s chances of cure.

There are two strategies for the early detection of breast cancer: early diagnosis and screening. Early diagnosis consists of identifying breast cancer, as early as possible, in symptomatic individuals. Screening is the identification of breast cancer in asymptomatic subjects [1,5].

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00007-7

© 2020 Elsevier Inc. All rights reserved.


Valter Augusto de Freitas Barbosa et al.

Table 6.1 Risk factors for breast cancer.

Environmental and behavioral factors:
• Obesity and overweight after menopause;
• Physical inactivity and sedentary lifestyle;
• Consumption of alcoholic beverages;
• Frequent exposure to ionizing radiation.

Factors of reproductive and hormonal history:
• First menstruation before 12 years;
• Not having children;
• First pregnancy after 30 years;
• Stopping menstruation (menopause) after age 55;
• Use of hormonal contraceptives (estrogen-progesterone);
• Postmenopausal hormone replacement, especially for more than five years.

Genetic and hereditary factors:
• Family history of ovarian cancer;
• Cases of breast cancer in the family, especially before the age of 50;
• Family history of breast cancer in men;
• Genetic alteration, especially in the BRCA1 and BRCA2 genes.

Source: Based on INSTITUTO NACIONAL DE CÂNCER JOSÉ ALENCAR GOMES DA SILVA, Câncer de mama. <https://www.inca.gov.br/tipos-de-cancer/cancer-de-mama>, 2018 (accessed 16.01.19).

The strategy of early diagnosis of breast cancer rests on three pillars: public awareness of the signs and symptoms of cancer; health professionals trained for evaluation; and health systems and services dedicated to confirming the diagnosis. Screening, in turn, uses simple tests in healthy individuals to identify the cancer in its asymptomatic stage [5].

The gold standard method for breast cancer screening is mammography. However, the technique has low sensitivity (especially for patients with dense breasts), a high rate of false positives, and exposes the patient to ionizing radiation, a particular risk in young patients (whose glandular breast tissue is sensitive to ionizing radiation). The exam also requires compressing the breast, which causes discomfort to the patient. Still, in terms of accuracy, cost, access, and risk, mammography is the best-established screening method for breast cancer [4,6–16].

In addition to mammography, other techniques such as ultrasound, nuclear magnetic resonance imaging, scintigraphy, thermography, and electrical impedance tomography can be used in screening. These methods are being explored as complementary tools in the diagnosis of breast cancer [4,6–17]. One technique that has been used increasingly in recent decades is breast thermography. It is noninvasive, painless, contact-free, and low cost when compared to mammography, ultrasonography, and magnetic resonance [18]. Women of all ages can use breast thermography, even those in groups for which mammography is not indicated [19].

Deep-wavelet neural networks for breast cancer early diagnosis using mammary termographies

Breast thermography is a procedure that produces an image of the patient’s skin temperature distribution (surface temperature distribution). Healthy tissue has a certain pattern of temperature distribution according to its metabolic activity, which in this context can be understood as a process of heat generation. Thus, when a physiological disturbance alters the metabolic activity of the tissue, its temperature distribution changes, so that the source of the disturbance can be identified in a thermographic image. Breast lesion growth is associated with angiogenesis (production of new blood vessels) and increased blood flow. Such factors raise the local skin temperature by 1°C–2°C [20], so the lesion becomes noticeable through thermographic imaging. Breast thermography is in fact a functional (physiological) exam, which allows a lesion to be identified in its early stages. As physiological changes often precede anatomical changes, breast thermography can identify a lesion even before mammography, which is an anatomical examination [19,21]. By analyzing the distribution of temperature and blood vessels in the breast, signs of a possible cancer or an expansion of precancerous cells may be found up to 10 years earlier than with other techniques [18,19].

Although it is a promising technique, the analysis of thermographic images is not a simple task. Currently it is done by comparing two contralateral images. When the images are almost symmetrical, small asymmetries may indicate a region of abnormality, and these small differences may not be easy to identify [19]. The analysis becomes harder for a deeper lesion: the impact of its presence on the breast skin temperature is neither punctual nor intense. It may still be seen in a thermographic image, but in a distributed and low-intensity way. Therefore, the development of an automatic method that eliminates human factors is important for increasing the performance of this technique [19]. Intelligent systems based on the automatic segmentation of regions of interest have been shown to be very efficient [22,23], especially those using shape and texture descriptors and wavelet series decomposition, combined with connectionist learning machines [6–15,24–37].

6.2 Related works

Convolutional neural network (CNN) approaches have been used for different medical image classification purposes, including early breast cancer detection. For example, Refs. [38–40] use CNNs to classify normal and abnormal mammograms. However, few deep learning researchers are applying their analysis to breast infrared (IR) images, and those who do generally limit their research to classifying images only as normal or abnormal. In Ref. [41], researchers used an IR image data set acquired when the breasts were in thermal equilibrium with the room after having been cooled with an air stream (a process called dynamic protocol). The IR images are then segmented to remove the neck and arm regions. Finally, they are submitted to a three-layer deep neural network (DNN), in which the last layer is retrained. The resulting feature matrix is used as input to a support vector machine (SVM) that classifies the images according to the possibility that the patient has cancer. Their model classified an image with cancer with a confidence of 0.78 and a healthy image with a confidence of 0.94. One approach that we consider in this chapter, beyond classifying images as normal or abnormal, is to classify the abnormal images into categories such as benign lesion, malignant lesion, and cyst, as was done in Ref. [29].

6.3 Breast thermography

Every body with a temperature above absolute zero [the lower temperature limit, given by 0 on the Kelvin scale (0 K), which equals −273.15°C] emits electromagnetic waves, also called thermal radiation [42]. In our daily lives, the electromagnetic waves emitted by the bodies around us have frequencies in the infrared range. In cases such as metals heated to high temperatures, this radiation reaches the visible spectrum: the metals become incandescent and emit a reddish light. A practical example is the incandescent lamp, which uses a tungsten filament heated to high temperatures to generate visible light. The rate of emission of energy by thermal radiation per unit of time is given by Eq. (6.1):

$$P_{\mathrm{rad}} = \sigma \varepsilon A T^{4} \tag{6.1}$$

where $\sigma = 5.6704 \times 10^{-8}\ \mathrm{W/(m^{2}\,K^{4})}$ is the Stefan–Boltzmann constant, $\varepsilon$ is the thermal emissivity, a surface property of the object, and $A$ and $T$ are the surface area and the temperature (in kelvins) of the object, respectively [42]. However, as an object emits thermal radiation, it also absorbs thermal radiation from other bodies around it. Thus, Eq. (6.2) gives the net rate of energy exchange by thermal radiation, that is, absorbed minus emitted:

$$P_{\mathrm{liq}} = \sigma \varepsilon A \left(T_{\mathrm{amb}}^{4} - T^{4}\right) \tag{6.2}$$

where $T_{\mathrm{amb}}$ is the temperature (in kelvins) of the environment in which the object is located, assumed to be uniform [42].
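As a rough numerical illustration of Eqs. (6.1) and (6.2), the sketch below evaluates both expressions in Python. The function names, the emissivity, the area, and the two temperatures are assumed values chosen only for illustration; they are not taken from the chapter or from Ref. [42].

```python
SIGMA = 5.6704e-8  # Stefan-Boltzmann constant, W/(m^2 K^4)


def radiated_power(emissivity, area_m2, temp_k):
    """Eq. (6.1): power emitted by thermal radiation, in watts."""
    return SIGMA * emissivity * area_m2 * temp_k ** 4


def net_power(emissivity, area_m2, temp_k, ambient_k):
    """Eq. (6.2): net rate (absorbed minus emitted), in watts."""
    return SIGMA * emissivity * area_m2 * (ambient_k ** 4 - temp_k ** 4)


# Assumed, illustrative values (not from the chapter):
emissivity = 0.98      # hypothetical emissivity of the surface
area = 0.01            # hypothetical 10 cm x 10 cm patch, in m^2
skin_k = 306.15        # hypothetical surface temperature (33 degrees C)
room_k = 296.15        # hypothetical room temperature (23 degrees C)

emitted = radiated_power(emissivity, area, skin_k)
net = net_power(emissivity, area, skin_k, room_k)
print(f"emitted: {emitted:.3f} W, net: {net:.3f} W")
```

With these values the net rate is negative: a surface warmer than its surroundings loses more energy by radiation than it absorbs, which is the signal a thermographic camera picks up.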


Human skin emits infrared radiation at wavelengths in the range of 2–20 µm, with an average peak between 9 and 10 µm [19]. Accordingly, a thermographic camera has sensors sensitive to electromagnetic radiation in the infrared spectrum, not in the visible light range like ordinary photographic cameras. In this way, thermal (IR) radiation emitted by a point on the skin can be directly converted to a temperature value representing that point and then mapped to a pixel of a digital image [18]. Therefore, a thermographic image is a map of the temperature distribution of a given region, which can be presented in either grayscale or pseudocolor.

Lawson, in 1956, performed one of the first studies to associate breast surface temperature with the presence of a lesion. In his study, he measured the skin temperature directly with a thermocouple. In a group of 26 breast cancer patients, he found an average temperature increase of 2.27°F (equivalent to 1.26°C) in the tumor area or ipsilateral areola (as shown in Fig. 6.1). In the same article, he cites the study by Massopust (also considered one of the pioneers of breast thermography [43]) on infrared images of the superficial blood vessels. However, he closes the citation by pointing to the technical limitations of the time, which made the technique unfeasible. In fact, initial problems such as the low sensitivity of the detectors were a huge source of error for thermography, limiting and delaying its acceptance until the 1990s [19,20], when infrared sensor technology was no longer restricted to military purposes. At that time, the second generation of infrared cameras was introduced, which corrected some of the problems of earlier cameras, such as thermal fluctuations due to the internal heat of the equipment during use (known as thermal drift), low sensitivity, and image acquisition time [44,45].

Currently we have cameras that provide digital thermographic images. This is a huge advantage, as it allows computational tools for digital image processing, such as artificial neural networks, to be used to assess these images [4]. Fig. 6.2 shows thermographic images from patients without lesion, with cyst, with benign lesion, and with malignant lesion. In addition to its application in the detection of breast tumors, thermography can also be used in other areas such as orthopedics, dentistry, cardiology, endocrinology, forensic medicine, hemodynamics, obstetrics, physiotherapy, and ergonomics [20].

6.3.1 Breast thermographic images acquisition protocol

Thermograms are sensitive to changes in temperature, humidity, and ambient ventilation. Accordingly, their acquisition must be performed under controlled conditions [18]. This section describes the protocol used by the team at the Department of Mechanical Engineering of the UFPE to acquire the breast thermographic images. The protocol was proposed in Ref. [47]. Fig. 6.3 summarizes the protocol approach. Each step is described in the following subsections.


Figure 6.1 Temperature changes of breasts in a patient with unilateral carcinoma. (A) Left breast with carcinoma. (B) Right breast without anomaly. Based on R. Lawson, Implications of surface temperatures in the diagnosis of breast cancer, Can. Med. Assoc. J. 75 (4) (1956) 309 [46].

6.3.1.1 Room preparation

For the acquisition of thermographic images, the working environment must be prepared to avoid external noise in the acquired images. Therefore, the following precautions should be taken:
• Climate control of the room with an air conditioner;
• Measurement of room temperature and relative humidity (to be set in the infrared camera);
• Airflow should be avoided, so windows should be kept closed and door openings should be controlled;
• Fluorescent lamps may be used in the room, but they should be turned off during patient acclimatization and exam.

Figure 6.2 Examples of breast thermography images for patients (A) with no lesion, (B) with cyst, (C) with benign lesion, and (D) with malignant lesion.

6.3.1.2 Patient preparation

Due to the attendance procedure adopted by the hospital, the patients spend at least two hours without exposure to sunlight, without physical exercise, without overeating or overdrinking, and without bathing [47]. Because this is a research project, the whole examination process is explained to the patient, and if the patient gives her consent, she must sign the Informed Consent Form (TCLE). In addition, the technical staff makes copies of the patient record and exams (if available). All of this becomes part of the patient documentation regarding the exam.


Figure 6.3 Flowchart of breast thermal imaging protocol.

After that, the patient puts on a disposable gown and waits for 10 minutes without touching the breasts, so that the metabolic heat emitted by the patient decreases. This period is called the acclimatization period, during which the patient reaches thermal equilibrium with the room.

6.3.1.3 Images acquisition

To standardize the thermographic images, the acquisition is done using the apparatus shown in Fig. 6.4. It consists of a rotating chair to accommodate the patient and an upper support where the patient can rest her arms. The camera is mounted on a tripod placed on a cart on rails; the rails help adjust the distance between the camera and the patient. Before the exam, the camera is configured with the following parameters: emissivity, ambient temperature, relative humidity, and distance between the camera and the patient. Finally, with the patient properly accommodated, the thermographic images are acquired. We acquire two series of images: one with a fixed distance between the camera and the patient and the other with variable distance. The second series provides more specific information for medical visual analysis. The first series is intended for digital image processing methods, so this work focuses on it. This series has images in the following positions: frontal of both breasts (T1); frontal of both breasts with hands raised, resting on the upper support of the apparatus (T2); external lateral of the right breast and of the left breast (LEMD and LEME); and internal lateral of the isolated right and left breasts (LIMD and LIME). Fig. 6.5 shows examples of these images.

Figure 6.4 Apparatus for breast thermographic images acquisition. Based on M.M.d. Oliveira, Desenvolvimento de protocolo e construção de um aparato mecânico para padronização da aquisição de imagens termográficas de mama (dissertation), Federal University of Pernambuco, Recife, 2016.

Figure 6.5 Examples of positions for images acquisition per patient, at a fixed distance: (A) T1, (B) T2, (C) LEMD, (D) LEME, (E) LIMD, (F) LIME.


For this study, the images were obtained using a FLIR Systems thermographic camera, model ThermaCAM™ S45. This camera has a field of view of 24 × 18 degrees, 0.3 m; spatial resolution of 1.3 mrad; an uncooled microbolometer focal plane array (FPA) detector with 320 × 240 pixels; spectral range of 7.5–13 µm; and the following standard temperature ranges: −40°C to 120°C, −10°C to 55°C, 0°C to 500°C, and 350°C to 1500°C [47]. The range used for breast thermography is −10°C to 55°C, with a thermal sensitivity of 0.06°C and an accuracy of ±1°C [47].

6.4 Deep-wavelet neural network

The deep-wavelet neural network (DWNN) is a deep learning method for feature extraction based on the Mallat algorithm for wavelet decomposition at multiple levels [48]. In wavelet decomposition, low-pass and high-pass filters are applied to an image, resulting in a set of other images. The images resulting from the low-pass and high-pass filters are called approximations and details, respectively [48]. The approximations highlight the smooth content of the original image, while the details highlight the edges (or regions of discontinuity). This strategy is used in pattern recognition because it allows images to be analyzed in both the spatial and frequency domains [48].

In the DWNN approach, a neuron is created by combining a given filter with an image size reduction step, called downsampling. The whole process is shown in Fig. 6.6. All filters used in the DWNN form a filter bank, which is kept fixed throughout the process. Suppose the bank has n filters. An input image is then submitted to n neurons, which form the first intermediate layer of the neural network. In the second layer, each image resulting from the first layer is individually submitted to the same filter bank and downsampling, as was done for the input image. The process repeats for the third and subsequent intermediate layers. Finally, the output layer of the DWNN holds the synthesis blocks, which are responsible for extracting information from the images resulting from the whole process. This approach is outlined in Fig. 6.7. The filter bank, downsampling, and the synthesis block are detailed in the following sections.

Figure 6.6 DWNN neuron. The g_i box represents any filter of the network's filter bank and the ↓2 box represents downsampling. X and Y are the input and output images of the neuron, drawn to compare their respective sizes.

Figure 6.7 Outlining the deep-wavelet neural network (DWNN). The input image passes through m hidden layers of neurons (filters g1, ..., gn, each followed by ↓2 downsampling); synthesis blocks (SB) then produce the outputs x1, ..., xN, with N = n^m.

6.4.1 Filter bank

The filter bank used in the DWNN is fixed and composed of orthogonal filters. Taking $S$ as the domain of the image (also called its support) and $\mathbb{R}$ as the set of real numbers, the orthogonal filters are of type $g_i: S \to \mathbb{R}$, for $1 \le i \le n$. The filter bank $G$ may then be represented mathematically by:

$$G = \{g_1, g_2, g_3, \ldots, g_n\} \tag{6.3}$$

Before defining the DWNN filter bank, we need to define which neighborhood is considered during filtering. Only then can we define the $g_i$ filters. To better understand this process, take as an example a neighborhood of 8 pixels (neighborhood-8). This means that, when analyzing an arbitrary pixel $\vec{u} = (i, j)$, we consider as its neighbors the pixels:

$$\{(i+1, j),\ (i-1, j),\ (i, j+1),\ (i, j-1),\ (i+1, j+1),\ (i+1, j-1),\ (i-1, j+1),\ (i-1, j-1)\}$$

So the lateral, vertical, and diagonal pixels are neighbors of the analyzed pixel, as shown in Fig. 6.8A. Considering a neighborhood-8, we can build an orthonormal basis containing a total of five filters. Four of those filters are bandpass filters, each with one specific orientation selectivity [48], so each filter highlights details in a given orientation. Fig. 6.8B presents the orientations of these filters for a neighborhood-8: $g_1$ is the vertical high-frequency filter, responsible for highlighting horizontal edges; $g_2$ is the horizontal high-frequency filter, which highlights vertical edges; and $g_3$ and $g_4$ are the diagonal filters, which highlight the diagonal details of the image. The filters $g_1$, $g_2$, $g_3$, and $g_4$ thus form the set of high-pass filters, also classified as derivative filters because they highlight the discontinuities of the input image.

Figure 6.8 (A) Neighborhood-8: all gray pixels are considered neighbors of pixel $\vec{u}$. (B) High-pass filters with orientation selectivity: $g_1$, filter with vertical selectivity; $g_2$, horizontal filter; $g_3$ and $g_4$, diagonal filters.


Figure 6.9 Example of normal low-pass filter, where the mask center (2,2) is the pixel to be replaced during filtering.

Figure 6.10 (A) Neighborhood of 24 pixels. (B) High-pass filters with orientation selectivity for a 24-pixel neighborhood.

In addition to the high-pass filters, the DWNN filter bank has one more filter, a low-pass filter ($g_5$), which acts as a smoother. This is an integrating filter whose purpose is to highlight the homogeneous areas of the image. An example of a normal low-pass filter for neighborhood-8 is shown in Fig. 6.9. The filters shown in Figs. 6.8B and 6.9 are valid for a neighborhood-8; a different neighborhood requires other filters. For example, for a neighborhood of 24 pixels, as shown in Fig. 6.10A, the high-pass filters with orientation selectivity are those shown in Fig. 6.10B. Note that in this case we have eight high-pass filters $(g_1, g_2, \ldots, g_8)$. For this neighborhood, the low-pass filter is a 5 × 5 matrix whose terms are all 1/25, following the same principle as the filter given in Fig. 6.9 for a neighborhood-8. Thus, when we choose a neighborhood of 24 pixels, the DWNN filter bank is the orthonormal set formed by the filters shown in Fig. 6.10B plus one low-pass filter, for a total of nine filters.
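To make the idea of a fixed filter bank concrete, the following is a minimal Python sketch of a five-filter bank for a neighborhood-8. Only the low-pass mask follows directly from the text (all terms equal to 1/size², so 1/25 for a 5 × 5 mask). The four directional masks are Sobel-like kernels assumed here purely for illustration: the chapter's actual masks appear only in Fig. 6.8B, and these stand-ins are not claimed to form an orthonormal set.

```python
import numpy as np


def low_pass(size):
    """Averaging mask: a size x size matrix whose terms are all 1/size**2."""
    return np.full((size, size), 1.0 / size**2)


# Hypothetical Sobel-like directional masks (assumption, not the chapter's).
g1 = np.array([[-1.0, -2.0, -1.0],
               [0.0, 0.0, 0.0],
               [1.0, 2.0, 1.0]])    # vertical derivative: responds to horizontal edges
g2 = g1.T                           # horizontal derivative: responds to vertical edges
g3 = np.array([[0.0, 1.0, 2.0],
               [-1.0, 0.0, 1.0],
               [-2.0, -1.0, 0.0]])  # one diagonal orientation
g4 = np.fliplr(g3)                  # the other diagonal orientation
g5 = low_pass(3)                    # low-pass (smoothing) filter

filter_bank = [g1, g2, g3, g4, g5]

for index, mask in enumerate(filter_bank, start=1):
    print(f"g{index}: sum of terms = {mask.sum():.3f}")
```

The terms of each high-pass mask sum to zero, so homogeneous regions are suppressed and only discontinuities remain, while the low-pass mask sums to one and preserves the local average.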

6.4.2 Downsampling

The second step of a DWNN neuron is downsampling, which is responsible for reducing the size of the image. It replaces four pixels of the image with only one (see Fig. 6.11). Consider any function $\varphi_{\downarrow 2}: \mathbb{R}^{4} \to \mathbb{R}$, where $\varphi_{\downarrow 2}(\cdot)$ can be the maximum function (returns the highest of the input values), the minimum function (returns the smallest value), or the average or the median of the pixel values. Other functions of type $\mathbb{R}^{4} \to \mathbb{R}$ may also be used. Then the pixels $a'$, $b'$, $c'$, $d'$, identified in Fig. 6.11, have their values given by:

$$a' = \varphi_{\downarrow 2}(a_1, a_2, a_3, a_4)$$

$$b' = \varphi_{\downarrow 2}(b_1, b_2, b_3, b_4)$$

$$c' = \varphi_{\downarrow 2}(c_1, c_2, c_3, c_4)$$

$$d' = \varphi_{\downarrow 2}(d_1, d_2, d_3, d_4)$$
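The downsampling step can be sketched in a few lines of Python with NumPy. The function name `downsample` and the choice to crop a trailing odd row or column are assumptions made for this illustration; the chapter does not say how odd-sized images are handled.

```python
import numpy as np


def downsample(image, reducer=np.max):
    """Replace each non-overlapping 2 x 2 block of pixels with one value.

    `reducer` plays the role of the R^4 -> R function (max, min, mean, median).
    """
    image = np.asarray(image)
    rows, cols = image.shape
    rows, cols = rows - rows % 2, cols - cols % 2  # crop odd sizes (assumption)
    blocks = image[:rows, :cols].reshape(rows // 2, 2, cols // 2, 2)
    return reducer(blocks, axis=(1, 3))


# A 10 x 10 test image (100 pixels) with values 0..99.
image = np.arange(100, dtype=float).reshape(10, 10)

for name, reducer in [("max", np.max), ("min", np.min),
                      ("mean", np.mean), ("median", np.median)]:
    reduced = downsample(image, reducer)
    print(name, reduced.shape, reduced[0, 0])
```

Each call returns a 5 × 5 image (25 pixels), one value per 2 × 2 block, which matches the size reduction illustrated in Fig. 6.11.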

In the case shown in Fig. 6.11, a 10 × 10 image (100 pixels) is reduced to a 5 × 5 image (25 pixels). Downsampling has the useful property of decreasing memory consumption during the execution of the algorithm. Consider, for example, a 4096-pixel image submitted to the neighborhood-8 orthonormal filter bank (the first intermediate layer of the DWNN). Without downsampling, the result is $n = 5$ other images, each containing the same number of pixels, which increases the amount of data by a factor of five. With more layers of the DWNN, the amount of data would grow exponentially, by a factor of $n^m$.

Figure 6.11 Downsampling ðφk2 Þ, notice that through the process the size of the image has been reduced in one quarter.


Such growth could make the process unfeasible. With downsampling, however, while the number of images increases by a factor of $n^m$, the size of each image is reduced by a factor of $4^{-m}$. An especially interesting situation occurs for a neighborhood-8, for which the filter bank has five filters in all. We can combine the diagonal filters ($g_3$ and $g_4$ in Fig. 6.8B) to work with a bank containing four filters. In this special case, after m layers of DWNN neurons, we have $4^m$ images, each reduced, relative to the input image, by a factor of $4^{-m}$, so the amount of data remains constant throughout the execution of the algorithm. For the 4096-pixel image from the previous example, the first layer of the DWNN yields four images of 1024 pixels each, which is again 4096 pixels in total.

Having described the filter bank and downsampling, we can return to the schematic of a DWNN neuron shown in Fig. 6.6. If X and Y are the input and output images of a given neuron, respectively, Y is related to X as follows:

$$Y := \varphi_{\downarrow 2}(g_i * X), \tag{6.4}$$

where the symbol $*$ stands for convolution.
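As an illustration of Eq. (6.4) and of the data-volume argument above, the following minimal sketch implements one neuron as a zero-padded convolution followed by downsampling by two. The 3 × 3 kernels, the zero padding, and all function names are assumptions made for this example; they are not the filter bank of Fig. 6.8 nor the SID-Termo implementation.

```python
# Illustrative sketch of one DWNN neuron: Y = downsample_by_2(g_i * X).
# The 3x3 kernels below are hypothetical placeholders, not the chapter's filter bank.

def convolve2d_same(image, kernel):
    """Zero-padded 2D convolution; the output has the same size as `image`."""
    rows, cols = len(image), len(image[0])
    half = len(kernel) // 2  # kernel assumed square with an odd side length
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(-half, half + 1):
                for j in range(-half, half + 1):
                    rr, cc = r - i, c - j  # index flip makes this a true convolution
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[i + half][j + half] * image[rr][cc]
            out[r][c] = acc
    return out


def downsample_by_2(image):
    """Keep every second row and column, so the pixel count drops to one quarter."""
    return [row[::2] for row in image[::2]]


def dwnn_neuron(image, kernel):
    """One neuron: filter the input, then downsample the result."""
    return downsample_by_2(convolve2d_same(image, kernel))


def dwnn_layer(images, filter_bank):
    """Apply every filter of the bank to every input image."""
    return [dwnn_neuron(img, g) for img in images for g in filter_bank]


# Hypothetical four-filter bank (one low-pass and three difference filters).
LOW_PASS = [[1 / 9] * 3 for _ in range(3)]
HORIZONTAL = [[0, 0, 0], [-1, 0, 1], [0, 0, 0]]
VERTICAL = [[0, -1, 0], [0, 0, 0], [0, 1, 0]]
DIAGONAL = [[-1, 0, 0], [0, 0, 0], [0, 0, 1]]
FILTER_BANK = [LOW_PASS, HORIZONTAL, VERTICAL, DIAGONAL]

image = [[float((r + c) % 7) for c in range(64)] for r in range(64)]  # 4096 pixels
layer1 = dwnn_layer([image], FILTER_BANK)  # 4 images of 32 x 32 pixels
layer2 = dwnn_layer(layer1, FILTER_BANK)   # 16 images of 16 x 16 pixels
for name, layer in (("layer 1", layer1), ("layer 2", layer2)):
    total = sum(len(y) * len(y[0]) for y in layer)
    print(name, len(layer), "images,", total, "pixels in total")
```

With four filters, each layer multiplies the number of images by four and divides the size of each image by four, so the total stays at 4096 pixels, as stated in the text.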

6.4.3 Synthesis block

The output layer of the DWNN is built from synthesis blocks. Each block extracts a single value from one intermediate image, thus representing it, as shown in Fig. 6.12. In the synthesis blocks, each of the $n^m$ images is submitted to a function $\phi: S \to \mathbb{R}$. Among other possibilities, $\phi(\cdot)$ can be the maximum, minimum, average, or median; its purpose is to replace the entire image with a single value. Thus, with $f(\vec{u}) \in \mathbb{R}$ denoting the value of pixel $\vec{u}$, the value $x_i \in \mathbb{R}$ shown in Fig. 6.12 is obtained as follows:

$$x_i = \phi\big(f(\vec{u}),\ \forall\, \vec{u} \in S\big). \tag{6.5}$$

Figure 6.12 Schematization of the synthesis process.


Valter Augusto de Freitas Barbosa et al.

At the end of the DWNN, after applying the synthesis block to all images resulting from the m intermediate layers, we have a set of terms $x_i$ ($1 \le i \le n^m$). This set can be understood as the features of the input image (the image representation). Applying the DWNN to a set of images yields a database that can be used as input to a classifier.
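The synthesis step can be sketched in a few lines. The code below is only an illustration: the function names and the toy 2 × 2 images are hypothetical, and the choice among maximum, minimum, mean, and median is passed as a parameter, following the options listed for the synthesis function.

```python
from statistics import mean, median

# Candidate synthesis functions named in the text.
SYNTHESIS_FUNCTIONS = {"max": max, "min": min, "mean": mean, "median": median}


def synthesize(image, how="max"):
    """Replace a whole image (a list of pixel rows) with a single value."""
    pixels = [p for row in image for p in row]
    return SYNTHESIS_FUNCTIONS[how](pixels)


def feature_vector(images, how="max"):
    """One feature per image produced by the last intermediate layer."""
    return [synthesize(img, how) for img in images]


# Toy stand-in for the 4**2 = 16 images left after two layers with four filters.
toy_images = [[[i, i + 1], [i + 2, i + 3]] for i in range(16)]
features = feature_vector(toy_images, how="max")
print(len(features), "features:", features)
```

Each input image is thereby described by $n^m$ numbers, regardless of its original resolution.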

6.5 Classification

After extracting features from the images using the DWNN method, we evaluated the performance of several algorithms in the tasks of detecting and classifying breast lesions. In this study we use an SVM, a multilayer perceptron (MLP), an extreme learning machine (ELM), and a morphological extreme learning machine (mELM). Table 6.2 shows the parameters we set for each method. To generate these results, we used the software SID-Termo for feature extraction [29], and GNU Octave [49,50] and Weka [51] for classification using the ELMs and the other machine learning methods, respectively.

The SVM performs a nonlinear mapping of the data set into a higher-dimensional space and then builds a hyperplane to separate the distinct classes. It is common to vary the function that builds this boundary (the kernel) to see which one best fits the data set [52].

The MLP is a network with feedforward connections. It has a set of sensory units that make up the input layer, an intermediate layer (hidden layer), and the output layer. This neural network usually learns through the backpropagation algorithm [52].

Table 6.2 Parameters set for each classifier.

| Classifier | Parameters |
|---|---|
| SVM | Linear kernel; RBF kernel |
| MLP | Neurons in hidden layer: a (see note); Learning rate: 0.3; Momentum: 0.2; Iterations: 500; Activation function: sigmoid |
| ELM | Neurons in hidden layer: 100; Kernel: sigmoid; Range of weights: [−1, 1] |
| mELM | Neurons in hidden layer: 100; Kernel: dilation and erosion; Range of weights: [−1, 1] |

Note: a = (# attributes + # classes)/2. The constant a is a heuristic adopted in the Weka implementation [51].


The ELM is a training approach for single-hidden-layer feedforward neural networks. This classifier randomly generates the input weights instead of learning them iteratively, so it is usually associated with fast learning phases. This characteristic may be relevant in applications that require repeated training [53]. Proposed by Azevedo et al. (2015), the mELM applies nonlinear kernels based on mathematical morphology operators, which perform dilation and erosion on the data set [54].

For all the aforementioned classifiers, we performed tests using 10-fold cross-validation to avoid overfitting. In K-fold cross-validation, the algorithm splits the data set into K subgroups. Then, it takes the subgroups one at a time as the testing set and trains the classifier on the remaining subgroups. The final performance is the mean over the K tests [55]. To obtain statistically meaningful results, we performed 30 experiments per configuration of each classifier.
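The following is a minimal sketch of the standard K-fold procedure, in which each subgroup is held out once for testing while the remaining K − 1 subgroups are used for training. It is not the Weka or Octave code used in the experiments. The majority-class classifier and the toy labels are hypothetical stand-ins that keep the example self-contained.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(samples, labels, fit, predict, k=10, seed=0):
    """Return the mean accuracy over the k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k, seed):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical stand-in classifier: always predicts the most frequent training label.
def fit_majority(samples, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


toy_samples = list(range(40))
toy_labels = [0] * 30 + [1] * 10
mean_accuracy = cross_validate(toy_samples, toy_labels, fit_majority, predict_majority)
print("mean accuracy over 10 folds:", mean_accuracy)
```

Repeating the call with 30 different values of `seed` would mimic the 30 experiments per configuration mentioned in the text.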

6.5.1 Experimental results and discussion

The patients submitted to breast thermography in this study were diagnosed according to established methods, such as clinical exams, biopsies, mammograms, and ultrasonography [56]. The possible diagnoses are cyst, benign lesion, malignant lesion, and without lesion, and we divided the infrared images into classes according to these diagnoses. In a second step, we combined the images of patients with cyst, benign lesion, and malignant lesion to create a class called with lesion. Moreover, we used only T1 and T2 frontal images to perform the experiments, since previous studies from our group showed that the use of frontal images favors classifier performance.

Overall, we use 336 images in this study: 73 of cysts, 121 of benign lesions, and 76 of malignant lesions (a total of 270 images with lesion), and 66 without lesion. All patients in the database have a suspected breast lesion and are older than 35 years, the age from which mammograms are allowed in Brazil.

After acquisition, we preprocessed the images to convert them from RGB-JET to grayscale. In the conversion, lighter shades of gray indicate higher temperatures. We also performed class balancing, since the number of images in each group is different. To do so, we created new synthetic instances through linear combinations of attribute vectors of the same class [38].

6.5.1.1 Lesion detection

In the first experiments, we performed a binary classification. We used the classes with and without lesion in order to assess the algorithms' ability to detect the existence of any kind of lesion in the breast tissue. To evaluate our results, we employed the kappa index, an important metric that accounts for classification matches, mismatches, and errors in breast cancer image diagnosis and other applications [57–60]. We adopted the classical interpretation of the kappa index [61–65].
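For readers unfamiliar with the kappa index, the sketch below computes Cohen's kappa from a confusion matrix and maps it onto the agreement bands of Landis and Koch [65]. It assumes that this scale is the "classical interpretation" meant here. The confusion matrix is invented for illustration and is not taken from the experiments.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / total ** 2
    return (observed - expected) / (1 - expected)  # undefined when expected == 1


def landis_koch_label(kappa):
    """Agreement bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial"))
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical binary confusion matrix (with lesion / without lesion).
toy_confusion = [[40, 10],
                 [5, 45]]
kappa = cohen_kappa(toy_confusion)
print(round(kappa, 3), landis_koch_label(kappa))
```

In this toy case the accuracy is 0.85, but because half of the agreement is expected by chance, kappa is lower (about 0.70). This is why kappa is regarded as a stricter metric than accuracy.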


Fig. 6.13 shows the accuracy and kappa statistic obtained when using the DWNN with two levels, which yields 16 features. Most of the classifiers achieved an accuracy of around 70%, except for the SVM with RBF kernel, which did not reach 60%. Regarding the kappa statistic, only the ELM and the mELMs achieved a moderate result, and their results were very similar to each other. For this database we found a weak kappa, below 0.40, for the SVM and the MLP. The SVM with an RBF kernel performed worst on both metrics, with a kappa lower than 0.10 and an accuracy of around 55%. Dispersion was similar for all tested configurations.

After increasing the number of levels to four, the number of features grew substantially, to 256. The database could thus be better represented, which resulted in better classification performance, as can be seen in Fig. 6.14. The classifiers' accuracy increased by almost 20 percentage points. For this data set, the SVM with linear kernel and the mELMs performed best, with accuracy close to 90%. Regarding the kappa statistic, we also observe an

Figure 6.13 Results of (A) accuracy and (B) kappa statistic for lesion detection using two levels in the DWNN.


Figure 6.14 Results of (A) accuracy and (B) kappa statistic for lesion detection using four levels in the DWNN.

appreciable increase in this metric for all classifiers. However, the linear SVM performed slightly worse than the mELMs on this metric: the mELMs had a kappa above 0.80, while the kappa for the SVM with linear kernel was around 0.75. The MLP had an intermediate performance but higher dispersion than the other methods. Again, the SVM with an RBF kernel obtained the least satisfactory results, but now with more than 70% accuracy and a kappa of around 0.55. Table 6.3 shows the mean and standard deviation of these results. The C2 columns show the results using two levels, while C4 stands for the data set obtained with four levels in the DWNN.

Table 6.3 Classification results regarding the lesion detection problem (with or without lesion).

| Classifier | Kernel | Accuracy (%), C2 | Accuracy (%), C4 | Kappa, C2 | Kappa, C4 |
|---|---|---|---|---|---|
| SVM | Linear | 66.4 ± 3.7 | 86.9 ± 2.9 | 0.330 ± 0.070 | 0.740 ± 0.060 |
| SVM | RBF | 54.8 ± 2.2 | 76.2 ± 3.2 | 0.090 ± 0.040 | 0.520 ± 0.060 |
| MLP | – | 68.4 ± 4.1 | 81.2 ± 7.3 | 0.370 ± 0.080 | 0.62 ± 0.15 |
| ELM | Sigmoid | 67.5 ± 4.5 | 79.4 ± 4.2 | 0.602 ± 0.056 | 0.742 ± 0.055 |
| ELM | Dilation | 68.1 ± 4.4 | 86.4 ± 3.2 | 0.606 ± 0.056 | 0.826 ± 0.042 |
| ELM | Erosion | 68.3 ± 4.2 | 86.4 ± 3.3 | 0.608 ± 0.056 | 0.826 ± 0.043 |

6.5.1.2 Lesion classification

In a second step, we used the three classes of lesions to perform the experiments. By doing so, we aimed to verify the classifiers' performance in differentiating each type of lesion. So, in the lesion classification problem we have the classes cyst, benign lesion, and malignant lesion.

When using two levels in the DWNN, most of the classifiers achieved an accuracy of around 50%, except for the SVM with an RBF kernel, whose accuracy was around 35%. Regarding the kappa statistic, the ELM and the mELMs achieved slightly better results, all around 0.45. We saw a kappa below 0.30 for the SVM and MLP algorithms. The SVM with an RBF kernel showed the worst performance. Dispersion was similar for all tested configurations. Fig. 6.15 illustrates these results.

Finally, in Fig. 6.16 we present the results obtained when we use four levels in the DWNN. This time, the best accuracy and kappa were found with the MLP classifier. For this database, the MLP achieved a little less than 90% accuracy and a kappa of 0.80. It was closely followed by the SVM with a linear kernel and the mELMs, with both dilation and erosion kernels, all with an accuracy of around 80% and a kappa of around 0.70. Once more, the SVM with an RBF kernel presented the worst results, with an accuracy of around 60% and a kappa of 0.40. In a manner similar to Table 6.3, Table 6.4 shows the mean and standard deviation of the results, but now for the three-class problem.

6.6 Conclusion

In this chapter, we introduced the DWNN as a feature extraction method for image representation. We applied this tool to represent thermographic images in order to identify breast lesions. The technique is being explored for use in an automatic system to support health professionals in breast cancer diagnosis. One of the main advantages of the DWNN method is its reduced computational cost, since feature extraction takes only a few seconds. From our results, we found that as the number of features increases, by adding more levels to the DWNN, we could achieve better performance in solving the


Figure 6.15 Results of (A) accuracy and (B) kappa statistic for lesion classification using two levels in the DWNN.

classification problem. This behavior was observed in both the detection and the classification of lesions. We also noticed that this exponential increase in the number of features results in feature redundancy. This redundancy explains the fast growth in SVM performance when we add more levels to the DWNN, since the SVM considers the most relevant features during its training process. This kind of feature selection avoids redundancy and therefore improves the algorithm's performance. In all cases, data dispersion was greater for the kappa statistic than for accuracy. This behavior was expected, since the kappa statistic is a more rigorous metric than accuracy. Overall, we found less satisfactory results for the lesion classification problem than for lesion detection, which confirms that identifying a breast lesion is easier than differentiating specific types of lesions. Regarding the diagnosis of breast cancer, we were able to achieve high values of sensitivity and specificity using the DWNN. These metrics are


Figure 6.16 Results of (A) accuracy and (B) kappa statistic for lesion classification using four levels in the DWNN.

Table 6.4 Results for the classification into cyst, benign lesion, or malignant lesion (three-class problem).

| Classifier | Kernel | Accuracy (%), C2P1 | Accuracy (%), C4P1 | Kappa, C2P1 | Kappa, C4P1 |
|---|---|---|---|---|---|
| SVM | Linear | 50.2 ± 4.9 | 79.0 ± 4.8 | 0.250 ± 0.070 | 0.690 ± 0.070 |
| SVM | RBF | 37.1 ± 3.5 | 59.1 ± 4.4 | 0.060 ± 0.050 | 0.390 ± 0.070 |
| MLP | – | 48.3 ± 5.6 | 84.2 ± 4.4 | 0.220 ± 0.080 | 0.760 ± 0.070 |
| ELM | Sigmoid | 48.7 ± 5.8 | 68.8 ± 5.6 | 0.439 ± 0.063 | 0.650 ± 0.063 |
| ELM | Dilation | 49.5 ± 5.6 | 76.1 ± 5.2 | 0.449 ± 0.062 | 0.734 ± 0.058 |
| ELM | Erosion | 49.5 ± 5.8 | 76.1 ± 4.9 | 0.449 ± 0.064 | 0.735 ± 0.055 |

directly associated with the numbers of true and false positives and true and false negatives. An efficient diagnostic system therefore needs to maximize both sensitivity and specificity. With our method we achieved an overall efficiency of 0.87 for lesion


detection, with a sensitivity of 0.95 and a specificity of 0.79. Regarding lesion classification, that is, the ability to differentiate the lesions, the efficiency of the system in the best case was 0.86, with a sensitivity of 0.89 and a specificity of 0.84. This study points to the DWNN as a promising tool for representing medical images in general. In future studies, we believe we may improve the results by tuning the DWNN parameters. With the growing development of computer-aided diagnosis, tools such as the DWNN are important for optimizing these systems, thus providing increasingly reliable diagnoses.
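The chapter does not define "efficiency" explicitly. The reported values are consistent with the arithmetic mean of sensitivity and specificity, since (0.95 + 0.79)/2 = 0.87 and (0.89 + 0.84)/2 = 0.865, so the sketch below assumes that definition. The counts of true and false positives and negatives are hypothetical and serve only to show how the metrics relate to them.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were correctly rejected."""
    return tn / (tn + fp)


def efficiency(sens, spec):
    """Assumed definition: arithmetic mean of sensitivity and specificity."""
    return (sens + spec) / 2


# Hypothetical counts chosen to reproduce the lesion detection figures.
sens = sensitivity(tp=95, fn=5)   # 0.95
spec = specificity(tn=79, fp=21)  # 0.79
print("detection efficiency:", efficiency(sens, spec))
print("classification efficiency:", efficiency(0.89, 0.84))
```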

Acknowledgments

We thank the Brazilian scientific agencies FACEPE and CNPq for the partial financial support of this research.

References

[1] ORGANIZAÇÃO MUNDIAL DA SAÚDE, Breast cancer. https://www.who.int/cancer/prevention/diagnosis-screening/breast-cancer/en/, 2019 (accessed 16.01.19).
[2] INSTITUTO NACIONAL DE CÂNCER JOSÉ ALENCAR GOMES DA SILVA, Rio de Janeiro, ESTIMATIVA 2018: Incidência de câncer no Brasil. http://www1.inca.gov.br/estimativa/2018/, 2018 (accessed 22.01.19).
[3] INSTITUTO NACIONAL DE CÂNCER JOSÉ ALENCAR GOMES DA SILVA, Câncer de mama. https://www.inca.gov.br/tipos-de-cancer/cancer-de-mama, 2018 (accessed 16.01.19).
[4] D. Walker, T. Kaczor, Breast thermography: history, theory, and use. Is this screening tool adequate for standalone use? Nat. Med. J. 4 (7) (2012).
[5] INSTITUTO NACIONAL DE CÂNCER JOSÉ ALENCAR GOMES DA SILVA, Rio de Janeiro, Diretrizes para a Detecção Precoce do Câncer de mama no Brasil, 2015.
[6] F.R. Cordeiro, W.P. Santos, A.G. Silva-Filho, A semi-supervised fuzzy growcut algorithm to segment and classify regions of interest of mammographic images, Expert Syst. Appl. 65 (2016) 116–126.
[7] W.W. Azevedo, S.M. Lima, I.M. Fernandes, A.D. Rocha, F.R. Cordeiro, A.G. da Silva-Filho, et al., Fuzzy morphological extreme learning machines to detect and classify masses in mammograms, 2015 IEEE International Conference on Fuzzy Systems (Fuzz-IEEE), IEEE, 2015, pp. 1–8.
[8] F.R. Cordeiro, W.P. dos Santos, A.G. Silva-Filho, Segmentation of mammography by applying growcut for mass detection, Stud. Health Technol. Inform. 192 (2013) 87–91.
[9] F.R. Cordeiro, W.P. dos Santos, A.G. Silva-Filho, An adaptive semi-supervised fuzzy growcut algorithm to segment masses of regions of interest of mammographic images, Appl. Soft Comput. 46 (2016) 613–628.
[10] F.R. Cordeiro, S.M. Lima, A.G. Silva-Filho, W.P. dos Santos, Segmentation of mammography by applying extreme learning machine in tumor detection, International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2012, pp. 92–100.
[11] A.A. Mascaro, C.A. Mello, W.P. dos Santos, G.D. Cavalcanti, Mammographic images segmentation using texture descriptors, 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2009, p. 3653.
[12] S.M. de Lima, A.G. da Silva-Filho, W.P. dos Santos, A methodology for classification of lesions in mammographies using zernike moments, elm and svm neural networks in a multi-kernel approach, 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2014, pp. 988–991.
[13] T. Cruz, T. Cruz, W. Santos, Detection and classification of lesions in mammographies using neural networks and morphological wavelets, IEEE Lat. Am. Trans. 16 (3) (2018) 926–932.


[14] F.R. Cordeiro, K.F.P. Bezerra, W.P. dos Santos, Random walker with fuzzy initialization applied to segment masses in mammography images, 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2017, pp. 156–161.
[15] M. Santana, J. Pereira, N. Lima, F. Sousa, R. de Lima, W. dos Santos, Classificação de lesões em imagens frontais de termografia de mama a partir de sistema inteligente de suporte ao diagnóstico, Anais do I Simpósio de Inovação em Engenharia Biomédica-SABIO 2017, 2017, p. 16.
[16] I. Fernandes, W. dos Santos, Classificação de mamografias utilizando extração de atributos de textura e redes neurais artificiais, in: Congresso Brasileiro de Engenharia Biomédica – CBEB 2014, vol. 8, 2014.
[17] M. Araujo, K. Queiroz, M. Pininga, R. Lima, W. Santos, Uso de regiões elipsoidais como ferramenta de segmentação em termogramas de mama, in: XXIII Congresso Brasileiro de Engenharia Biomédica – CBEB 2012, 2012.
[18] T.B. Borchartt, A. Conci, R.C. Lima, R. Resmini, A. Sanchez, Breast thermography from an image processing viewpoint: a survey, Signal. Process. 93 (10) (2013) 2785–2803.
[19] M. Etehadtavakol, E.Y. Ng, Breast thermography as a potential non-contact method in the early detection of cancer: a review, J. Mech. Med. Biol. 13 (2) (2013) 1330001.
[20] L.F. Meira, E. Krueger, E.B. Neves, P. Nohama, M.A. de Souza, Termografia na área biomédica, Pan Am. J. Med. Therm. (2014) 31–41.
[21] G. Schaefer, T. Nakashima, M. Zavisek, Analysis of breast thermograms based on statistical image features and hybrid fuzzy classification, International Symposium on Visual Computing, Springer, 2008, pp. 753–762.
[22] N. Dey, A.S. Ashour, S. Borra, Classification in BioApps: Automation of Decision Making, vol. 26, Springer, 2017.
[23] K. Lan, D.-t Wang, S. Fong, L.-s Liu, K.K. Wong, N. Dey, A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42 (8) (2018) 139.
[24] M. Salmeri, A. Mencattini, G. Rabottino, A. Accattatis, R. Lojacono, Assisted breast cancer diagnosis environment: a tool for dicom mammographic images analysis, 2009 IEEE International Workshop on Medical Measurements and Applications, IEEE, 2009, pp. 160–165.
[25] W. Chen, M.L. Giger, U. Bick, A fuzzy c-means (fcm)-based approach for computerized segmentation of breast lesions in dynamic contrast-enhanced mr images, Acad. Radiol. 13 (1) (2006) 63–72.
[26] Z.M. Nordin, N.A.M. Isa, K.Z. Zamli, U.K. Ngah, M.E. Aziz, Semi-automated region of interest selection tool for mammographic image, 2008 International Symposium on Information Technology, vol. 1, IEEE, 2008, pp. 1–6.
[27] S.K. Bandyopadhyay, Survey on segmentation methods for locating masses in a mammogram image, Int. J. Comput. Appl. 9 (11) (2010) 25–28.
[28] A. Boujelben, A.C. Chaabani, H. Tmar, M. Abid, Feature extraction from contours shape for tumor analyzing in mammographic images, 2009 Digital Image Computing: Techniques and Applications, IEEE, 2009, pp. 395–399.
[29] M.A.D. Santana, J.M.S. Pereira, F.L.D. Silva, N.M.D. Lima, F.N.D. Sousa, G.M.S.D. Arruda, et al., Breast cancer diagnosis based on mammary thermography and extreme learning machines, Res. Biomed. Eng. 34 (2018) 45–53. http://www.scielo.br/scielo.php?script=sci_arttext&pid=S2446-47402018000100045&nrm=iso.
[30] W.P. dos Santos, R.E. de Souza, P.B. dos Santos Filho, Evaluation of alzheimer’s disease by analysis of mr images using multilayer perceptrons and kohonen som classifiers as an alternative to the adc maps, 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2007, pp. 2118–2121.
[31] W.P. Dos Santos, F.M. De Assis, R.E. De Souza, P.B. Mendes, H.S. de Souza Monteiro, H.D. Alves, A dialectical method to classify alzheimer’s magnetic resonance images, Evolutionary Computation, IntechOpen, 2009.
[32] W.P. dos Santos, F.M. de Assis, R.E. de Souza, P.B. dos Santos Filho, Evaluation of alzheimer’s disease by analysis of mr images using objective dialectical classifiers as an alternative to adc maps, 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2008, pp. 5506–5509.


[33] W.P. dos Santos, F. Assis, R. Souza, P.B. Santos Filho, F.L. Neto, Dialectical multispectral classification of diffusion-weighted magnetic resonance images as an alternative to apparent diffusion coefficients maps to perform anatomical analysis, Comput. Med. Imaging Graph. 33 (6) (2009) 442–460.
[34] W.P. dos Santos, F.M. de Assis, R.E. de Souza, P.B. dos Santos Filho, Dialectical classification of mr images for the evaluation of alzheimer’s disease, Recent Advances in Biomedical Engineering, IntechOpen, 2009.
[35] W.P. dos Santos, R.E. de Souza, P.B. Santos Filho, F.B.L. Neto, F.M. de Assis, A dialectical approach for classification of dw-mr alzheimer’s images, 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 1728–1735.
[36] O. Commowick, A. Istace, M. Kain, B. Laurent, F. Leray, M. Simon, et al., Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure, Sci. Rep. 8 (1) (2018) 13650.
[37] W.P. dos Santos, F.M. de Assis, R.E. de Souza, P.B. Mendes, H.S. Monteiro, H.D. Alves, Fuzzy-based dialectical non-supervised image classification and clustering, Int. J. Hybrid. Intell. Syst. 7 (2) (2010) 115–124.
[38] S.M. de Lima, A.G. da Silva-Filho, W.P. dos Santos, Detection and classification of masses in mammographic images in a multi-kernel approach, Comput. Meth. Prog. Biomed. 134 (2016) 11–29.
[39] S. Charan, M.J. Khan, K. Khurshid, Breast cancer detection in mammograms using convolutional neural network, in: 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), IEEE, 2018, pp. 1–5.
[40] Y.-D. Zhang, C. Pan, X. Chen, F. Wang, Abnormal breast identification by nine-layer convolutional neural network with parametric rectified linear unit and rank-based stochastic pooling, J. Comput. Sci. 27 (2018) 57–68.
[41] S. Mambou, P. Maresova, O. Krejcar, A. Selamat, K. Kuca, Breast cancer detection using infrared thermal imaging and a deep learning model, Sensors 18 (9) (2018) 2799.
[42] D. Halliday, R. Resnick, J. Walker, Fundamentos de Física - Gravitação, Ondas e Termodinâmica, vol. 2, tenth ed., LTC, 2016.
[43] R. Williams, G. Williams, Pioneers of invisible radiation photography, Medical and Scientific Photography, 2002.
[44] J.F. Head, C.A. Lipari, F. Wang, J.E. Davidson, R. Elliott, Application of second generation infrared imaging with computerized image analysis to breast cancer risk assessment, Proceedings of 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 5, IEEE, 1996, pp. 2093–2094.
[45] M. Moghbel, S. Mashohor, A review of computer assisted detection/diagnosis (CAD) in breast thermography for breast cancer detection, Artif. Intell. Rev. 39 (4) (2013).
[46] R. Lawson, Implications of surface temperatures in the diagnosis of breast cancer, Can. Med. Assoc. J. 75 (4) (1956) 309.
[47] M.M.d. Oliveira, Desenvolvimento de protocolo e construção de um aparato mecânico para padronização da aquisição de imagens termográficas de mama (dissertation), Federal University of Pernambuco, Recife, 2016.
[48] S.G. Mallat, Multifrequency channel decompositions of images and wavelet models, IEEE Trans. Acoust. Speech Signal Process. 37 (12) (1989) 2091–2110.
[49] J.W. Eaton, D. Bateman, S. Hauberg, GNU Octave manual, Network Theory Limited, Bristol, 2002.
[50] J.W. Eaton, D. Bateman, S. Hauberg, GNU Octave manual: version 3, Network Theory Limited, Bristol, 2008.
[51] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The weka data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[52] S. Haykin, Redes neurais: princípios e prática, Bookman, 2001. 8573077182.
[53] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1-3) (2006) 489–501. Available from: https://doi.org/10.1016/j.neucom.2005.12.126. arXiv:1311.4555.


[54] W.W. Azevedo, S.M. Lima, I.M. Fernandes, A.D. Rocha, F.R. Cordeiro, A.G. Da Silva-Filho, W.P. Dos Santos, Fuzzy morphological extreme learning machines to detect and classify masses in mammograms, in: IEEE International Conference on Fuzzy Systems, 2015. doi:10.1109/FUZZ-IEEE.2015.7337975.
[55] Y. Jung, J. Hu, A K-fold averaging cross-validation procedure, J. Nonparametr. Statist. 27 (2) (2015) 167–179. Available from: https://doi.org/10.1080/10485252.2015.1010532.
[56] H.D. Neto, Segmentação e análise automáticas de termogramas: um método auxiliar na detecção do câncer de mama, Ph.D. thesis (Master Thesis), Federal University of Pernambuco (UFPE), 2014.
[57] D.S. Gomes, S.S. Porto, D. Balabram, H. Gobbi, Inter-observer variability between general pathologists and a specialist in breast pathology in the diagnosis of lobular neoplasia, columnar cell lesions, atypical ductal hyperplasia and ductal carcinoma in situ of the breast, Diagn. Pathol. 9 (1) (2014) 121.
[58] A. Thomas, S. Kümmel, F. Fritzsche, M. Warm, B. Ebert, B. Hamm, et al., Real-time sonoelastography performed in addition to b-mode ultrasound and mammography: improved differentiation of breast lesions? Acad. Radiol. 13 (12) (2006) 1496–1504.
[59] M. Calas, R. Almeida, B. Gutfilen, W. Pereira, Intraobserver interpretation of breast ultrasonography following the bi-rads classification, Eur. J. Radiol. 74 (3) (2010) 525–528.
[60] M.J.G. Calas, R.M. Almeida, B. Gutfilen, W.C. Pereira, Interobserver concordance in the bi-rads classification of breast ultrasound exams, Clinics 67 (2) (2012) 185–189.
[61] R.L. Brennan, D.J. Prediger, Coefficient kappa: some uses, misuses, and alternatives, Educ. Psychol. Meas. 41 (3) (1981) 687–699.
[62] S.R. Munoz, S.I. Bangdiwala, Interpretation of kappa and b statistics measures of agreement, J. Appl. Stat. 24 (1) (1997) 105–112.
[63] M. Feuerman, A.R. Miller, Relationships between statistical measures of agreement: sensitivity, specificity and kappa, J. Eval. Clin. Pract. 14 (5) (2008) 930–933.
[64] W. Vach, The dependence of cohen’s kappa on the prevalence does not matter, J. Clin. Epidemiol. 58 (7) (2005) 655–661.
[65] J. Landis, G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1) (1977) 159–174.

CHAPTER SEVEN

Deep learning on information retrieval and its applications

Runjie Zhu, Xinhui Tu and Jimmy Xiangji Huang

1 Department of Computer Science & Engineering, York University, Toronto, Canada
2 School of Computer Science, Central China Normal University, Wuhan, China
3 School of Information Technology, York University, Toronto, Canada

7.1 Introduction

Information retrieval (IR) refers to the activity of searching for and obtaining information resources that are relevant to a specific information need from a collection of resources. It appears in many real-world applications such as web search, digital libraries, and electronic health records. As the information resources returned by a search can be large in quantity and uneven in quality, it is essential to rank them by their degree of relevance. This ranking step largely distinguishes IR problems from other, similar problems. Therefore, ranking models remain the central component of IR research.

In a common IR task, users first send a specific request to a search engine as a query consisting of keywords or questions. The search engine then searches the entire data collection for related content by looking into the index of web pages and texts, and returns a ranked list of relevant web pages or text documents to the user. The key challenge in this process is to minimize the query–document semantic gap. All ranking models aim to properly handle the different types of mismatch in natural language. Therefore, many IR tasks can be treated as text matching problems.

Many different text retrieval and ranking models, including basic handcrafted probabilistic models, semantic-based models, term dependency-based models, and learning to rank models, have been proposed in the past few decades. These methods, especially the learning to rank models, have proved to be effective in many real-world IR tasks. However, as deep learning and neural networks become more and more popular in the domain of IR, we still see many limitations and room for improvement in the effectiveness of these traditional models.

Traditional methods of matching queries and documents, or documents and documents, depend heavily on handcrafted feature designs to evaluate the degree of

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00008-9

© 2020 Elsevier Inc. All rights reserved.


Runjie Zhu et al.

relevance between the two targets. Specifically, handcrafted ranking models adopt human-designed features, such as term frequency, inverse document frequency (IDF), and document length, together with human-designed matching functions. Representative handcrafted ranking models include BM25 and LM. The key problem is that these handcrafted features and matching functions cannot incorporate important semantic relationships or word order into document ranking. Similarly, although semantic-based ranking models and term dependence-based ranking models take semantic relationships and term dependencies into account, going beyond a pure “bag of words,” the feature extraction process in the matching function still relies heavily on human involvement.

Learning to rank models optimize the query–document matching process by automatically learning the matching function. However, these models still cannot learn useful ranking features on their own; they depend on human-designed features found in traditional models, such as BM25 and PageRank, which can be over-specific or biased in definition and time-consuming to design. Although the learned matching function can capture characteristics of the data and generate ranked results, relevance is still difficult to capture, as the degree of relevance is embedded in a complicated human cognitive process, and a human's initial judgments on handcrafted features can directly affect the final results.

In recent years, deep learning methods have taken off in computer vision, speech recognition, and natural language processing, and they have dominated the publications in these fields. These deep neural network (DNN) models and deep learning techniques quickly changed research in IR, too.
Recently developed models such as Bidirectional Encoder Representations from Transformers (BERT), GPT-2, and XLNet have substantially improved the results in IR and natural language processing tasks. In principle, these models can learn features automatically, deriving abstract representations from raw text inputs to tackle complicated learning problems. Compared with the traditional models, these deep learning methods eliminate the need for human involvement in both the feature extraction and ranking processes: they can automatically learn features, patterns, relations between words and phrases, and hierarchical structures of texts directly from large text collections.

Deep semantic text matching differs from traditional matching in two aspects: text representations and matching functions. For text representations, words shift from local (for example, one-hot) representations to distributed representations, while sentences shift from the bag-of-words model used in traditional methods to distributed representations. For matching functions, the inputs shift from handcrafted features in traditional algorithms to automatically learned features in deep learning algorithms, and the function itself shifts from simple cosine or dot-product functions to neural network functions such as multilayer perceptrons or neural tensor networks. These deep learning

Deep learning on information retrieval and its applications

matching functions involve richer matching signals and incorporate more soft matching patterns to improve semantic matching results. In light of the potential benefits and promising results of deep learning algorithms, in this chapter we present the current status of the deep learning methods adopted in IR research and tasks, discuss their unique advantages and challenges, and give possible directions for future work. In Section 7.2 we provide an overview of the traditional approaches to IR and discuss the limitations of models based on handcrafted features. In Section 7.3 we examine the current deep learning approaches to IR and categorize them into three major groups: methods of representation learning, methods of matching function learning, and methods of relevance learning. We also compare the experimental results of these deep learning models and focus on the novelty and diversity of the techniques, as well as their evaluation metrics. In Section 7.4 we review the existing models along different dimensions and make empirical comparisons between them. Finally, in Section 7.5 we summarize the recent advances and progress made by these methods and provide several possible directions for future research.

7.2 Traditional approaches to information retrieval

The traditional approach to IR usually employs supervised machine learning techniques with handcrafted features. The process involves two steps: representing the user’s query and the documents in the collection as word vectors, and computing similarity scores between them, for example with cosine similarity.

7.2.1 Basic retrieval models

Most of the existing traditional IR models adopt a “bag of words” representation for the user’s queries and documents. A typical IR model involves three major components, namely the term frequency, the IDF, and the document length normalization [1]. Term frequency usually refers to the number of occurrences of a specific term in a document: the more often a term occurs, the higher the score the document is assigned. IDF reflects how common a term is across the entire document collection: terms that appear in many documents, such as stop words, are penalized so that emphasis falls on the more discriminative terms. Document length normalization is the mechanism that avoids a preference toward lengthy documents, which contain more words and are therefore more likely to match a specific query term. The traditional retrieval models combine these three factors to form retrieval algorithms that achieve better performance.
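The three components can be illustrated with a minimal Python sketch. The toy collection, the function names, the particular IDF variant ln((N + 1)/df(t)), and the slope value s = 0.2 are assumptions chosen for illustration; they are not taken from a specific retrieval system.

```python
import math
from collections import Counter

# Toy collection; the documents are illustrative only.
docs = {
    "d1": "deep learning for information retrieval".split(),
    "d2": "the retrieval of the documents and the ranking of the results".split(),
    "d3": "the cat sat on the mat".split(),
}

N = len(docs)                                   # number of documents
avdl = sum(len(d) for d in docs.values()) / N   # average document length


def tf(term, doc):
    """Raw term frequency: number of occurrences of the term in the document."""
    return Counter(doc)[term]


def df(term):
    """Document frequency: number of documents that contain the term."""
    return sum(1 for d in docs.values() if term in d)


def idf(term):
    """One common IDF variant; terms found in many documents score lower."""
    n = df(term)
    return math.log((N + 1) / n) if n else 0.0   # 0.0 for unseen terms (assumption)


def length_norm(doc, s=0.2):
    """Pivoted length normaliser; above 1 for longer-than-average documents."""
    return (1 - s) + s * len(doc) / avdl
```

In this toy collection the frequent word "the" receives a lower IDF than "deep", and the long document d2 receives a normaliser above 1, which lowers its score when the term-frequency component is divided by it.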


Runjie Zhu et al.

Pivoted Normalization Model: The pivoted normalization retrieval model proposed by Singhal [2] is one of the best performing models in the vector space group. In vector space models, both queries and documents are represented as vectors of terms, and documents are ranked by a similarity measure between the query vector and the document vectors. The formula of the pivoted normalization model is given by Eq. (7.1):

$$S(Q, D) = \sum_{t \in D \cap Q} \frac{1 + \ln\bigl(1 + \ln(c(t, D))\bigr)}{(1 - s) + s\,\frac{|D|}{avdl}} \cdot c(t, Q) \cdot \ln\frac{N + 1}{df(t)} \tag{7.1}$$

where c(t, D) and c(t, Q) are the counts of term t in the document and in the query, |D| is the document length, avdl is the average document length, N is the number of documents in the collection, df(t) is the number of documents containing t, and s is a slope parameter.
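As a minimal sketch, Eq. (7.1) can be computed as follows for a tokenised query and document. The function name, the argument layout, and the default s = 0.2 are assumptions for illustration; the document frequencies are passed in as a plain dictionary.

```python
import math
from collections import Counter


def pivoted_score(query, doc, df, n_docs, avdl, s=0.2):
    """Pivoted normalization score of Eq. (7.1).

    query, doc : lists of tokens
    df         : dict mapping a term to its document frequency
    n_docs     : number of documents in the collection (N)
    avdl       : average document length
    s          : slope parameter (0.2 is an illustrative default)
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    norm = (1 - s) + s * len(doc) / avdl
    score = 0.0
    for t in q_counts.keys() & d_counts.keys():   # terms in both Q and D
        tf_part = 1 + math.log(1 + math.log(d_counts[t]))
        idf_part = math.log((n_docs + 1) / df[t])
        score += tf_part / norm * q_counts[t] * idf_part
    return score


example = pivoted_score(
    ["deep", "retrieval"],
    ["deep", "learning", "retrieval", "retrieval"],
    df={"deep": 1, "learning": 1, "retrieval": 2},
    n_docs=3,
    avdl=4.0,
)
```

Only terms shared by the query and the document contribute, so a document with no query term scores zero.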

Okapi Model: The Okapi retrieval model introduced by Robertson and Walker in 1994 [3,4] is an example of a traditional probabilistic retrieval model. The formula of the Okapi model is presented as Eq. (7.2):

$$S(Q, D) = \sum_{t \in D \cap Q} \ln\frac{N - df(t) + 0.5}{df(t) + 0.5} \cdot \frac{(k_1 + 1)\, c(t, D)}{k_1\left((1 - b) + b\,\frac{|D|}{avdl}\right) + c(t, D)} \cdot \frac{(k_3 + 1)\, c(t, Q)}{k_3 + c(t, Q)} \tag{7.2}$$

where k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000) are constants.

Dirichlet Prior Model: The Dirichlet prior model introduced by Zhai and Lafferty [5,6] is one of the best performing retrieval models based on language modeling. The model applies the Dirichlet prior smoothing technique to the document language model, and it ranks documents by how likely the query is to be generated from the estimated language model of each document in the collection. The formula of the Dirichlet prior model is presented as Eq. (7.3):

$$S(Q, D) = \sum_{t \in D \cap Q} c(t, Q) \ln\left(1 + \frac{c(t, D)}{\mu\, p(t \mid C)}\right) + |Q| \ln\frac{\mu}{|D| + \mu} \tag{7.3}$$

Here p(t|C) plays a role similar to the document frequency df(t): it indicates how popular the term t is in the entire document collection. Other related research on basic retrieval models for genomic and biomedical data includes Yin et al. [7], Huang et al. [8], and Huang and Hu [9].
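The Okapi formula of Eq. (7.2) and the Dirichlet prior formula of Eq. (7.3) can be sketched in the same style. The function names and the default parameter values (k1 = 1.2, b = 0.75, k3 = 1000, μ = 2000) are assumptions for illustration; the collection statistics are supplied by the caller.

```python
import math
from collections import Counter


def okapi_score(query, doc, df, n_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Okapi score of Eq. (7.2); defaults are illustrative, not prescribed."""
    q_counts, d_counts = Counter(query), Counter(doc)
    score = 0.0
    for t in q_counts.keys() & d_counts.keys():
        # Note: this IDF form becomes negative when df(t) > N / 2.
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
        tf_doc = ((k1 + 1) * d_counts[t]) / (
            k1 * ((1 - b) + b * len(doc) / avdl) + d_counts[t]
        )
        tf_query = ((k3 + 1) * q_counts[t]) / (k3 + q_counts[t])
        score += idf * tf_doc * tf_query
    return score


def dirichlet_score(query, doc, p_coll, mu=2000.0):
    """Dirichlet prior score of Eq. (7.3).

    p_coll maps a term to its collection probability p(t|C).
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    score = len(query) * math.log(mu / (len(doc) + mu))
    for t in q_counts.keys() & d_counts.keys():
        score += q_counts[t] * math.log(1 + d_counts[t] / (mu * p_coll[t]))
    return score
```

For a one-term query and a document of average length containing that term once, the two term-frequency factors of the Okapi score reduce to 1, leaving only the IDF factor.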

7.2.2 Semantic-based models

Semantic-based models attempt to use the semantic relationship between words to improve retrieval performance. Representative semantic-based models include the translation language model and document models based on latent Dirichlet allocation (LDA).


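Translation Language Model: The translation language model proposed by Berger and Lafferty [10] is a statistical translation model. The document terms are mapped statistically to the query terms with the formula shown as Eq. (7.4):

$$p(q \mid d) = \sum_{w} t(q \mid w)\, l(w \mid d) \tag{7.4}$$

where t(q|w) is the probability of translating the document term w into the query term q, and l(w|d) is the document language model.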

LDA-based Document Model: The LDA-based model introduced by Wei and Croft [11] treats each document as a mixture of topics, with the topic mixtures of all documents in the collection sampled from the same conjugate Dirichlet prior. This design helps the model avoid overfitting and the difficulties of generating new documents. Further, the model is combined linearly with a simple language model to avoid information loss. The formula of the LDA-based document model is presented as Eq. (7.5):

$$p(w \mid d) = \lambda\left(\frac{N_d}{N_d + \mu}\, p_{ML}(w \mid d) + \left(1 - \frac{N_d}{N_d + \mu}\right) p(w \mid coll)\right) + (1 - \lambda) \sum_{i=1}^{k} p(w \mid t_i)\, p(t_i \mid d) \tag{7.5}$$

Other related research on semantic-based retrieval models includes Miao et al. [12], Jian et al. [13], and Tu et al. [14].
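The mixture in Eq. (7.5) can be sketched as below. This assumes that the topic distributions p(w|t_i) and p(t_i|d) have already been estimated elsewhere; the function name, the argument names, and the default values of λ and μ are illustrative assumptions.

```python
def lda_document_model(w, doc_counts, coll_prob, topic_word, doc_topic,
                       lam=0.7, mu=1000.0):
    """Probability of word w under the LDA-based document model of Eq. (7.5).

    doc_counts : dict of term counts in the document
    coll_prob  : dict of collection probabilities p(w|coll)
    topic_word : dict topic -> dict word -> p(w|t)
    doc_topic  : dict topic -> p(t|d)
    lam, mu    : interpolation and smoothing parameters (illustrative defaults)
    """
    n_d = sum(doc_counts.values())
    p_ml = doc_counts.get(w, 0) / n_d
    weight = n_d / (n_d + mu)
    # Dirichlet-smoothed language model part.
    smoothed = weight * p_ml + (1 - weight) * coll_prob[w]
    # Topic model part: sum over topics of p(w|t) * p(t|d).
    p_lda = sum(p_t * topic_word[t].get(w, 0.0) for t, p_t in doc_topic.items())
    return lam * smoothed + (1 - lam) * p_lda
```

The topic part gives non-zero probability to words that never occur in the document but are likely under its topics, which is how the model brings semantic relatedness into the ranking.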

7.2.3 Term dependency-based models

It is widely understood that dependencies exist between terms in a document collection. In other words, the occurrence of one term can provide strong evidence for the occurrence of another. Because the data in documents are sparse, many earlier studies assumed independence between terms. Gradually, attention shifted to term dependencies in the form of phrases or pairwise term co-occurrences. Eventually, more significant improvements in modeling dependencies were achieved with sequential dependence models, full dependence variants, and cross terms.

Sequential Dependence Model (SDM): The sequential dependence model proposed by Metzler and Croft [15] studies the relationship between query terms using a Markov random field retrieval model designed for term dependencies. The model assumes dependence between all adjacent term pairs extracted from the query. Specifically, it uses ordered windows to model n-grams or phrases in each document, and unordered windows to model the co-occurrences of each term pair within an eight-term window, in order to capture the dependency between two terms. As a result, the SDM uses smoothed language modeling estimates to assign weights


and scores to the terms and to the ordered and unordered windows, and linearly combines the results to generate a final relevance score between the user’s query and the document.

Cross Term: Zhao et al. model the dependence between terms by assuming that an occurrence of a query term affects the other words in its context, an effect they name the cross term [16,17]. Specifically, they suggest representing the impact between two query terms that appear close to each other with a shape function. When the two query terms are farther apart, their mutual impact under the shape function is weaker; when they move closer to each other, the shape functions intersect more and the impact is stronger. Further, they combined the proposed cross term concept with traditional probabilistic models in a hybrid model, called cross term retrieval (CRTER), for document retrieval and ranking. They demonstrated the effectiveness of their model through experiments on TREC data collections.
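The SDM scoring idea (single terms, ordered windows, and unordered windows, combined linearly) can be sketched as follows. This is a simplified illustration: the unordered window is approximated by counting term-pair positions less than eight tokens apart, the smoothing uses a constant background probability instead of true collection statistics, and the weights (0.85, 0.10, 0.05) are an assumed setting.

```python
import math


def ordered_count(a, b, doc):
    """Number of positions where term a is immediately followed by term b."""
    return sum(1 for i in range(len(doc) - 1) if doc[i] == a and doc[i + 1] == b)


def unordered_count(a, b, doc, window=8):
    """Number of (a, b) position pairs less than `window` tokens apart, in any order."""
    pos_a = [i for i, t in enumerate(doc) if t == a]
    pos_b = [i for i, t in enumerate(doc) if t == b]
    return sum(1 for i in pos_a for j in pos_b if i != j and abs(i - j) < window)


def smoothed_log(count, doc_len, background=1e-4, mu=10.0):
    """Log of a Dirichlet-style smoothed estimate (constant background is an assumption)."""
    return math.log((count + mu * background) / (doc_len + mu))


def sdm_score(query, doc, weights=(0.85, 0.10, 0.05)):
    """Linear combination of term, ordered-window, and unordered-window features."""
    w_term, w_ordered, w_unordered = weights
    n = len(doc)
    score = sum(w_term * smoothed_log(doc.count(t), n) for t in query)
    for a, b in zip(query, query[1:]):          # adjacent query term pairs
        score += w_ordered * smoothed_log(ordered_count(a, b, doc), n)
        score += w_unordered * smoothed_log(unordered_count(a, b, doc), n)
    return score


query = ["information", "retrieval"]
doc_phrase = "deep learning improves information retrieval".split()
doc_scattered = "retrieval systems index relevant information".split()
```

Both toy documents contain the two query terms once and have the same length, but only `doc_phrase` contains them as an exact phrase, so it receives the higher score.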

7.2.4 Learning to rank-based models

In the past few years, web search has revealed increasingly rich relationships between different contents and signals, especially through search log data. These data sources are rich in information and can be used as inputs to build ranking models automatically. Thus, shifting from the aforementioned trends, there has been a rise in using machine learning algorithms to develop automatic ranking models of the form f(q,d) for modern web search. In this section, we introduce the learning to rank models, which use supervised, semi-supervised, or reinforcement learning (RL) techniques to accomplish IR tasks. The training data used in learning to rank models usually consist of lists of items together with labels or numerical scores that specify how the items should be ranked. Liu [18] divides the existing learning to rank models into three subcategories, namely the pointwise, the pairwise, and the listwise approaches, according to the type of input representation and the loss function. Li [19] further explores and summarizes these different models in his survey paper. In the pointwise subcategory, there are Subset Ranking [20], McRank [21], Prank [22], and OC SVM [23]. In the pairwise subcategory, there are Ranking SVM [24], RankBoost [25], RankNet [26], GBRank [27], IR SVM [28], Lambda Rank [29], and LambdaMART [30]. Last but not least, in the listwise subcategory, there are ListNet [31], ListMLE [32], AdaRank [33], SVM MAP [34], and Soft Rank [35]. Large-scale experiments have been conducted on large collections of benchmark data sets to compare the performance of these three approaches. The experimental results have shown that, in general, the listwise approach usually outperforms both the pointwise and pairwise approaches in practice.
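The difference between the three subcategories lies mainly in the loss function, which can be sketched as follows. The three losses below are simplified stand-ins chosen for illustration (a squared-error pointwise loss, a RankNet-style logistic pairwise loss, and a ListNet-style top-one cross entropy); they are not the exact formulations of any single cited method.

```python
import math


def pointwise_loss(scores, labels):
    """Pointwise: mean squared error between each score and its relevance label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores, labels):
    """Pairwise: logistic loss over pairs (i, j) where item i is more relevant than j."""
    losses = [
        math.log(1 + math.exp(-(s_i - s_j)))
        for s_i, y_i in zip(scores, labels)
        for s_j, y_j in zip(scores, labels)
        if y_i > y_j
    ]
    return sum(losses) / len(losses) if losses else 0.0


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(scores, labels):
    """Listwise: cross entropy between top-one distributions of labels and scores."""
    p_true, p_pred = softmax(labels), softmax(scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))


labels = [2.0, 1.0, 0.0]          # graded relevance of three documents
good_scores = [3.0, 2.0, 1.0]     # ranks the documents in the right order
bad_scores = [1.0, 2.0, 3.0]      # ranks them in the reverse order
```

The pairwise and listwise losses depend only on the relative order (or relative magnitudes) of the scores within a query, whereas the pointwise loss treats every document independently.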


7.3 Deep learning approaches to IR

The domain of IR has always focused on two key questions: (1) how to represent the user’s intent and the content of the document collection, and (2) how to bridge the semantic gap between intent and content more appropriately. The recent progress made by deep learning techniques for IR has significantly improved performance on these two central problems: the representation and the matching of texts. In the traditional approach to IR, the algorithms rely on handcrafted features to carry out classification or matching tasks based on bags of keywords or cross terms within a sentence. In the modern approach to IR, the algorithms aim to understand the query and the documents first, by representing them as multiple automatically learned feature vectors. Multiple matching scores between a user’s query and the documents in the collection are then calculated, and neural ranking models use these matching scores as features to rank the documents. Unlike traditional handcrafted models and classical learning to rank models, these neural models require a large amount of training data. They learn feature representations automatically and directly from raw text in order to bridge the semantic gap between query and document vocabularies. Deep semantic text matching is mainly performed in two aspects: text representations and matching functions. Specifically, word representations shift from the one-hot encoding used in traditional approaches to the distributed representations used in deep learning approaches, and sentence representations shift from the bag-of-words representation used in traditional methods to the distributed representations used in deep learning methods. On the other hand, the inputs of the matching function shift from handcrafted features in traditional algorithms to automatically learned features in deep learning algorithms, and the function itself shifts from simple cosine or dot product functions to neural network functions such as multilayer perceptrons or neural tensor networks. These deep learning matching functions involve richer matching signals and incorporate more soft matching patterns to produce better semantic matching results.

7.3.1 Representation learning-based methods

As recent developments in deep learning have brought novel and effective approaches to word representation, it has become possible to use high-dimensional real-valued vectors to represent the meaning of a sentence. The deep learning methods proposed in the past few years aim to represent target units with distributed representations. The goal is to assign similar embeddings to words and phrases with similar meanings. Cosine similarity is commonly used to determine how similar two vectors are.


Figure 7.1 Typical flow of a representation learning matching system.

In fact, representation learning-based models are a group of deep neural models that build a fixed-dimensional vector representation for each text separately and then perform matching within the latent space. Representation learning for query-document matching proceeds in two steps. First, the query and document representations are computed by neural networks. Then, query-document matching is conducted by computing a matching score between the two representations. Fig. 7.1 shows the typical flow of a representation learning matching system. In general, the representation learning-based IR methods are divided into three major classes: DNN-based methods, CNN-based methods, and RNN-based methods.

7.3.1.1 Deep neural network-based methods

DNN-based methods use DNNs to generate the text representations. They are built on the foundation of latent semantic models, which use semantic similarity to map a user’s query to its relevant documents, something that traditional keyword-based matching methods cannot do. The DNN-based latent semantic models aim to project queries and documents into a common low-dimensional space where the relevance between a user’s query and a document is given by the distance between them. In this section, we present an overview of representation learning based on DNNs, including basic architectures and their implementations. Huang et al. proposed the deep structured semantic model (DSSM) [36] in 2013, and it is considered to be the first successful neural ranking model that addresses ad hoc retrieval directly. The DSSM uses clickthrough data to discriminatively train neural ranking models by maximizing the conditional likelihood of the clicked documents given a user’s query. In general, the model is divided into three major sections: the input section, the representation section, and the matching section.


The input section converts the vocabulary of the text corpus into bag-of-letter trigrams to improve scalability and generalizability. Because the vocabulary in web search is very large, the authors first apply word hashing in the input section to make the model more applicable to search tasks. Specifically, the DSSM shifts from the bag-of-words representation to a bag-of-letter-trigrams representation. For example, in the traditional bag-of-words representation, the words in a phrase such as “candy store” are represented by a vector such as [0, 0, 1, 0, . . ., 1, 0, 0]. In the DSSM, the words are further split into letter trigrams, so that “#candy# #store#” becomes #ca, can, and, ndy, dy#, #st, sto, tor, ore, re#. Hence, the vector representation turns out to be something like [0, 1, 0, 0, 1, 1, 0, . . ., 1]. The advantages of the bag-of-letter-trigrams representation include, but are not limited to: (1) letter trigrams effectively reduce the vocabulary and thus greatly reduce the input dimensionality; (2) they are flexible enough to generalize to unseen words; and (3) they are robust to misspellings and inflections. After that, the representation section maps sentences into vectors with DNNs, such that sentences with similar semantic meanings lie close to each other. Fig. 7.2 illustrates the representation section of the DSSM. The model uses DNNs to capture compositional sentence representations by mapping high-dimensional raw text features into a low-dimensional, condensed semantic feature space. The output of the word hashing layer, with 30,000 elements, is fed into the next layers through multilayered nonlinear projections. Then the final semantic feature is an abstract feature generated from the

Figure 7.2 Illustration of the DSSM. From P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured semantic models for web search using clickthrough data, in: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13, ACM, New York, NY, USA, 2013, pp. 2333-2338.


entire neural activities below. Once the bag-of-letter trigrams have been fed through the fully connected layers and the output has been generated as a condensed vector, DSSM uses cosine similarity as the matching function to measure the similarity between semantic vectors.

7.3.1.2 Convolutional neural network-based methods

The convolutional neural network (CNN) is a class of DNNs that is commonly applied in computer vision [37] and natural language processing studies. Its connectivity pattern is loosely analogous to that of neurons in the human brain, and it can be seen as a regularized version of the multilayer perceptron, which is a fully connected network. Specifically, a CNN is made up of one input layer, multiple hidden layers, and an output layer. The hidden layers typically include convolutional layers, ReLU (activation function) layers, pooling layers, fully connected layers, and normalization layers. Compared to other classification algorithms, a CNN requires much less preprocessing and can achieve better results as the amount of training increases. In natural language processing, a CNN is designed to identify local predictive features in a large structure and to combine them to produce a fixed-size vector representation of that structure. Thus, in essence, a CNN is an effective feature extraction architecture that can automatically identify the predictive n-grams in a sentence. Hence, CNN-based representation learning methods can address the loss of word order in bag-of-words representations by keeping the local order of the words. The inputs to the neural network are therefore sequences of words from the user’s query and the documents. The CNN-based representation learning methods apply a 1-D convolutional operation to keep the necessary information about local word order. As shown in Fig. 7.3, the features extracted after pooling correspond to words that appear in the original sentence in their correct order.
Building on this, the convolutional DSSM (CDSSM), proposed by Shen et al. [39] in 2014, improved Huang et al.’s DSSM [36] by replacing the bag-of-words input with a concatenation of term vectors in sequence. Specifically, the model processes each term within a given context window, in word order, to capture n-gram-based contextual features. Then, based on the salience of the n-gram features, the CDSSM aggregates the feature vectors to the sentence level. Finally, the model applies a nonlinear transformation to extract high-level semantic information and generate a continuous vector representation of the entire text. Convolution in the proposed model is followed by global max-pooling. In general, the CDSSM (also known as CLSM) treats the sentence as a bag of n-grams followed by a max-pooling layer. It is capable of extracting local contextual features at the n-gram word level as well as global contextual features at the sentence level through max-pooling.
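The pipeline described in this section (letter-trigram word hashing, a convolution over word windows, global max-pooling, and cosine matching) can be sketched in plain Python. This is an untrained toy under stated assumptions: the trigram vocabulary is built from four example words, the convolution weights are random rather than learned from clickthrough data, and the names `encode`, `vocab`, and `weights` are hypothetical.

```python
import math
import random


def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary marks."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def term_vector(word, vocab):
    """Bag-of-letter-trigrams count vector for one word."""
    vec = [0.0] * len(vocab)
    for tri in letter_trigrams(word):
        if tri in vocab:
            vec[vocab[tri]] += 1.0
    return vec


def encode(text, vocab, weights, window=3):
    """Convolve over word windows, then apply global max-pooling (window must be odd)."""
    words = text.lower().split()
    zero = [0.0] * len(vocab)
    half = window // 2
    vecs = [zero] * half + [term_vector(w, vocab) for w in words] + [zero] * half
    feature_maps = []
    for i in range(len(words)):
        # Concatenate the term vectors inside the context window.
        context = [x for v in vecs[i:i + window] for x in v]
        feature_maps.append(
            [math.tanh(sum(w * x for w, x in zip(row, context))) for row in weights]
        )
    # Global max-pooling: keep the strongest activation of each feature.
    return [max(column) for column in zip(*feature_maps)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


rng = random.Random(0)  # fixed seed so the toy example is reproducible
trigram_set = sorted({t for w in "candy store sweet shop".split() for t in letter_trigrams(w)})
vocab = {tri: i for i, tri in enumerate(trigram_set)}
weights = [[rng.uniform(-0.5, 0.5) for _ in range(3 * len(vocab))] for _ in range(8)]

query_vec = encode("candy store", vocab, weights)
doc_vec = encode("sweet shop", vocab, weights)
similarity = cosine(query_vec, doc_vec)
```

Because the weights are random, the similarity value itself carries no meaning here; the sketch only shows how texts of different lengths are mapped to fixed-size vectors that can be compared with cosine similarity.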


Figure 7.3 Example of CNN-based representation learning methods. From X. He, J. Gao, L. Deng, Deep learning for natural language processing: theory and practice tutorial, in: CIKM’14 Tutorial. <https://www.microsoft.com/en-us/research/publication/deep-learning-for-natural-language-processing-theory-and-practice-tutorial/>, 2014 [38].

Similarly, Refs. [40] and [41] use sequences of word embeddings trained on large data collections as inputs to train CNN-based representation learning models in which each sequence of k words is composed by the convolutional network. The architecture of ARC-I proposed by Hu et al. [40] is very similar to that of DSSM [36]. The difference is that, like CDSSM, the ARC-I model performs a 1-D convolutional operation to learn the text representations separately. The model was originally designed to match two sentences for paraphrasing tasks. ARC-I first learns and extracts representations from the two sentences separately, and then it compares the features extracted through max-pooling layers to generate a matching degree. Qiu et al. [41] use the same approach of encoding the semantic meaning of sentences, and they apply a tensor layer to model the interactions between the sentences in order to solve question-answering problems. Their model, the convolutional neural tensor network (CNTN), combines sentence modeling and semantic matching, and it learns the matching degree between questions and answers. Fig. 7.4 shows the basic architecture of the neural tensor network. It is structurally similar to ARC-I and follows similar steps to generate matching scores. Unlike ARC-I, the CNTN model is also evaluated on a Chinese corpus. The experimental results suggest that CNN-based models in general perform better than traditional n-gram word embedding approaches, and that the CNTN significantly outperforms other existing models because it brings complicated interactions between the sentences into the calculation. The results in Chinese are slightly less effective than those in English, but this does not affect the overall performance of the model.
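A tensor layer of the kind used for matching two sentence vectors can be sketched as below. The sketch assumes the commonly used neural tensor network form, score = u · f(qᵀ M[k] a + V[k] · [q; a] + b[k]) over r tensor slices, with tanh as the nonlinearity f; the exact parameterization in [41] may differ, and all parameter values here are hand-picked toy numbers.

```python
import math


def tensor_layer_score(q, a, M, V, b, u):
    """Matching score of two sentence vectors q and a through a tensor layer.

    M : list of r matrices (len(q) x len(a)), one bilinear slice each
    V : list of r weight vectors over the concatenation [q; a]
    b : list of r biases
    u : list of r output weights
    """
    qa = q + a  # concatenation [q; a]
    hidden = []
    for M_k, V_k, b_k in zip(M, V, b):
        bilinear = sum(
            q[i] * M_k[i][j] * a[j] for i in range(len(q)) for j in range(len(a))
        )
        linear = sum(v * x for v, x in zip(V_k, qa))
        hidden.append(math.tanh(bilinear + linear + b_k))
    return sum(u_k * h_k for u_k, h_k in zip(u, hidden))


# One tensor slice that rewards the interaction between q[0] and a[1].
score = tensor_layer_score(
    q=[1.0, 0.0],
    a=[0.0, 1.0],
    M=[[[0.0, 2.0], [0.0, 0.0]]],
    V=[[0.0, 0.0, 0.0, 0.0]],
    b=[0.0],
    u=[1.0],
)
```

The bilinear term is what distinguishes the tensor layer from a plain multilayer perceptron on the concatenated vectors: it lets every dimension of one sentence vector interact multiplicatively with every dimension of the other.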


Figure 7.4 Architecture of the proposed neural tensor network [41].

7.3.1.3 Recurrent neural network-based methods

Recurrent neural networks (RNNs) are neural networks with memory that can capture information from the preceding elements of a sequence. In other words, RNNs can make use of the information in a relatively long sequence, since they perform the same task for every element in the sequence, with each output dependent on all previous computations. In general, RNN-based methods are able to capture and store long-range dependencies. The long short-term memory (LSTM) model and the gated recurrent unit (GRU) are popular variants used in RNN-based representation learning methods. However, it is worth noting that in real applications, plain RNN models are limited in that they can effectively look back only a few steps. Palangi et al. [42] construct an LSTM-based RNN model to solve the sentence representation learning problem in IR. Because the model is able to retain long-term memory, the LSTM-RNN captures the semantic meaning of the entire sentence, attending to the most salient words and attenuating the less important ones as it goes through each term in sequence. Specifically, the LSTM-RNN model takes in each term of a sentence sequentially in order to embed the sentence into a semantic vector. Instead of simply summing term vectors, Palangi et al. use a sequence of letter trigrams as inputs to generate the embedding vector of the whole sentence. This helps the model to retain valuable information in long-term memory while discarding the less salient terms. In their LSTM architecture, the last output of the hidden layer is the representation of the full sentence.
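The idea of reading a sentence term by term and taking the last hidden state as its representation can be sketched with a minimal LSTM cell. This is a standard LSTM without bias terms or peephole connections, with random untrained weights and one-hot toy inputs; all of these are simplifying assumptions and do not reproduce the architecture of [42].

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lstm_step(x, h, c, params):
    """One LSTM step: gates are computed from the concatenation [x; h]."""
    z = x + h
    i = [sigmoid(v) for v in matvec(params["input"], z)]        # input gate
    f = [sigmoid(v) for v in matvec(params["forget"], z)]       # forget gate
    o = [sigmoid(v) for v in matvec(params["output"], z)]       # output gate
    g = [math.tanh(v) for v in matvec(params["candidate"], z)]  # candidate memory
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def encode_sentence(term_vectors, params, hidden_size):
    """Run the LSTM over the terms; the last hidden state represents the sentence."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in term_vectors:
        h, c = lstm_step(x, h, c, params)
    return h


rng = random.Random(0)  # fixed seed for a reproducible toy example
input_size, hidden_size = 3, 4
params = {
    gate: [[rng.uniform(-1, 1) for _ in range(input_size + hidden_size)]
           for _ in range(hidden_size)]
    for gate in ("input", "forget", "output", "candidate")
}

sentence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # three toy term vectors
representation = encode_sentence(sentence, params, hidden_size)
reversed_representation = encode_sentence(sentence[::-1], params, hidden_size)
```

Feeding the same terms in reverse order produces a different vector, which illustrates that, unlike a bag-of-words sum, the recurrent encoder is sensitive to word order.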

7.3.2 Methods of matching function learning

Unlike the representation learning approaches, where word or sentence representations are generated before being fed into the final matching function layers, the deep learning methods based on matching function learning feed the raw inputs from queries and documents


Figure 7.5 Typical flow of a matching function learning for information retrieval.

directly into the neural networks to construct basic low-level matching signals, and then the algorithms aggregate the results to generate the final matching patterns and scores, as shown in Fig. 7.5. In general, the matching function learning-based IR methods are divided into four major classes: matching with a word-level similarity matrix, matching with attention models, matching with transformer models, and methods that combine matching function learning with representation learning.

7.3.2.1 Matching with word-level similarity matrix

The CNN is considered to be one of the most effective breakthroughs in image retrieval and classification tasks, and in the past few years it has proved to be effective in text processing as well. In fact, most existing matching function-based methods with a word-level similarity matrix are constructed with CNNs. The basic idea of a CNN in word-level similarity measurement is to slide a window (filter) over the rows of tokens or words in the text, multiplying the filter values element-wise with the underlying matrix entries and summing them up. The ARC-II model proposed by Hu et al. [40] is an example of matching function learning with a word-level similarity matrix. Like ARC-I, the ARC-II model was originally intended for paraphrasing tasks. Specifically, unlike the representation learning-based methods, where the representations of queries and documents are generated by the neural networks before they meet for similarity comparison, the matching function-based methods let the two target sentences interact first and then generate higher-level representations from the interaction. The basic matching signal is provided by a phrase-sum interaction matrix: the interaction activations are calculated with nonlinear mappings by sliding the CNN windows over both sentences, which captures the structure of the interactions between local words. Finally, a multilayer perceptron is used as the aggregation function to compute the final matching score.
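The general idea of matching on a word-level similarity matrix can be sketched as follows: build the query-by-document similarity matrix, slide a 2-D filter over it, and pool the result. The toy embeddings and the hand-set diagonal kernel are assumptions for illustration; in a trained model the kernels are learned and the pooled features are passed to further layers.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def interaction_matrix(query_vecs, doc_vecs):
    """Entry (i, j) is the similarity of query word i and document word j."""
    return [[cosine(q, d) for d in doc_vecs] for q in query_vecs]


def conv2d(matrix, kernel):
    """Valid 2-D convolution (no padding): element-wise multiply and sum per window."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(matrix) - kh + 1, len(matrix[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * matrix[i + a][j + b] for a in range(kh) for b in range(kw))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def global_max(feature_map):
    """Global max-pooling over the whole feature map."""
    return max(max(row) for row in feature_map)


# Toy 2-D word embeddings: two query words and three document words.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

matrix = interaction_matrix(query_vecs, doc_vecs)
diagonal_kernel = [[1.0, 0.0], [0.0, 1.0]]  # responds to in-order phrase matches
feature_map = conv2d(matrix, diagonal_kernel)
signal = global_max(feature_map)
```

The diagonal kernel gives its largest response where two consecutive query words match two consecutive document words, which is how position and word order are preserved in this family of models.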


The advantage of the ARC-II model is that both the convolutional layers and the pooling functions are able to keep the word order information of the original sentences, unlike the DSSM [36]. However, because the 2-D matching matrix is built from the word embeddings of two n-grams, the model still fails to capture exact matching signals at the word level directly.

The MatchPyramid model [43] introduced by Pang et al. is another example of word-level matching, which treats text matching as image recognition using CNNs. The matching function consists of 2-D convolutional layers and a set of pooling layers. Specifically, the algorithm first computes an interaction matrix between the terms of the query and the document, using both exact word matches and semantic similarity between word embeddings. The matching matrix is viewed as an image, with the advantage that the positions of the terms are retained in the next layer. The model then feeds the interaction matrix into the convolutional layers to capture rich matching patterns and to pool features layer by layer. After the layers of 2-D convolution and pooling, the algorithm adopts a series of feedforward layers to determine the final matching score based on the word similarities.

In contrast to Pang et al.’s CNN approach, Wan et al. [44] process the word similarity matrix with an RNN, specifically a spatial RNN (SRNN), in a model named Match-SRNN. Although word-level similarity is still the basic matching unit, Wan et al. apply SRNNs followed by pooling layers, instead of a CNN, to find the final matching score. Specifically, the proposed model adopts a recursive matching structure and first constructs a word interaction tensor to learn the degree of word similarity. The Match-SRNN then calculates the scores recursively, from the top-left corner to the bottom-right corner.
In other words, each unit in the structure summarizes the matching signals from its neighboring units. In the second layer, the model feeds the results generated by the interaction layer into an SRNN to further extract salient features. The hidden states of the RNN are updated with both the current interactions and the hidden states of the prefixes.

7.3.2.2 Matching with attention model

The attention mechanism in matching function-based methods also comes from computer vision. When it was first adopted in text processing, the rationale was to locate prominent text features in the same way as one locates a specific object, for example a dog or a cat, in a given area of an image. The basic structure of an attention mechanism consists of a read operator that reads the inputs, a glimpse sensor, which can be any form of neural network, that extracts features from the inputs, and a locator that predicts where the next read operation should take place.


Parikh et al.’s decomposable attention model [45] is an example of a deep learning method whose matching function is based on an attention mechanism. The model uses a simple neural architecture to decompose the complicated problem into subproblems that can be solved separately and in parallel. The model achieves matching in three steps: (1) it attends to the input by soft-aligning the words that appear in the user’s query and the document; (2) it uses the parallel structure to compare each aligned subphrase separately and calculate the corresponding matching scores; and finally (3) it aggregates the output matching signals from all subphrases to generate a final matching score for the two target sentences.

7.3.2.3 Matching with transformer model

Since the end of 2017, there has been an explosion of work adapting transformer-based models to language modeling tasks, after Vaswani et al. proposed the transformer in the article “Attention Is All You Need” [46]. Transformer-based models are popular in recent work because they can generally overcome inherent limitations of classic neural network architectures. For example, the transformer overcomes the speed problem inherent in RNNs, LSTMs, and GRUs, which require sequential operations that are slow in nature. It can also overcome the long-term dependency problem of CNNs, which struggle to handle long-range dependencies in text. Indeed, the self-attention process of these transformer-based models treats each term in the context without a bias based on distance, so it does not prefer nearby terms over distant ones.
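The self-attention operation at the core of the transformer [46] can be sketched as scaled dot-product attention over a short sequence. The sketch uses a single head, identity projection matrices, and no positional encoding, all of which are simplifying assumptions made to keep the example small.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X is a list of token vectors; Wq, Wk, Wv are projection matrices.
    Returns the attended outputs and the attention weights.
    """
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(K[0]))
    weights = [
        softmax([sum(q * k for q, k in zip(q_i, k_j)) / scale for k_j in K])
        for q_i in Q
    ]
    return matmul(weights, V), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
# Tokens 1 and 3 are identical; token 2 is different.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
output, weights = self_attention(X, identity, identity, identity)
```

In this toy run the first token attends equally to itself and to the identical third token, regardless of the distance between them, and all tokens are processed with the same matrix operations rather than one after another.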
Moreover, because of their structure, transformer-based models train much more efficiently than RNN- or CNN-based models: at each layer, the tensor processing units can process all the words in a batch simultaneously, instead of stepping through each sequence. The BERT model proposed by Google at the end of 2018 [47] sparked a wave of work in the IR field. Since then, a considerable number of papers have incorporated BERT into traditional IR models and other deep learning-based models. The BERT model consists of an input embedding layer, a positional encoding layer, and the main BERT transformer encoder body, in which a multi-headed self-attention mechanism and feedforward layers reside with add-and-norm layers in between. The inputs to the BERT transformer encoder are first converted into a series of input embeddings. Then, the positional encoding functions process these embeddings by encoding the positions of the terms. After that, the results are fed into the main body of the BERT model, where the multi-head attention system first models the context of these embeddings, and residual connections and normalization layers are then applied to keep the training of the neural network stable. As the embeddings pass

Runjie Zhu et al.

through the multi-headed self-attention system, they are fed into feedforward neural network layers to generate hierarchical nonlinear features.

Based on the existing studies that apply the BERT model to IR, the ranking strategies can be divided into two approaches: the feature-based approach and the fine-tuning approach. The feature-based BERT models adopt the same approach as most existing representation learning methods. The BERT structure generates representations of the user’s query and of the document separately, and the two are then compared with cosine similarity to compute the final ranking score. The fine-tuning-based BERT models adopt the same approach as most existing relevance learning methods. The model predicts the degree of relevance of a query-document pair.

Nogueira et al. use the BERT fine-tuning approach to improve the performance of passage re-ranking. In a paper published in February 2019 [48], they use BERT to implement query-based passage re-ranking in a question-answering pipeline. In the proposed method, the BERT large model is adopted as a binary classification model for re-ranking, where the query is fed in as sentence A of Devlin et al.’s architecture [47] and the document passage is fed in as sentence B. With this simple adaptation of BERT, the experimental results show a relative 27% improvement over state-of-the-art models in MRR@10.

Following the same path as Ref. [48], Padigela et al. [49] analyzed the BERT model fine-tuned on the MS MARCO passage re-ranking data set to explain how BERT can deliver such promising results in this specific task. The study shows the robustness of the BERT model on passage re-ranking tasks. It also shows that the traditional handcrafted BM25 model tends to be more biased toward high query term frequency, as term frequency and IDF are the two key values in its formula.
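
The role of term frequency and IDF in BM25 can be made concrete with a short sketch. The function below is a hypothetical, minimal BM25 scorer, not code from Ref. [49]: the parameter defaults `k1=1.2` and `b=0.75` and the smoothed IDF variant (with 1 added inside the logarithm) are assumptions chosen for illustration, and documents are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query with BM25.

    corpus is a list of tokenized documents, used for IDF and average length.
    Assumption: smoothed IDF, log((N - df + 0.5) / (df + 0.5) + 1).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        freq = tf[term]                           # term frequency in this document
        length_norm = k1 * (1.0 - b + b * len(doc_terms) / avg_len)
        score += idf * freq * (k1 + 1.0) / (freq + length_norm)
    return score

corpus = [
    ["bert", "ranks", "passages"],
    ["bm25", "ranks", "documents", "bm25"],
    ["neural", "ranking"],
]
scores = [bm25_score(["bm25"], doc, corpus) for doc in corpus]
print(scores)  # only the second document contains the query term
```

Because the score grows with the number of times a query term occurs in the document, a document that repeats a query term is ranked higher even when the repetition adds no information, which is the bias toward query term frequency noted above.
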
On the other hand, the BERT model performs much better on abbreviation-based answers than on long queries with context answers. Padigela et al. suggest, as a possible direction for future work, incorporating various types of queries and encoding query contexts effectively for lengthy queries.

Building on the success that BERT has achieved in question-answering tasks, Yang et al. applied BERT to ad hoc document retrieval [50] with the fine-tuning approach. To overcome the difficulty of document length, i.e., documents being too long for the BERT model, Yang et al. simply run inference on each sentence of a document separately to generate a score for each, and then consolidate these scores to generate the score of the entire document. The experimental results show that the BERT model brings substantial improvements on both Microblog track data and newswire documents. This is another example of the power of a simple application of BERT to IR tasks.

Aside from Ref. [50], Ref. [51] is another recent advance achieved by transformer-based models in ad hoc document retrieval with the fine-tuning approach. MacAvaney et al. constructed hybrid models of Contextualized Embeddings for

Document Ranking (CEDR) simply by using the classification vectors of BERT together with existing neural network models to perform ad hoc ranking tasks. The experimental results on Robust04 and Web Track 2012-14 show that BERT further boosts the strength and performance of neural models in ad hoc retrieval systems.

Following the same logic as Refs. [50,51], Dai and Callan [52] also studied the BERT model in the subdomain of ad hoc retrieval. They believe that the BERT model is better at leveraging language structures, which improves retrieval for queries written in natural language. Moreover, by combining search knowledge with the capability of understanding text, the pretrained BERT model opens new opportunities for context understanding and language structure modeling. Thus, the fine-tuning approach of BERT can deliver significantly improved results in ad hoc retrieval tasks.

Qiao et al. [53] studied the two aforementioned approaches together, combining pretrained BERT and fine-tuned BERT in one study: they used MS MARCO for the passage re-ranking task and TREC Web Track for ad hoc document retrieval. The experimental results on the two groups of data sets show that the BERT model is robust and strong in matching tasks, as it manages to distribute global attention over the entire context; however, the study also gives some indications for future study. Specifically, the results on the question-based queries of the MS MARCO data set show that BERT performs well on question answering-based tasks such as passage re-ranking. This robustness reflects BERT’s strong interaction capability when dealing with sequence-to-sequence matching. On the other hand, the fine-tuning approach of BERT on ad hoc retrieval still has room for further improvement. The BERT model tends to favor pretraining user clicks over using surrounding contexts to run matching functions.
XLNet, proposed by Yang et al. [54], came as a sudden shock to the BERT model family. XLNet, a generalized autoregressive pretraining model, addresses the pretrain-fine-tune discrepancy of BERT and BERT’s inability to take the dependency between masked positions into account. Specifically, XLNet eliminates these shortcomings of the BERT model through its autoregressive formulation and incorporates the Transformer-XL autoregressive model into pretraining. Moreover, it is able to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order [54]. The experimental results show a powerful performance that beats the BERT model in ad hoc IR.

7.3.2.4 Combining matching function learning and representation learning

From the earlier discussions, we can see that both matching function learning-based and representation learning-based deep learning models can deliver promising results. Therefore, there has been a rising trend in the literature of combining the two to build hybrid models. The Duet model

proposed by Mitra et al. [55] is an example of a model that contains both a matching function learning-based component and a representation learning-based component. Specifically, the matching function learning component learns local matchings, while the representation learning component is applied to distributed matchings (see Fig. 7.6). In Mitra et al.’s architecture, the matching function learning-based component uses an indicator matrix to mark where each term occurs within a document, and then uses convolutional layers to extract the salient features. Meanwhile, instead of using cosine similarity, the representation learning-based component follows an approach similar to that of the DSSM and ARC-I models discussed previously, applying simple feedforward neural networks to compute the similarity scores. The proposed model then produces the final score as a linear combination of the scores generated by the two components.
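
The two ingredients described above, the term-occurrence indicator matrix and the combination of the two component scores, can be sketched as follows. This is a simplified, hypothetical illustration rather than Mitra et al.’s implementation: the convolutional and feedforward networks are omitted, and the function names and the equal default weights are assumptions made for the example.

```python
def exact_match_matrix(query_terms, doc_terms):
    """Binary indicator matrix: entry [i][j] is 1 when query term i
    occurs at position j of the document, and 0 otherwise."""
    return [[1 if q == d else 0 for d in doc_terms] for q in query_terms]

def combine_scores(local_score, distributed_score, w_local=0.5, w_distributed=0.5):
    """Linear combination of the scores from the two components.
    The weights are placeholders; in a trained model they would be learned."""
    return w_local * local_score + w_distributed * distributed_score

query = ["deep", "rank"]
document = ["deep", "learning", "rank", "deep"]
matrix = exact_match_matrix(query, document)
for row in matrix:
    print(row)

# Hypothetical component outputs, standing in for the two sub-networks.
print(combine_scores(local_score=1.0, distributed_score=0.0))
```

The indicator matrix keeps the positions of the exact matches, which is the information the local component works from, while the distributed component would operate on learned embeddings of the query and the document.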

7.3.3 Methods of relevance learning

Aside from representation learning and matching function learning, another large group of deep learning methods belongs to relevance learning. In the domain of IR, relevance matching is especially important because the similarity between a query and a document does not always reflect the degree of relevance between them. Similarity matching models usually adopt a symmetric matching function and are limited to comparing texts of the same form, whereas relevance matching usually uses an asymmetric matching function and can compare texts of different forms, ranging from keywords to phrases, sentences, and documents. Besides, similarity matching models are good at judging whether two selected queries or documents are semantically similar; however, this does not indicate whether a document is relevant to a query. Further, similarity matching is usually measured at all positions of the two target sentences, whereas relevance matching can occur in any part of the document and is not limited to certain parts. As a result, the two matching methods suit different IR tasks: similarity matching focuses on paraphrase identification, while relevance matching focuses on ad hoc retrieval, which fulfills different users’ needs.

The existing deep learning methods that use relevance learning are mainly divided into two subcategories: (1) relevance learning based on global distribution of matching strengths, and (2) relevance learning based on local context of matched terms.

7.3.3.1 Based on global distribution of matching strengths

Deep learning methods based on relevance matching using global distribution of matching strengths are usually executed in two steps: (1) to calculate each query term’s matching signals across the entire document, followed by the calculation of the

Figure 7.6 The basic duet architecture with matching function learning on the left and the representation learning on the right [55].

distribution of global matching strengths; and (2) to consolidate the distributions of the matching strengths. The existing methods in this category not only handle the matching of short queries against long documents, but also achieve distributions of matching strength that are more robust than the raw matching signals obtained directly from queries and documents. However, because these models measure each query term’s matching signals separately across the document, word order is usually lost when the distributions of the matching strengths are calculated.

The DRMM method proposed by Guo et al. [56] is an example of solving IR problems with relevance learning based on global distribution of matching strengths. Their model adopts three methods, counting, relative counting, and log counting, to compute the interactions between the words in the text and the related documents. The basic matching signal in the model is the cosine similarity calculated from word embeddings, and the resulting scores are turned into matching histograms, which lose the word order information in the mapping. Then, the matching histogram of each query term is passed through the layers of a feedforward matching network, while a term gating network produces each query term’s attention weight. The scores calculated for each query term are then fed into a further layer that generates the final relevance score based on these attention weights. Hence, the semantic gaps between the words are bridged with the embeddings generated from matching matrices. The experiments conducted in the study clearly show that similarity matching and the relevance matching required by ad hoc retrieval are significantly different in nature. Most of the existing deep learning models thus perform similarity measurement rather than ad hoc retrieval.
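
The matching histogram step can be sketched in a few lines. This is a hypothetical, simplified illustration rather than Guo et al.’s code: it assumes five bins over the cosine range with the last bin reserved for exact matches, it uses log(1 + count) for the log-count variant so that empty bins stay finite, and the names `cosine` and `matching_histogram` are introduced only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def matching_histogram(query_vec, doc_vecs, n_bins=5, mode="count"):
    """Histogram of cosine similarities between one query term and all document terms.

    The first n_bins - 1 bins split [-1, 1) evenly; the last bin holds exact matches.
    mode: "count" (raw counts), "norm" (relative counts), or "log" (log of 1 + count).
    """
    counts = [0.0] * n_bins
    for doc_vec in doc_vecs:
        sim = cosine(query_vec, doc_vec)
        if sim >= 1.0 - 1e-9:
            idx = n_bins - 1  # exact-match bin
        else:
            idx = min(int((sim + 1.0) / 2.0 * (n_bins - 1)), n_bins - 2)
        counts[idx] += 1.0
    if mode == "count":
        return counts
    if mode == "norm":
        total = sum(counts)
        return [c / total for c in counts] if total else counts
    if mode == "log":
        return [math.log(1.0 + c) for c in counts]
    raise ValueError("mode must be 'count', 'norm', or 'log'")

query_term = [1.0, 0.0]
doc_terms = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
print(matching_histogram(query_term, doc_terms))
print(matching_histogram(query_term, list(reversed(doc_terms))))
```

Reversing the order of the document terms yields the same histogram, which shows directly why this representation discards word order. The histogram also has a fixed length no matter how long the document is, which suits the matching of short queries against long documents.
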
The experimental results show that the DRMM deep matching model significantly outperforms most of the traditional baseline retrieval models on ad hoc retrieval tasks, and larger training data sets could possibly bring even more promising results.

In the same category as DRMM, Yang et al. [57] proposed the attention-based neural matching model (aNMM) in 2016 to perform semantic matching with global distribution of matching strengths. Unlike DRMM, the aNMM model first constructs an interaction matrix among the words in the text. Then, the model applies several kernel functions to calculate and aggregate the similarities among the words, assigning different weights to specific ranges of similarity with the same kernel functions. Once all the similarity measurements have been computed through the kernel functions, the model consolidates the outputs produced by the functions and weighs them based on the bin they belong to.

Xiong et al. [58] proposed a deep learning method based on global distribution of matching strengths named Kernel Pooling as Matching Function (K-NRM). Word embeddings are used to calculate the cosine similarity to serve as the basic matching

signals and to construct a basic matching matrix. Then, for each row of terms in the matrix, a kernel pooling function and a nonlinear feature combination are applied to extract soft-match features at different levels and to order the featured words by saliency. After that, a learning-to-rank layer combines all the learned features into a final ranking score. The advantage of this model is that the kernel-guided embeddings provide a more accurate multilevel soft match between queries and documents, leading to better model performance. However, like DRMM, the model has the disadvantage that its kernel pooling and nonlinear feature combination operations fail to capture word order.

The Convolutional Neural Networks for Soft-Matching N-Grams model (Conv-KNRM) [59] proposed by Dai et al. is an extension built on top of the K-NRM model introduced above. Dai et al. fix the missing word order information of the K-NRM model by adding an n-gram cross-matching function. The general architecture looks similar to K-NRM; however, the Conv-KNRM model adds a convolutional layer between the word embedding layer and the cross-matching layer to produce the n-gram embeddings. As K-NRM had already achieved very promising results, it is not surprising that the accuracy of the Conv-KNRM model is even better. It is worth noting that the experiments were carried out on both English and Chinese search query logs. The study also shows that the significant improvement gained by the proposed method is primarily attributable to the cross-matching of n-grams of different lengths.

7.3.3.2 Based on local context of matched terms

Besides the methods based on global distribution of matching strengths, relevance-matching functions based on local context of matched terms have also been popular in this field of research over the past few years.
Unlike global distribution matching, the deep learning methods in this category first detect the local contexts around the target term in the documents, and then conduct relevance matching between the user’s queries and the contexts of the term. After that, the relevance matching scores of the local matching signals are consolidated and output as a final result. Like the methods adopting global distribution of matching strengths, the local context matching methods can handle the matching of short queries against long documents, and they can also eliminate the noise contained in the document collections. The most prominent advantage of this category is that it can capture the word order information within the context of each specific term.

DeepRank, proposed by Pang et al. [60], constructs the relevance learning model by focusing solely on the extraction of term occurrences within a document. The model mimics the relevance judgment process conducted by humans. Specifically, the relevance learning model computes the interactions between the user’s queries and

Figure 7.7 The architecture of the proposed DeepRank model [60].

the set window surrounding the target term first, and then it adopts RNNs or CNNs to combine the features within each window to generate matching scores, using learned representations of queries, contexts, and the interactions between terms in queries and documents (see Fig. 7.7).

The Position-Aware Neural IR model (PACRR) [61], proposed by Hui et al. in the same year, is another example of learning degrees of relevance based on the local context of matched terms. Hui et al. aim to better model the interactions between queries and documents based on their positions. They believe that the degree of relevance matching depends primarily on specific positions of terms within a document. Thus, they select the positions of terms in two ways: (1) the first k terms within the document, as FirstK, and (2) the most similar context windows within the document, as k windows.

Fan et al. [62] propose a data-driven method, the Hierarchical Neural Matching model (HiNT), which is able to capture diverse relevance patterns. Their method consists of a local layer, which produces local relevance signals for semantic matching, and a global layer, which consolidates the local signals for decision making. Specifically, the global layer allows interactions and competition between relevance signals at different levels and derives the final degree of relevance from them.
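
The first step shared by the local context methods, collecting a window of text around every occurrence of a query term, can be sketched as below. This is a minimal, hypothetical illustration and not the DeepRank implementation: the function name, the `half_width` parameter, and the exact-match detection of query terms are assumptions made for the example, and the later scoring networks are omitted.

```python
def query_centric_windows(query_terms, doc_terms, half_width=2):
    """Return (position, window) pairs for every document position that holds a query term.

    Each window contains up to half_width terms on either side of the matched term
    and is clipped at the document boundaries.
    """
    query_set = set(query_terms)
    windows = []
    for pos, term in enumerate(doc_terms):
        if term in query_set:
            start = max(0, pos - half_width)
            end = min(len(doc_terms), pos + half_width + 1)
            windows.append((pos, doc_terms[start:end]))
    return windows

document = "the bert model improves passage ranking on the marco data".split()
for position, window in query_centric_windows(["bert", "ranking"], document, half_width=1):
    print(position, window)
```

Each window keeps its terms in their original order, so a model that scores the windows can use word order, and the parts of a long document that contain no query term are dropped before any matching takes place.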

7.4 Discussions and analyses

In this section, we discuss the differences among the representation-based, matching function learning-based, and relevance-based methods, and give an empirical analysis of their performance and robustness.

Representation learning-based models generate the representations of queries and documents separately before the two interact in the matching function layers. Therefore, the matching signals captured by the matching functions are already high-level semantic representations processed by the neural networks. Although the representations are derived from the raw inputs of queries and documents, they tend to lose some of the lower-level exact matching of terms or words, which can affect model performance. Because of this information loss, representation learning-based models are generally better suited to tasks such as text document classification than to IR.

The matching function learning-based methods overcome the disadvantages of the representation learning-based models to some extent. These models also feed the raw inputs from queries and documents directly into the neural networks. However, they construct both basic low-level exact matching signals and high-level semantic matching signals. The matching signals from queries and documents interact with each other earlier than in representation learning-based methods. These methods then consolidate the results to generate the final matching patterns and scores from the given signals. This learning process largely reduces the information loss in training on queries and documents; thus these methods can deliver more robust results overall than representation learning-based ones.

The two categories discussed above are based on measurements of query-document similarity. However, similarity is not exactly the same as relevance. When short queries meet long documents, it is very hard to learn purely from text-based similarities. Therefore, relevance-based methods provide a solution by constructing contexts for both queries and documents for comparison.
These models solve the problem of bias in document length; thus they generally produce the best results of the three categories. The subapproach based on global distributions of matching signals in the relevance learning category can not only handle the matching of short queries against long documents, but also achieve relatively more robust distributions of matching strength. However, word order information is usually missing in these models. Unlike global distribution matching, the methods based on the local context of matched terms first detect the local contexts around the target term in the documents, and then conduct relevance matching between queries and contexts. Models belonging to this category can also perform well when matching short queries against long documents. Moreover, they can eliminate the noise contained in the document collections. The most significant advantage of this category is that the local context of matched terms captures word order information, thereby generating even better results. The experimental results of the state-of-the-art models are presented in detail in Table 7.1 [63].

Table 7.1 The experimental results of all existing literature on ad hoc retrieval data sets [63]. A dash marks a cell for which no value is given.

                      Robust04          GOV2MQ2007        Sogou-log
Model                 MAP     P@20      MAP     P@10      NDCG@1  NDCG@10
BM25 (1994)           0.255   0.370     0.450   0.366     0.142   0.287
QL (1998)             0.253   0.369     -       -         0.126   0.282
RM3 (2001)            0.287   0.377     -       -         -       -
RankSVM (2002)        -       -         0.464   0.381     0.146   0.309
LambdaMart (2010)     -       -         0.468   0.384     -       -
DSSM (2013)           0.095   0.171     0.409   0.352     -       -
CDSSM (2014)          0.067   0.125     0.364   0.291     0.144   0.333
ARC-I (2014)          0.041   0.065     0.417   0.364     -       -
ARC-II (2014)         0.067   0.128     0.421   0.366     -       -
MP (2016)             0.189   0.290     0.434   0.371     0.218   0.380
Match-SRNN (2016)     -       -         0.456   0.384     -       -
DRMM (2016)           0.279   0.382     0.467   0.388     0.137   0.315
Duet (2017)           -       -         0.474   0.398     -       -
DeepRank (2017)       -       -         0.497   0.412     -       -
K-NRM (2017)          -       -         -       -         0.264   0.428
SNRM (2018)           0.286   0.377     -       -         -       -
SNRM + PRF (2018)     0.297   0.395     -       -         -       -
CONV-KNRM (2018)      -       -         -       -         0.336   0.481
HiNT (2018)           -       -         0.502   0.418     -       -
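
The metrics reported in Table 7.1 can be computed from a ranked list of relevance labels, as in the following sketch. It is a hypothetical helper set, not the evaluation code behind Ref. [63]: it assumes that every relevant document appears somewhere in the ranked list when computing average precision, and it uses the exponential-gain form of DCG, which is one of several common variants.

```python
import math

def precision_at_k(rels, k):
    """P@k: fraction of the top k results that are relevant (label > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def average_precision(rels):
    """AP for one query. Assumption: all relevant documents are in the ranked list."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(rels_per_query):
    """MAP: mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rels_per_query) / len(rels_per_query)

def dcg_at_k(rels, k):
    """DCG@k with gain 2^rel - 1 and a log2 position discount."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """NDCG@k: DCG@k divided by the DCG@k of the ideal ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

# Relevance labels of one ranked list, from the top result downward.
ranking = [1, 0, 1, 0, 0]
print(precision_at_k(ranking, 5))
print(average_precision(ranking))
print(ndcg_at_k(ranking, 5))
```

P@k looks only at the top of the list, MAP rewards placing every relevant document early, and NDCG additionally accounts for graded relevance labels, which is why the three data sets in the table are reported with different metrics.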

In general, among the representation learning-based models, RNN-based models tend to perform better because they take long-term dependencies into consideration. Besides, the methods using matching function learning perform better than the methods using representation learning, and the methods using representation learning are in turn more capable than the traditional baseline models. This indicates, to some extent, the importance of semantic representation. On the other hand, the ARC-I model and the CNTN model produce less promising results than the LSTM-RNN model, which also suggests the importance of modeling word order. Compared with the relevance-based learning methods, the matching function learning methods are less effective on the Web Track 14 data set. The different relevance-based learning models have advantages under different circumstances; in general, however, models adopting the local context of matched terms significantly outperform models adopting the global distribution of matching signals.

Aside from that, compared with the models discussed previously, the very good results generated by the BERT model are worth discussing and analyzing separately. The BERT model clearly has a significant comparative advantage in processing text documents that are relatively short. According to Ref. [64],

some of the early neural models are not good at ranking social media posts that are short in length, and the experimental results generated by those models are basically similar to those of the RM3 baseline models. We can also see that the simple application of BERT to ad hoc IR tasks provides very promising results and significant improvements over all the other existing neural IR models.

7.5 Conclusions and Future Work

In this book chapter, we summarize the recent developments in deep learning-based models for tackling IR problems and introduce a novel way of classifying the existing IR models by their feature learning characteristics. The shift from traditional models, which include basic handcrafted retrieval models, semantic-based models, term dependency-based models, and learning to rank models, toward deep learning-based models has changed research in IR significantly. As we have seen from these discussions, the different deep learning-based approaches, such as representation learning, matching function learning, and relevance learning, all have their own advantages and disadvantages. However, the capability of neural ranking models to extract features directly from raw text inputs has been shown to overcome many limitations of traditional IR models that rely on handcrafted features. Moreover, deep learning-based models are more capable of modeling complicated matching patterns than traditional retrieval models. In the future, more hybrid models can be built on top of BERT and other neural ranking models to produce better IR results. Meanwhile, why neural IR models can provide such promising results is still underexplored, and the question of their interpretability deserves further effort in the near future.

References [1] H. Fang, T. Tao, C. Zhai, Diagnostic evaluation of information retrieval models, ACM Trans. Inf. Syst. vol. V (2010) 146. [2] A. Singhal, C. Buckley, M. Mitra, Pivoted document length normalization, in: Proceedings of the 1996 ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 2129. [3] S. Robertson, S. Walker, Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval, in: Proceedings of SIGIR’94, 1994, pp. 232241. [4] M. Hancock-Beaulieu, M. Gatford, X. Huang, S.E. Robertson, P.W. Steve Walker, Williams: Okapi at TREC-5, in: TREC, 1996. [5] C. Zhai, J. Lafferty, Model-based feedback in the language modeling approach to information retrieval, in: Tenth International Conference on Information and Knowledge Management (CIKM 2001), 2001a, pp. 403410. [6] C. Zhai, J. Lafferty, A study of smoothing methods for language models applied to ad hoc information retrieval, in: Proceedings of SIGIR’01, 2001b, pp. 334342.

149

150

Runjie Zhu et al.

[7] X. Yin, X.J. Huang, Z. Li, X. Zhou, A survival modeling approach to biomedical search result diversification using Wikipedia, in: IEEE Trans. Knowl. Data Eng. 25(6), 2012, 12011212. [8] X. Huang, M. Zhong, L. Si, York University at TREC 2005: genomics track, in: Proceedings of the Fourteenth Text REtrieval Conference (TREC), Gaithersburg, Maryland, USA, November 1518, 2005. [9] X. Huang, Q. Hu. A bayesian learning approach to promoting diversity in ranking for biomedical information retrieval, in: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Boston, MA, USA, July 1923, 2009, pp. 307314. [10] A. Berger, J. Lafferty, Information retrieval as statistical translation, in: Proc. 22nd Ann. Int’l ACM Conf. Research and Development in Information Retrieval (SIGIR ’99), 1999, pp. 222229. [11] X. Wei, W.B. Croft, LDA-based document models for Ad- Hoc retrieval, in: Proc. 29th Ann. Int’l ACM Conf. Research and Development in Information Retrieval (SIGIR ’06), 2006, pp. 178185. [12] J. Miao, J.X. Huang, J. Zhao, TopPRF: a probabilistic framework for integrating topic space into Pseudo relevance feedback, ACM Trans. Inf. Syst 34 (4) (2016) 22:122:36. [13] F. Jian, J.X. Huang, J. Zhao, T. He, P. Hu. A simple enhancement for Ad-hoc information retrieval via topic modelling, in: SIGIR, 2016, pp. 733736. [14] X. Tu, J.X. Huang, J. Luo, T. He, Exploiting semantic Coherence features for information retrieval, SIGIR (2016) 837840. [15] D. Metzler, W. Croft, A Markov random field model for term dependencies, in: Proceedings of the 2005 ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’05, Salvador, Brazil, August 1519, 2005. [16] J. Zhao, J. Huang, B. He, CRTER: using cross terms to enhance probabilistic information retrieval, in: Proceedings of the 2011 ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’11, Beijing, China, July 2428, 2011. 
[17] J. Zhao, J.X. Huang, Z. Ye, Modeling term associations for probabilistic information retrieval, ACM Trans. Inf. Syst 32 (2) (2014) 7:17:47. [18] T. Liu, Learning to rank for information retrieval, in: Microsoft Research Asia. ,http://didawikinf. di.unipi.it/lib/exe/fetch.php/magistraleinformatica/ir/ir13/1_-_learning_to_rank.pdf.. [19] H. Li, A short introduction to learning to rank, in: IEICE Trans. Inf. & Syst., vol. E94D, No. 10, 2011 ,http://times.cs.uiuc.edu/course/598f14/l2r.pdf.. [20] D. Cossock, T. Zhang, Subset ranking using regression, in: COLT ’06: Proceedings of the 19th Annual Conference on Learning Theory, 2006, pp. 605619. [21] P. Li, C. Burges, Q. Wu, McRank: learning to rank using multiple classification and gradient boosting, in: J. Platt, D. Koller, Y. Singer, S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, MIT Press, Cambridge, MA, 2008, pp. 897904. [22] K. Crammer, Y. Singer, Pranking with ranking, in: NIPS, 2001, pp. 641647. [23] A. Shashua, A. Levin, Ranking with large margin principle: two approaches, in: S.T.S. Becker, K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15, MIT Press, 2002. [24] R. Herbrich, T. Graepel, K. Obermayer, Large Margin rank boundaries for ordinal regression, MIT Press, Cambridge, MA, 2000. [25] Y. Freund, R.D. Iyer, R.E. Schapire, Y. Singer, An efficient boosting algorithm for combining preferences, J. Mach. Learn. Res. vol.4 (2003) 933969. [26] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, et al., Learning to rank using gradient descent, in: ICML ’05: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 8996. [27] Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, G. Sun, A general boosting method and its application to learning ranking functions for web search,”, in: J. Platt, D. Koller, Y. Singer, S. Roweis (Eds.), Advances in Neural Information Processing Systems, 20, MIT Press, Cambridge, MA, 2008.

Deep learning on information retrieval and its applications

[28] Y. Cao, J. Xu, T.Y. Liu, H. Li, Y. Huang, H.W. Hon, Adapting ranking SVM to document retrieval, in: SIGIR ’06, 2006, pp. 186–193.
[29] C. Burges, R. Ragno, Q. Le, Learning to rank with nonsmooth cost functions, Advances in Neural Information Processing Systems 18, MIT Press, Cambridge, MA, 2006, pp. 395–402.
[30] Q. Wu, C.J.C. Burges, K.M. Svore, J. Gao, Adapting boosting for information retrieval measures, Inf. Retr. 13 (3) (2010) 254–270.
[31] Z. Cao, T. Qin, T.Y. Liu, M.F. Tsai, H. Li, Learning to rank: from pairwise approach to listwise approach, in: ICML ’07: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 129–136.
[32] F. Xia, T.Y. Liu, J. Wang, W. Zhang, H. Li, Listwise approach to learning to rank: theory and algorithm, in: ICML ’08: Proceedings of the 25th International Conference on Machine Learning, ACM, New York, NY, USA, 2008, pp. 1192–1199.
[33] J. Xu, H. Li, AdaRank: a boosting algorithm for information retrieval, in: SIGIR ’07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2007, pp. 391–398.
[34] Y. Yue, T. Finley, F. Radlinski, T. Joachims, A support vector method for optimizing average precision, in: Proceedings of the 30th Annual International ACM SIGIR Conference, 2007, pp. 271–278.
[35] M. Taylor, J. Guiver, S. Robertson, T. Minka, SoftRank: optimizing non-smooth rank metrics, in: WSDM ’08: Proceedings of the International Conference on Web Search and Web Data Mining, ACM, New York, NY, USA, 2008, pp. 77–86.
[36] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured semantic models for web search using clickthrough data, in: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13, ACM, New York, NY, USA, 2013, pp. 2333–2338.
[37] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[38] X. He, J. Gao, L. Deng, Deep learning for natural language processing: theory and practice tutorial, in: CIKM ’14 Tutorial, 2014. https://www.microsoft.com/en-us/research/publication/deep-learningfor-natural-language-processing-theory-and-practice-tutorial/
[39] Y. Shen, X. He, J. Gao, L. Deng, G. Mesnil, A latent semantic model with convolutional-pooling structure for information retrieval, in: Proceedings of the 23rd ACM International Conference on Information & Knowledge Management, CIKM ’14, ACM, Shanghai, China, November 3–7, 2014.
[40] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neural network architectures for matching natural language sentences, Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 2042–2050.
[41] X. Qiu, X. Huang, Convolutional neural tensor network architecture for community-based question answering, in: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015). https://www.ijcai.org/Proceedings/15/Papers/188.pdf
[42] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, et al., Deep sentence embedding using long short-term memory networks: analysis and application to information retrieval, in: arXiv. https://arxiv.org/pdf/1502.06922.pdf
[43] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, X. Cheng, Text matching as image recognition, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[44] S. Wan, Y. Lan, J. Xu, J. Guo, L. Pang, X. Cheng, Match-SRNN: modeling the recursive matching structure with spatial RNN, in: arXiv. https://arxiv.org/pdf/1604.04378.pdf
[45] A. Parikh, O. Tackstrom, D. Das, J. Uszkoreit, A decomposable attention model for natural language inference, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, November 1–5, 2016, pp. 2249–2255.

[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, et al., Attention is all you need, in: arXiv. https://arxiv.org/pdf/1706.03762.pdf
[47] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: arXiv. https://arxiv.org/pdf/1810.04805.pdf
[48] R. Nogueira, K. Cho, Passage re-ranking with BERT, in: arXiv. https://arxiv.org/pdf/1901.04085.pdf
[49] H. Padigela, H. Zamani, W. Croft, Investigating the successes and failures of BERT for passage reranking, in: arXiv. https://arxiv.org/pdf/1905.01758v1.pdf
[50] W. Yang, H. Zhang, J. Lin, Simple applications of BERT for Ad Hoc document retrieval, in: arXiv. https://arxiv.org/pdf/1903.10972.pdf
[51] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, CEDR: contextualized embeddings for document ranking, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Paris, France, July 21–25, 2019.
[52] Z. Dai, J. Callan, Deeper text understanding for IR with contextual neural language modeling, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, Paris, France, July 21–25, 2019.
[53] Y. Qiao, C. Xiong, Z. Liu, Z. Liu, Understanding the behaviors of BERT in ranking, in: arXiv. https://arxiv.org/pdf/1904.07531.pdf
[54] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q.V. Le, XLNet: generalized autoregressive pretraining for language understanding, in: arXiv. https://arxiv.org/pdf/1906.08237.pdf
[55] B. Mitra, F. Diaz, N. Craswell, Learning to match using local and distributed representations of text for web search, in: 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License, WWW 2017, Perth, Australia, April 3–7, 2017.
[56] J. Guo, Y. Fan, Q. Ai, W. Croft, A deep relevance matching model for Ad-hoc retrieval, in: Proceedings of the 25th ACM International Conference on Information & Knowledge Management, CIKM ’16, Indianapolis, IN, USA, October 24–28, 2016.
[57] L. Yang, Q. Ai, J. Guo, W. Croft, aNMM: ranking short answer texts with attention-based neural matching model, in: Proceedings of the 25th ACM International Conference on Information & Knowledge Management, CIKM ’16, Indianapolis, IN, USA, October 24–28, 2016.
[58] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural Ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, Shinjuku, Tokyo, Japan, August 7–11, 2017.
[59] Z. Dai, C. Xiong, J. Callan, Z. Liu, Convolutional neural networks for soft-matching N-grams in Ad-hoc search, in: Proceedings of the 11th ACM International Conference on Web Search and Data Mining, WSDM ’18, Marina Del Rey, CA, USA, February 5–9, 2018.
[60] L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, X. Cheng, DeepRank: a new deep architecture for relevance ranking in information retrieval, in: Proceedings of the 26th ACM International Conference on Information & Knowledge Management, CIKM ’17, ACM, Singapore, November 6–10, 2017.
[61] K. Hui, A. Yates, K. Berberich, G. Melo, PACRR: a position-aware neural IR model for relevance matching, in: arXiv. https://arxiv.org/pdf/1704.03940.pdf
[62] Y. Fan, J. Guo, Y. Lan, J. Xu, C. Zhai, X. Cheng, Modeling diverse relevance patterns in Ad-hoc retrieval, in: Proceedings of the 2018 ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’18, Ann Arbor, MI, USA, July 8–12, 2018.
[63] J. Guo, Y. Fan, L. Pang, L. Yang, Q. Ai, H. Zamani, et al., A deep look into neural ranking models for information retrieval, preprint submitted to Journal of Information Processing and Management, March 19, 2019.
[64] J. Rao, W. Yang, Y. Zhang, F. Ture, J. Lin, Multi-perspective relevance matching with hierarchical ConvNets for social media search, in: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), 2019.

Further reading

Q. Chen, Q. Hu, J. Huang, L. He, CA-RNN: using context-aligned recurrent neural networks for modeling sentence similarity, in: The Thirty-Second AAAI Conference on Artificial Intelligence, AAAI-18.
H. Li, Opportunities and challenges in deep learning for information retrieval. http://www.hanglihl.com/uploads/3/4/4/6/34465961/tsinghua_opportunities_and_challenges_in_deep_learning_for_information_retrieval.pdf
P. Shi, J. Rao, J. Lin, Simple attention-based representation learning for ranking short social media posts, arXiv:1811.01013, 2018.
J. Xu, X. He, H. Li, Deep learning for matching in search and recommendation, in: Tutorial at SIGIR 2018, Ann Arbor, July 2018.
W. Yang, End-to-end neural information retrieval. https://uwspace.uwaterloo.ca/bitstream/handle/10012/14597/Yang_Wei.pdf?sequence=4&isAllowed=y

CHAPTER EIGHT

Electrical impedance tomography image reconstruction based on autoencoders and extreme learning machines

Juliana Carneiro Gomes¹, Jessiane Mônica S. Pereira¹, Maíra Araújo de Santana², Washington Wagner Azevedo da Silva², Ricardo Emmanuel de Souza² and Wellington Pinheiro dos Santos²

¹ Polytechnic School of Pernambuco, University of Pernambuco, UPE, Recife, Brazil
² Department of Biomedical Engineering, Federal University of Pernambuco, UFPE, Recife, Brazil

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00009-0 © 2020 Elsevier Inc. All rights reserved.

8.1 Introduction

One of the prominent areas of medicine is medical imaging (MI), which allows the anatomical and functional visualization of various organs and tissues of the patient. MI comprises several modalities, including computed tomography (CT) and nuclear magnetic resonance imaging (NMRI). Additionally, MI analysis has been greatly improved by the application of machine learning (ML) techniques to build intelligent systems for diagnosis support [1,2]. However, despite their unquestionable benefits, many of these methods use radiation that is harmful to human health, or rely on expensive, oversized equipment that restricts their access and use. These limitations leave gaps in the field of medical diagnosis that many researchers are exploring with alternative techniques such as electrical impedance tomography (EIT) [3–8].

EIT is a low-cost, noninvasive imaging technique that acquires an image of a body section by applying an alternating, low-amplitude, high-frequency electric current to pairs of surface electrodes and measuring the resulting electrical potentials. This information is then sent to a control and acquisition system, which processes the data and reconstructs images corresponding to the domain under study, as can be seen in Fig. 8.1. The images are obtained by solving the inverse problem and may correspond to an estimate of the electrical conductivity or of the permittivity distribution in the interior of the domain. Conductivity can be understood as the ability of the medium to allow the displacement of electric charges. Permittivity measures the ease of polarization of the material. Normally, only the conductivity, the real part of the impedance, is considered and reconstructed [10–12].

Figure 8.1 Basic operation of the electrical impedance tomography [9]. Diagram labels: body section; excitation current; control and data acquisition system; electrical potential response; electrodes; image reconstruction.

EIT has applications in geophysics, where it is known as electrical resistivity tomography [13] and can be used to detect mineral deposits, for instance. In industry, EIT can be used to detect tank leakage [14], and in botany, to generate images of the interior of tree trunks [15]. In medicine, it can be used in the detection of breast cancer [16], the diagnosis of prostate cancer [17], the monitoring of lung ventilation in mechanically ventilated patients [18,19], and the measurement of intracranial bleeding [20]. One of the most promising medical applications is the monitoring of mechanical ventilation, facilitated by the size and high resistivity of the lungs. Equipment for this purpose can already be found in hospitals.

EIT has several advantages: it is noninvasive, portable, and low cost. It is also safe, since it does not use ionizing radiation, which can be harmful to human health. Another advantage is its good temporal response, which allows fast detection of changes in the medium under study.

Despite these benefits, the reconstruction of EIT images involves a direct problem and a nonlinear, ill-posed inverse problem. The direct problem determines the electrical potentials within the body section and at its edge, based on the applied alternating current pattern. This relation is given by the Laplace equation. The inverse problem, on the other hand, estimates the conductivity and electrical permittivity distributions of the domain, which are not unique for a given set of electrical potentials.
The solution of the inverse problem can be unstable, with great sensitivity to numerical errors and with electric potentials varying nonlinearly with the excitation current. Consequently, compared with other imaging techniques such as CT and NMRI, EIT has low spatial resolution and a high computational cost of reconstruction, which is why it has not yet been firmly established [10,12].

The objective of this chapter is to propose an innovative method for the reconstruction of EIT images. The first element is the use of autoencoders, deep neural networks with unsupervised training, to denoise the electrical potential data measured on the surface of the domain. The second is the application of the classical backprojection algorithm to obtain EIT images from sinograms approximated by artificial neural networks (ANNs) of the extreme learning machine (ELM) type. The ELM networks learn, by nonlinear regression, the mapping from the filtered electrical potentials to the sinograms used by classical backprojection. The proposal was validated with a training set of 4000 synthetic phantoms. Since ELMs train quickly, the proposed reconstruction method is also very fast.

This chapter is organized as follows: Section 8.2 presents a brief state of the art of EIT image reconstruction methods; Section 8.3 presents our methodology, the theoretical methods, the synthetic image database we created, and our proposal; Section 8.4 provides the results of our experiments with qualitative and quantitative analyses; finally, Section 8.5 summarizes the scientific contribution of this work and discusses potential future work.

8.2 Related works

One reconstruction method was described and tested in Ref. [21]. It uses the finite element method (FEM), which is widely used to reconstruct EIT images, especially to obtain numerical solutions for the forward problem. However, generating the mesh can be computationally expensive, mainly when solving the inverse problem. Furthermore, conventional iterative methods, such as Gauss-Newton, can also demand a large amount of computational power because of the large matrices involved. For this reason, Ref. [21] proposed an iterative algorithm combining adaptive mesh refinement with an adaptive Kaczmarz method. The former refines the mesh until a desired accuracy is reached according to a specific error measure, thereby avoiding unnecessary refinement that does not improve the accuracy of the solution. The latter is able to distinguish the conductivity distribution given a prior estimate, using optimal current pattern generation. The authors tested their method with simulations in the Electrical Impedance and Diffuse Optical Tomography Reconstruction Software (EIDORS) environment, using a 2D circular model with 32 electrodes on the surface of the domain. They observed improved spatial resolution and, in a comparison with the Gauss-Newton method using four iterations, showed that the proposed method requires one-third of the memory cost. They also performed experiments on lung and heart phantoms and quantified the performance with two error metrics, the mean squared error and the
normalized norm difference between the obtained voltages and the desired ones. However, the experiments on phantoms did not use the optimal current patterns, so the performance could not be fully assessed.

Ref. [22] proposed a strategy that limits the conductivity spectrum to a region of interest (ROI), creating a method adapted from the linear sensitivity-theorem-based conjugate gradient (SCG), named RROI. The idea is to use all boundary information to reconstruct an image within a specific ROI, considering that in some applications the conductivity varies only in a given region of the domain. The order of the matrix equation of the direct problem is thereby reduced and, consequently, so is the computational load. The method was validated by simulation and by experiments with phantoms. The authors concluded that it can achieve higher spatial resolution and good accuracy compared with images from SCG applied to the entire domain, even with fewer electrodes and the same data acquisition system. The qualitative results were supported by two metrics: a spatial resolution comparison and the correlation coefficient. However, the proposed method is more sensitive to noise.

Ref. [23] proposed an EIT reconstruction technique focused on reducing computational cost while maintaining good image resolution. The approach is based on compressive sensing (CS), which reduces the number of samples and avoids data redundancy. A precondition for this method, however, is that the data be sparse enough. In this context, the authors developed a reconstruction algorithm that uses patch-based sparse representation as a preprocessing step. The patch-based method has shown an ability to detect local image features and to effectively remove artifacts and noisy data when its parameters are chosen appropriately.
Tests were performed on both simulated and experimental data, using the relative error as the metric of image quality. The authors showed that the patch-based sparsity method achieves better image quality than traditional conjugate gradient (CG), global sparsity, and K-singular value decomposition (K-SVD) denoising reconstruction methods. It is also not sensitive to noise. Nevertheless, its computational time is still long. In addition, the new method based on CS theory achieved higher sampling rates, thereby reducing the computational time. As a result, the effective reconstruction time is reduced while high image resolution is maintained.

An alternative way of finding a solution to the EIT problem is to treat it as an optimization problem that minimizes the relative error between the surface potentials of a candidate solution and the known surface potentials, as described by our research group in Refs. [24] and [25]. We compared different approaches to solving this optimization problem. The first was the genetic algorithm (GA), which is inspired by evolution theory and genetic principles such as species adaptation and natural selection. This method randomly selects two individuals, and the fitter one is chosen for crossover, with occasional mutation, to generate new individuals. We also used the
meta-heuristic Fish School Search (FSS) in this optimization problem, where each fish represents a possible solution to the system, that is, a point in the domain of the fitness function. The FSS algorithm has operators based on feeding and swimming. The feeding operator quantifies how successful a fish is through the variation of its fitness function: depending on how close the fish is to food, its weight increases or decreases. The swimming operator is responsible for the movement of the fish in their search for food. All experiments were performed in the EIDORS environment with 16 surface electrodes and circular mesh grids with 415 elements. We aimed to detect irregular objects in three different configurations and to compare the results of plain FSS, FSS with non-blind search, and plain GA. We showed that, with 500 iterations, the images were anatomically consistent, but GA yielded lower relative errors and better image quality.

Ref. [26], on the other hand, used both ANN and particle swarm optimization (PSO) to reconstruct EIT images. By combining a nonlinear algorithm with an optimization method, the authors aimed to achieve better local and global convergence while solving a nonlinear system with a limited number of iterations. They simulated noisy data by adding white Gaussian noise and also considered distortions on the model’s boundary to approximate real-life problems. The method was validated with simulated experiments and phantoms, and the results were compared with the one-step Gauss-Newton method, a linear inverse solver, and with a nonlinear iterative solver, the primal/dual interior point method (PDIPM). The authors concluded that the linear solver produced strong smoothing, making it difficult to estimate boundaries. In addition, the described method showed higher visual fidelity and lower resolution error when compared with the original images.
Finally, they argue that ANN and PSO can solve EIT problems faster and can also deal with imperfections.

In Ref. [27], the authors described a methodology for developing an ML-based model, specifically an ANN, to solve the inverse problem of EIT. This choice is motivated by the ability of ANNs to model extremely nonlinear relationships. The main objective of the work was to obtain an accurate distribution of the electrical conductivities of a body through four steps: (1) development of a base of 10,220 meshes with 844 elements and added artifacts, representing the virtual domains; (2) simulation of the direct EIT problem on the EIDORS platform using 16 electrodes, with a set of 208 measurements obtained for each simulation; (3) development and testing of an ML regression model; and (4) a post-processing algorithm in which values below a given threshold were considered background. The results were analyzed qualitatively and quantitatively (mean absolute error) and compared with well-known reconstruction methods: the PDIPM and the Improved Gauss-Newton (IGN) method. The proposed method obtained a significant kappa index and accuracy (97.57% and 94.60%, respectively). Despite this, the authors recognized that the method needs improvements, especially for better detection of
objects near the edge or near other objects, since in many cases two objects were reconstructed as just one.

In this chapter, a novel method is proposed to support EIT reconstruction, improving image quality and computational cost. The proposed method combines a linear solver, the classical backprojection algorithm, with nonlinear algorithms based on ANNs. The idea is to approximate sinograms from electrical potential data using ELMs, given their low computational complexity and good generalization performance. These sinograms can then be reconstructed through backprojection to generate EIT images. Experiments were performed with our own synthetic image database, which we developed in the GNU/Octave environment. The effectiveness of the method is measured by the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), as well as by visual inspection.

8.3 Materials and methods

8.3.1 Electrical impedance tomography problems and reconstruction

The electrical impedance (Z) is a quantity that expresses the opposition to the passage of alternating electric current in a system, and its unit of measure is the ohm (Ω) [28]. In this context, bioimpedance is based on the fact that different tissues and organs oppose the passage of electric current in distinct ways [29]. It results from the combined effects of the conductances, capacitances, and inductances of the tissues, which are related to the electrical properties of conductivity $\sigma$ [S m$^{-1}$], permittivity $\varepsilon$ [F m$^{-1}$], and permeability $\mu$ [H m$^{-1}$]. Since in biological systems the current flows through dissolved ions, the conductivity depends strongly on the level of tissue hydration [30].

The EIT technique depends on the solution of direct and inverse problems [10,12,31,32]. The inverse problem is the estimation of the electrical conductivity and permittivity distributions in the interior of the domain. However, because the function that represents the potentials does not vary linearly with the applied current, there is no single distribution corresponding to a given set of measured electrical potentials. Therefore, the solution may be unstable, with high sensitivity to numerical errors and experimental noise. From the mathematical point of view, the EIT problem is governed by the Poisson equation [25,31,33,34], given in Eq. (8.1):

$$\nabla \cdot \left[ \sigma(\vec{u}) \nabla \varphi(\vec{u}) \right] = 0, \quad \forall \vec{u} \in \Omega, \tag{8.1}$$
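with the boundary conditions in Eqs. (8.2) and (8.3):

$$\varphi_{ext}(\vec{u}) = \varphi(\vec{u}), \quad \forall \vec{u} \in \partial\Omega, \tag{8.2}$$

$$I(\vec{u}) = -\sigma(\vec{u}) \nabla \varphi(\vec{u}) \cdot \hat{n}(\vec{u}), \quad \forall \vec{u} \in \partial\Omega, \tag{8.3}$$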

where $\vec{u} = (x, y, z)$ is the position within the object, $\varphi(\vec{u})$ the distribution of potentials, $\varphi_{ext}(\vec{u})$ the distribution of potentials on the surface of the domain, $I(\vec{u})$ the alternating electrical current, $\sigma(\vec{u})$ the electrical conductivity distribution, that is, the image of interest, $\Omega$ the domain, $\partial\Omega$ its edge, and $\hat{n}(\vec{u})$ the normal vector to the edge [25,31,33,34].

For the direct problem, the electrical potentials measured on the contour are determined from the current excitation pattern and the conductivity distribution, a relation governed by Laplace’s equation. It is modeled by the relation given in Eq. (8.4):

$$\varphi_{ext}(\vec{u}) = f(I(\vec{u}), \sigma(\vec{u})), \quad \forall \vec{u} \in \partial\Omega, \tag{8.4}$$

with the boundary condition in Eq. (8.5):

$$\sigma \frac{\partial \varphi}{\partial \vec{n}} = J, \tag{8.5}$$

where $J$ is the electrical current density [35].

8.3.2 EIT image reconstruction techniques

In general, EIT reconstruction algorithms can be divided into iterative and noniterative. Noniterative algorithms treat the inverse problem of EIT as linear and the distribution of electrical conductivities as homogeneous. This makes them faster and less computationally complex, which in turn makes them feasible for use in medical practice. Among these algorithms, the best known is backprojection.

Projections, or the Radon transform, of a sample are sets of line integrals at specific angles. When projections at different angles are organized in the rows of a matrix, the result is called a sinogram: a graphical representation in which the vertical axis is the distance from the origin to the electrode selected on the surface of the domain, and the horizontal axis represents the angle at which the slice is measured. The name sinogram originates from the fact that the Radon transform of a single off-center point is a sinusoid. With all the information from a sinogram, it is possible to reconstruct an image through the backprojection process, that is, the application of the inverse Radon transform. Backprojection reconstruction is a simple and fast summation method commonly used in computed tomography image reconstruction [36–38]. In our proposal, the projection angles are defined by the number of electrodes, and each projection represents the sum of resistivities or conductivities.
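
The projection and backprojection steps can be illustrated with a deliberately small sketch. It is not the chapter's implementation: it uses only two projection directions (row sums and column sums) instead of one direction per electrode, applies no filtering, and the function names `project` and `backproject` are hypothetical.

```python
def project(image):
    """Sum a square image along two orthogonal directions.

    Returns a two-row "sinogram": the first row holds the sums along
    each image row, the second the sums along each image column.
    """
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return [row_sums, col_sums]


def backproject(sinogram, size):
    """Unfiltered backprojection: smear each projection back over the
    image along the direction it was summed, and add the contributions."""
    row_sums, col_sums = sinogram
    return [[row_sums[r] + col_sums[c] for c in range(size)]
            for r in range(size)]


# A 4 x 4 test image with a single bright pixel at row 1, column 2.
image = [[0, 0, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]

sinogram = project(image)
recon = backproject(sinogram, 4)

# The reconstruction is blurred along both directions, but its maximum
# sits where the original object was.
peak = max(((r, c) for r in range(4) for c in range(4)),
           key=lambda rc: recon[rc[0]][rc[1]])
```

With more projection angles the streaks from each direction overlap only at the true object positions, which is why adding angles improves the reconstruction.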

8.3.3 Autoencoders

An autoencoder is a type of deep neural network that learns from data without supervision, which means that no labeled data is necessary to enable learning. The first feature of this network is that it has the same number of neurons in the input and output layers, so the output is expected to replicate the input as accurately as possible. For this reason, the training process is driven by an error measurement between the input and the reconstructed output [39,40]. In practice, the autoencoder architecture is composed of encoder layers that perform data compression or dimensionality reduction by learning to ignore noise. The decoding layers then learn to decode the data back into its original format as closely as possible [41]. Consider, for instance, an input vector $x$. The encoder maps $x$ into another vector $z$, as shown in Eq. (8.6):

$$z = h^{(l)}\left(W^{(l)} x + b^{(l)}\right), \tag{8.6}$$

where $(l)$ indicates the corresponding layer, $h$ the encoder’s transfer function, $W$ the weight matrix, and $b$ the bias vector. Next, the decoder maps the encoded data $z$ into an approximation of the original input $x$, denoted by $\hat{x}$:

$$\hat{x} = h^{(l)}\left(W^{(l)} z + b^{(l)}\right). \tag{8.7}$$

In our application, the encoder and decoder both used the logistic sigmoid as transfer function, defined by:

$$f(z) = \frac{1}{1 + e^{-z}}. \tag{8.8}$$

The number of neurons in the hidden layers is 10, and the mean squared error was used as the cost function, that is, the optimization metric. The number of iterations was set to 1000.
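
As a rough illustration of Eqs. (8.6) to (8.8), the sketch below runs one forward pass through an encoder and a decoder and evaluates the mean squared error that training would minimize. It makes several assumptions: the weights are random and untrained, the input size of 16 is a hypothetical choice (the text does not state it), and the helper names are invented for this example. Only the 10 hidden neurons, the sigmoid transfer function, and the MSE cost come from the text.

```python
import math
import random


def sigmoid(a):
    """Logistic sigmoid of Eq. (8.8)."""
    return 1.0 / (1.0 + math.exp(-a))


def layer(W, b, v):
    """Apply h(W v + b) with a sigmoid transfer function."""
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + b_i)
            for row, b_i in zip(W, b)]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


def mse(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


rng = random.Random(0)
n_inputs = 16   # hypothetical number of potential measurements
n_hidden = 10   # hidden neurons, as stated in the text

W_enc, b_enc = random_matrix(n_hidden, n_inputs, rng), [0.0] * n_hidden
W_dec, b_dec = random_matrix(n_inputs, n_hidden, rng), [0.0] * n_inputs

x = [rng.random() for _ in range(n_inputs)]  # stand-in for normalized potentials
z = layer(W_enc, b_enc, x)                   # Eq. (8.6): encoding
x_hat = layer(W_dec, b_dec, z)               # Eq. (8.7): reconstruction
loss = mse(x, x_hat)                         # cost that training would reduce
```

Training would repeat this forward pass and adjust the weights to lower `loss`; the optimizer itself is omitted here.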

8.3.4 Extreme learning machines

ANNs are machines inspired by the functioning of the human brain. They were originally designed to model the way the brain performs certain functions. ANNs are endowed with parallelism, can be linear or nonlinear, and are able to learn by observing the environment. They are composed of basic processing units, the neurons, which interact with each other through synapses, like their biological counterparts. In the brain, the modification of this communication provides plasticity and learning ability. Likewise, neural networks modify their synaptic weights in order to achieve the desired outputs. Consequently, they have the capacity to learn and to generalize,
that is, they can generate appropriate outputs for conditions or situations that were not present in the learning stage, which enables them to solve many complex problems [33,42–44].

Canonical ELMs are supervised ANNs with a single hidden layer, known for their high speed, low computational complexity, and good generalization performance, which can be adjusted by simply changing the kernels of the hidden neurons. This single-layer network can be modeled as in Eq. (8.9):

$$\sum_{i=1}^{N_h} \beta_i f(w_i \cdot x_j + b_i) = t_j, \quad j \in [1, N], \tag{8.9}$$

where $N$ is the number of samples $(x_j, t_j)$, $N_h$ the number of neurons in the hidden layer, $f(x)$ the activation function, $w_i$ the input weights, $b_i$ the biases, $\beta_i$ the output weights, and $t_j$ the target outputs. The input weights of the hidden layer are determined randomly and independently of the training set, whereas the output weights can be determined analytically. With the algorithm shown in the diagram of Fig. 8.2, we obtain the smallest training error and the lowest weight norm among the possible solutions through a simple generalized-inverse operation on the output matrix of the hidden layer [45–48].
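
The training procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' GNU/Octave code: it handles a single scalar output, and instead of an explicit generalized inverse it solves the ridge-regularized normal equations (HᵀH + λI)β = Hᵀt, which approximates the minimum-norm least-squares solution for small λ. The names `train_elm`, `predict_elm`, and the toy sine target are hypothetical.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_row(x, weights, biases):
    """Hidden-layer outputs f(w_i . x + b_i) for one sample x."""
    return [sigmoid(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
            for w, b in zip(weights, biases)]


def solve(A, y):
    """Solve A v = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y_i] for row, y_i in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (M[r][n] - tail) / M[r][r]
    return v


def train_elm(X, t, n_hidden, ridge=1e-6, seed=0):
    """Random input weights and biases; output weights solved analytically."""
    rng = random.Random(seed)
    n_in = len(X[0])
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
               for _ in range(n_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    H = [hidden_row(x, weights, biases) for x in X]
    # Regularized normal equations, used here in place of a pseudoinverse.
    HtH = [[sum(row[a] * row[b] for row in H) + (ridge if a == b else 0.0)
            for b in range(n_hidden)] for a in range(n_hidden)]
    Htt = [sum(row[a] * t_j for row, t_j in zip(H, t))
           for a in range(n_hidden)]
    beta = solve(HtH, Htt)
    return weights, biases, beta


def predict_elm(x, model):
    weights, biases, beta = model
    return sum(b_i * h_i
               for b_i, h_i in zip(beta, hidden_row(x, weights, biases)))


# Toy regression problem: learn t = sin(x) on [0, pi] from 30 samples.
X = [[math.pi * k / 29] for k in range(30)]
t = [math.sin(x[0]) for x in X]
model = train_elm(X, t, n_hidden=10)
train_mse = sum((predict_elm(x, model) - t_j) ** 2
                for x, t_j in zip(X, t)) / len(X)
```

Because only the output weights are fitted, and they come from a single linear solve, training involves no iterative weight updates, which is the source of the speed mentioned above.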

8.3.5 Proposed reconstruction method

Here we propose an innovative method for the reconstruction of EIT images. An autoencoder network replicates the input electrical potential data, but with fewer artifacts. Regression ELMs, once trained, map these new input data into sinograms. We then apply the backprojection algorithm to the sinograms approximated by the ELMs to reconstruct the images. This method is summarized graphically in Fig. 8.3.

Figure 8.2 Flow diagram of the calculation of the output weights of an ELM. Elaborated by the authors.

Figure 8.3 Proposed reconstruction method: the measured electrical potential data are replicated using the autoencoder deep network. Then the new data are reconstructed by backprojection from approximate sinograms obtained with ELMs. Elaborated by the authors.

8.3.6 Proposed experiments

We adopted GNU/Octave, a free software package with a mostly Matlab-compatible programming language. We generated 4000 synthetic grayscale images of 128 × 128 pixels. Circles and ellipses were placed inside a circular domain in different quantities, sizes, and gray-level intensities. In addition, 16 equidistant electrodes were positioned at the edge of the domain. To calculate the electrical potentials, the electrical currents were assumed to follow straight-line paths between the electrodes. Therefore, by summing the intensities of the pixels along each path, we obtain the sum of the resistivities (the inverse of the conductivities) between the electrodes. Given the excitation current, we can then calculate the surface electrical potentials.
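
The straight-line model described above can be sketched as follows. The details are assumptions made for illustration: electrodes are placed on the circle inscribed in the image, pixels are sampled by nearest-neighbor rounding, the potential is taken as current times summed resistivity, and the names `electrode_positions` and `path_resistance` as well as the current value are hypothetical.

```python
import math


def electrode_positions(n_electrodes, size):
    """(x, y) pixel coordinates of equidistant electrodes on the circle
    inscribed in a size x size image."""
    center = (size - 1) / 2.0
    return [(center + center * math.cos(2.0 * math.pi * k / n_electrodes),
             center + center * math.sin(2.0 * math.pi * k / n_electrodes))
            for k in range(n_electrodes)]


def path_resistance(image, p0, p1, n_samples):
    """Sum pixel intensities (treated as resistivities) at n_samples
    equally spaced points on the straight line from p0 to p1."""
    total = 0.0
    for s in range(n_samples):
        frac = s / (n_samples - 1)
        col = int(round(p0[0] + frac * (p1[0] - p0[0])))
        row = int(round(p0[1] + frac * (p1[1] - p0[1])))
        total += image[row][col]
    return total


size = 32                                    # small image for illustration
image = [[1.0] * size for _ in range(size)]  # homogeneous domain
electrodes = electrode_positions(16, size)

current = 1e-3  # hypothetical excitation current
# Path between two diametrically opposite electrodes.
r_sum = path_resistance(image, electrodes[0], electrodes[8], n_samples=size)
potential = current * r_sum
```

Repeating this for every electrode pair of interest yields the vector of surface potentials for one synthetic image.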

8.4 Results and discussions

Using the created database, we applied the autoencoder-based deep network to generate the new potential data. We then trained and tested the ELM network with multiple activation functions and kernels to find the one that best models the problem, as well as the right number of hidden-layer neurons. We employed the sigmoidal, sine, and hard-limit activation functions and the radial basis function (RBF) and linear kernels, as described in Fig. 8.4. For the activation functions, we varied the number of neurons from 50 to 500 in steps of 150. We tested all cases using a percentage split of 2/3 and performed 30 tests with each configuration. As a measure of performance, we considered the best configuration to be the one with the lowest percentage error when comparing the training set with the testing set. Some test results are summarized in boxplots. Fig. 8.5 shows the percentage error for each of the three activation functions as a function of the number of neurons. As we can see, the largest number of neurons in our tests, 500, produced the smallest errors. In addition, the best case of each function and kernel is plotted in Fig. 8.6, which allowed us to choose the sigmoidal function as the one that best fits the problem.
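
The experimental grid described above can be enumerated explicitly. The sketch below is only bookkeeping: it assumes that the 2/3 split assigns two-thirds of the 4000 images to training (rounded down), and that the two kernels are tested without a neuron count. The label strings are placeholders.

```python
# Hidden-layer sizes: 50 to 500 in steps of 150.
hidden_sizes = list(range(50, 501, 150))

activations = ["sigmoid", "sine", "hardlim"]  # placeholder labels
kernels = ["rbf", "linear"]                   # placeholder labels
n_runs = 30                                   # repetitions per configuration

n_images = 4000
n_train = (2 * n_images) // 3                 # assumed: 2/3 used for training
n_test = n_images - n_train

# One configuration per (activation, size) pair, plus one per kernel.
configs = [(a, n) for a in activations for n in hidden_sizes]
configs += [(k, None) for k in kernels]
total_runs = len(configs) * n_runs
```

Under these assumptions the neuron counts are 50, 200, 350, and 500, giving 14 configurations in total.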

Electrical impedance tomography image reconstruction based on autoencoders and extreme learning machines

Figure 8.4 Functions and kernels tested in ELMs of regression type.

Figure 8.5 Percentage errors calculated with a different number of hidden neurons for each activation function.

Finally, we trained 16 ELMs corresponding to a 16-electrode EIT system placed on a circular domain. Based on the previous results, we used the sigmoidal activation function, 500 neurons in the hidden layer, and all 4000 images of the synthetic database we previously generated. The sinograms could then be reconstructed by backprojection. These results are depicted in Fig. 8.7. For reference, Fig. 8.7 also shows reconstructions from the original sinograms, that is, those obtained directly from the images of the database (b), and from the sinograms generated only by the trained ELMs (c). This makes it easier to understand and analyze the reconstruction generated with the proposed method, in which the data passed through both autoencoders and ELMs (d). A qualitative comparison shows great similarity between reconstructions (b) and (c), with slightly more blur in (c), indicating that the ELMs were effective in generating the sinograms. In other words, the results are fair and similar to the reconstructions obtained by applying the backprojection algorithm directly, as if this were a classic tomographic reconstruction problem. On the other hand, the reconstructions obtained with the proposed method (d) were not able to identify the objects inside the domain and also presented different grayscale intensities compared with the original images (a) and reconstructions (b). Despite this, the edges of the circular domain were well reconstructed, with almost imperceptible distortions.


Juliana Carneiro Gomes et al.

Figure 8.6 Comparing different activation functions and kernels performance in the data training with ELM.

Figure 8.7 Reconstructions of EIT images using the backprojection algorithm. PSNR and SSIM metrics were used for comparison.


To evaluate the reconstructed images quantitatively, the PSNR and the SSIM were calculated. The PSNR measures the quality of a reconstructed image, and the loss of information, relative to a gold standard image, quantifying how faithful the first image is to the second [49–51]. The PSNR can be determined by Eqs. (8.10) and (8.11):

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}_X^2}{\mathrm{MSE}}\right) = 20 \log_{10}\left(\frac{\mathrm{MAX}_X}{\sqrt{\mathrm{MSE}}}\right) \tag{8.10}$$

and

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[X(i,j) - Y(i,j)\right]^2. \tag{8.11}$$

X and Y represent the images to be compared, while $\mathrm{MAX}_X$ represents the highest pixel value in the reference image X. The SSIM, in contrast, considers structural information by analyzing the relation between spatially close pixels. Thus, the SSIM assists in detecting the interior objects of the domain under study, as well as the edge of that domain [52]. The SSIM between two images x and y is given by Eq. (8.12):

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{8.12}$$

where $\mu_x$ is the mean of image x, $\mu_y$ the mean of image y, $\sigma_x^2$ and $\sigma_y^2$ the variances, and $\sigma_{xy}$ the covariance between the images. The variables $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ depend on the dynamic range of the pixel values L and on $k_1$ and $k_2$, chosen by default as 0.01 and 0.03 [51]. The PSNR values in decibels (dB) in Fig. 8.8 confirm the visual inspection already discussed. We observe higher-quality reconstructions when using only the ELMs to train on the original data, indicating that the autoencoder-based deep network could not replicate the input satisfactorily. Furthermore, an SSIM close to one indicates a higher correlation between images, not only in absolute values but

Figure 8.8 Boxplots of PSNR between reconstructions. Panels: "Comparison between reconstructions and original images" and "Comparison between reconstructions"; groups: Autoencoders + ELM and ELM; vertical axis: PSNR (dB).


Figure 8.9 Boxplots of SSIM between reconstructions. Panels: "Comparison between reconstructions and original images" and "Comparison between reconstructions"; groups: Autoencoders + ELM and ELM; vertical axis: SSIM.

also in the structure of the objects present in the image. Contrary to expectations, however, the SSIM with respect to the original database images was slightly higher for reconstructions with the proposed method than for those obtained with ELMs alone. These numbers probably reflect the similarity between the reconstructed circular domains. Finally, we believe that finer tuning of the parameters of the autoencoder-based deep network is necessary to achieve reconstructions with higher fidelity. In addition, post-processing techniques can be applied to soften distortions and improve image quality. See Fig. 8.9.
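The two metrics used in this section can be sketched in a few lines of NumPy. The sketch assumes the standard PSNR form, 10 log10(MAX²/MSE), with MAX taken from the reference image, and it evaluates the SSIM expression once over the whole image (a single global window), whereas common implementations average it over local windows. The function names and the default dynamic range L = 255 are illustrative assumptions rather than details given in the chapter.

```python
import numpy as np


def mse(x, y):
    """Mean squared error between two images of equal shape."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((x - y) ** 2))


def psnr(x, y):
    """PSNR in dB; x is the reference image that supplies MAX."""
    err = mse(x, y)
    if err == 0.0:
        return float("inf")  # identical images
    max_x = float(np.max(x))
    return 10.0 * np.log10(max_x ** 2 / err)


def global_ssim(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-image statistics (single window)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


# Toy example: a reference image and a copy with a constant offset of 5.
reference = np.array([[0.0, 100.0], [200.0, 255.0]])
shifted = reference + 5.0
print(psnr(reference, shifted), global_ssim(reference, shifted))
```

An identical pair of images gives an infinite PSNR and an SSIM of exactly one, which is a quick sanity check for any implementation of these formulas.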

8.5 Conclusion

EIT is a promising technique that may contribute to more precise medical diagnoses when combined with other imaging techniques. A fast and efficient reconstruction method is therefore necessary. We tested an innovative approach that is able to reduce the time of EIT reconstruction, mainly due to the high speed of the ELMs, but that did not generate reasonable images when an autoencoder step was incorporated. Despite this, the method should not be discarded, as it was tested only on a specific synthetic database and with selected parameters. Moreover, the use of ANNs, in particular ELMs for regression, together with reconstruction by the backprojection algorithm, has shown interesting potential for obtaining EIT images. In future tests, post-processing techniques may be applied to improve image quality. Furthermore, other fast ANNs can be tested for sinogram approximation and for filtering the potentials data, especially deep random-weighted neural architectures. In addition, an open-source computational tool can be implemented to reconstruct EIT images using the proposed approach.


Acknowledgments The authors are grateful to the Brazilian research agencies CAPES, CNPq, and Facepe for the partial financial support of this research.

References
[1] N. Dey, A.S. Ashour, S. Borra, Classification in BioApps: Automation of Decision Making, vol. 26, Springer, 2017.
[2] K. Lan, D.-T. Wang, S. Fong, L.-S. Liu, K.K. Wong, N. Dey, A survey of data mining and deep learning in bioinformatics, J. Med. Syst. 42 (8) (2018) 139.
[3] D.C.C. Barra, E.R.Pd Nascimento, Jd.J. Martins, G.L. Albuquerque, A.L. Erdmann, Evolução histórica e impacto da tecnologia na área da saúde e da enfermagem, Rev. Eletrônica Enferm. 8 (03) (2006) 422–430.
[4] D. Banta, The development of health technology assessment, Health Policy 63 (2) (2003) 121–132.
[5] R.R. Ribeiro, A.R. Feitosa, R.E. de Souza, W.P. dos Santos, A modified differential evolution algorithm for the reconstruction of electrical impedance tomography images, 5th ISSNIP-IEEE Biosignals and Biorobotics Conference (2014): Biosignals and Robotics for Better and Safer Living (BRC), IEEE, 2014, pp. 1–6.
[6] A.R. Feitosa, R.R. Ribeiro, V.A. Barbosa, R.E. de Souza, W.P. dos Santos, Reconstruction of electrical impedance tomography images using chaotic ring-topology particle swarm optimization and non-blind search, 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2014, pp. 2618–2623.
[7] R.R. Ribeiro, A.R. Feitosa, R.E. de Souza, W.P. dos Santos, Reconstruction of electrical impedance tomography images using chaotic self-adaptive ring-topology differential evolution and genetic algorithms, 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2014, pp. 2605–2610.
[8] W.P. dos Santos, R.E. de Souza, R.R. Ribeiro, A.R.S. Feitosa, V.A. de Freitas, D.E. Ribeiro, et al., Image reconstruction algorithms for electrical impedance tomography based on swarm intelligence, in: Y. Tan (Ed.), Swarm Intelligence: Applications, Vol. 3, Control, Robotics and Sensors, 2018, p. 31.
[9] V.A.d.F. Barbosa, Reconstrução de imagens de tomografia por impedância elétrica utilizando busca por cardumes de peixes e evolução diferencial, Master’s thesis, Universidade Federal de Pernambuco, 2017.
[10] J.N. Tehrani, C. Jin, A. McEwan, A. van Schaik, A comparison between compressed sensing algorithms in electrical impedance tomography, Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE, IEEE, 2010, pp. 3109–3112.
[11] T.K. Bera, S.K. Biswas, K. Rajan, J. Nagaraju, Improving image quality in electrical impedance tomography (EIT) using projection error propagation-based regularization (PEPR) technique: a simulation study, J. Electr. Bioimpedance 2 (1) (2011) 2–12.
[12] S.P. Kumar, N. Sriraam, P. Benakop, B. Jinaga, Reconstruction of brain electrical impedance tomography images using particle swarm optimization, Industrial and Information Systems (ICIIS), 2010 International Conference on, IEEE, 2010, pp. 339–342.
[13] G. Bouchette, P. Church, J.E. Mcfee, A. Adler, Imaging of compact objects buried in underwater sediments using electrical impedance tomography, IEEE Trans. Geosci. Remote. Sens. 52 (2) (2014) 1407–1417.
[14] J. Jordana, M. Gasulla, R. Pallás-Areny, Leakage detection in buried pipes by electrical resistance imaging, in: 1st World Congress on Industrial Process Tomography, April 14–17, 1999, Buxton, England, United Kingdom, 1999, pp. 28–34.
[15] S.F. Filipowicz, T. Rymarczyk, Measurement methods and image reconstruction in electrical impedance tomography, Prz. Elektrotech. 88 (6) (2012) 247–250.
[16] D. Pak, N. Rozhkova, M. Kireeva, M. Ermoshchenkova, A. Nazarov, D. Fomin, et al., Diagnosis of breast cancer using electrical impedance tomography, Biomed. Eng. 46 (4) (2012) 154–157.


[17] Y. Wan, A. Borsic, J. Heaney, J. Seigne, A. Schned, M. Baker, et al., Transrectal electrical impedance tomography of the prostate: spatially coregistered pathological findings for prostate cancer detection, Med. Phys. 40 (6) 063102.
[18] S.H. Alves, M.B. Amato, R.M. Terra, F.S. Vargas, P. Caruso, Lung reaeration and reventilation after aspiration of pleural effusions. A study using electrical impedance tomography, Ann. Am. Thorac. Soc. 11 (2) (2014) 186–191.
[19] A. Adler, J.H. Arnold, R. Bayford, A. Borsic, B. Brown, P. Dixon, et al., GREIT: a unified approach to 2D linear EIT reconstruction of lung images, Physiol. Meas. 30 (6) (2009) S35.
[20] M. Dai, B. Li, S. Hu, C. Xu, B. Yang, J. Li, et al., In vivo imaging of twist drill drainage for subdural hematoma: a clinical feasibility study on electrical impedance tomography for measuring intracranial bleeding in humans, PLoS One 8 (1) (2013) e55020.
[21] T. Li, D. Isaacson, J.C. Newell, G.J. Saulnier, Adaptive techniques in electrical impedance tomography reconstruction, Physiol. Meas. 35 (6) (2014) 1111.
[22] L. Miao, Y. Ma, J. Wang, ROI-based image reconstruction of electrical impedance tomography used to detect regional conductivity variation, IEEE Trans. Instrum. Meas. 63 (12) (2014) 2903–2910.
[23] Q. Wang, Z. Lian, J. Wang, Q. Chen, Y. Sun, X. Li, et al., Accelerated reconstruction of electrical impedance tomography images via patch based sparse representation, Rev. Sci. Instrum. 87 (11) (2016) 114707.
[24] V.A. Barbosa, R.R. Ribeiro, A.R. Feitosa, V.L. Silva, A.D. Rocha, R.C. Freitas, et al., Reconstruction of electrical impedance tomography using fish school search, non-blind search, and genetic algorithm, Int. J. Swarm Intell. Res. (IJSIR) 8 (2) (2017) 17–33.
[25] R.C. de Freitas, D.E. Ribeiro, V.L.B.A. da Silva, V.A. de Freitas Barbosa, A.R.S. Feitosa, R.R. Ribeiro, et al., Electrical impedance tomography using evolutionary computing: a review, in: Bio-Inspired Computing for Image and Video Processing, Chapman and Hall/CRC, 2018, pp. 93–128.
[26] S. Martin, C.T. Choi, Nonlinear electrical impedance tomography reconstruction using artificial neural networks and particle swarm optimization, IEEE Trans. Magn. 52 (3) (2016) 1–4.
[27] X. Fernández-Fuentes, D. Mera, A. Gómez, I. Vidal-Franco, Towards a fast and accurate EIT inverse problem solver: a machine learning approach, Electronics 7 (12) (2018) 422.
[28] W.H. Hayt Jr, J.E. Kemmerly, S.M. Durbin, Análise de Circuitos em Engenharia-8, AMGH Editora, 2014.
[29] V.J. Bolfe, S.I. Ribas, M.I.L. Montebelo, R.R.J. Guirro, Comportamento da impedância elétrica dos tecidos biológicos durante estimulação elétrica transcutânea, Rev. Bras. Fisioter. 11 (2) (2007) 153–159.
[30] M. Eickemberg, C.C. de Oliveira, R.A.K. Carneiro, L.R. Sampaio, Bioimpedância elétrica e sua aplicação em avaliação nutricional, Rev. Nutr. 24 (6) (2011) 873–882.
[31] R.R. Ribeiro, A.R. Feitosa, R.E. de Souza, W.P. dos Santos, Reconstruction of electrical impedance tomography images using genetic algorithms and non-blind search, Biomedical Imaging (ISBI), 2014 IEEE 11th International Symposium on, IEEE, 2014, pp. 153–156.
[32] A.R. Feitosa, R.R. Ribeiro, V.A. Barbosa, R.E. de Souza, W.P. dos Santos, Reconstruction of electrical impedance tomography images using particle swarm optimization, genetic algorithms and nonblind search, 5th ISSNIP-IEEE Biosignals and Biorobotics Conference (2014): Biosignals and Robotics for Better and Safer Living (BRC), IEEE, 2014, pp. 1–6.
[33] S.J. Hamilton, A. Hauptmann, Deep D-bar: real time electrical impedance tomography imaging with deep neural networks, IEEE Trans. Med. Imaging 37 2367–2377.
[34] W.P. dos Santos, R.E. de Souza, R.C. de Freitas, D.E. Ribeiro, V.L.B.A. da Silva, V.A. de Freitas Barbosa, et al., Hybrid metaheuristics applied to image reconstruction for an electrical impedance tomography prototype, in: S. Bhattacharyya (Ed.), Hybrid Metaheuristics for Image Analysis, Springer, Cham, 2018, pp. 209–251.
[35] R. Ogava, N. Soares, J. Gomes, V. Barbosa, R. Ribeiro, E. de Souza, et al., Algoritmo de evolução diferencial hibridizado e simulated annealing aplicados a tomografia por impedância elétrica, in: I Symposium of Innovation in Biomedical Engineering (SABIO), 2017.
[36] J. Hsieh, Computed Tomography: Principles, Design, Artifacts, and Recent Advances, SPIE, Bellingham, WA, 2009.


[37] H. Wang, G. Xu, S. Zhang, W. Yan, An implementation of generalized back projection algorithm for the 2-D anisotropic EIT problem, IEEE Trans. Magn. 51 (3) (2015) 1–4.
[38] R. Guardo, C. Boulay, B. Murray, M. Bertrand, An experimental study in electrical impedance tomography using backprojection reconstruction, IEEE Trans. Biomed. Eng. 38 (7) (1991) 617–627.
[39] J. Maria, J. Amaro, G. Falcao, L.A. Alexandre, Stacked autoencoders using low-power accelerated architectures for object recognition in autonomous systems, Neural Process. Lett. 43 (2) (2016) 445–458.
[40] Q. Xu, L. Zhang, The effect of different hidden unit number of sparse autoencoder, The 27th Chinese Control and Decision Conference (2015 CCDC), IEEE, 2015, pp. 2464–2467.
[41] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 1096–1103.
[42] F.R. Cordeiro, S.M. Lima, A.G. Silva-Filho, W.P. dos Santos, Segmentation of mammography by applying extreme learning machine in tumor detection, International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2012, pp. 92–100.
[43] W.W. Azevedo, S.M. Lima, I.M. Fernandes, A.D. Rocha, F.R. Cordeiro, A.G. da Silva-Filho, et al., Fuzzy morphological extreme learning machines to detect and classify masses in mammograms, 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2015, pp. 1–8.
[44] M.Ad Santana, J.M.S. Pereira, F.Ld Silva, N.Md Lima, F.Nd Sousa, G.M.Sd Arruda, et al., Breast cancer diagnosis based on mammary thermography and extreme learning machines, Res. Biomed. Eng. 34 (2018) 45–53.
[45] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. B Cybern. 42 (2) (2012) 513–529.
[46] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1–3) (2006) 489–501.
[47] Q. Li, T. Zhao, L. Zhang, W. Sun, X. Zhao, Ferrography wear particles image recognition based on extreme learning machine, J. Electr. Comput. Eng. (2017).
[48] J. Lei, H. Mu, Q. Liu, X. Wang, S. Liu, Data-driven reconstruction method for electrical capacitance tomography, Neurocomputing 273 (2018) 333–345.
[49] F.A. Fardo, V.H. Conforto, F.C. de Oliveira, P.S. Rodrigues, A formal evaluation of PSNR as quality measurement parameter for image segmentation algorithms, arXiv preprint arXiv:1605.07116, 2016.
[50] D. Salomon, Data Compression: The Complete Reference, third ed., Springer Science & Business Media, New York, 2004.
[51] A.J. Zimbico, Análise comparativa de técnicas de compressão aplicadas a imagens médicas usando ultrassom, Master’s thesis, Universidade Tecnológica Federal do Paraná, 2014.
[52] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.


CHAPTER NINE

Crop disease classification using deep learning approach: an overview and a case study
Krishnaswamy Rangarajan Aravind1, Prabhakar Maheswari1, Purushothaman Raja1 and Cezary Szczepański2
1 School of Mechanical Engineering, SASTRA Deemed University, Thanjavur, India
2 Lukasiewicz Research Network – Institute of Aviation, Warsaw, Poland

Deep Learning for Data Analytics. DOI: https://doi.org/10.1016/B978-0-12-819764-6.00010-7
© 2020 Elsevier Inc. All rights reserved.

9.1 Introduction

The explosive growth in deep learning-based models has encouraged many aspiring researchers to pursue breakthrough research on real-time applications for a variety of societally relevant areas (such as autonomous navigation and the diagnosis of crop diseases) [1–3]. Among the many real-world scenarios, prediction in the agricultural sector (from sowing to yield estimation) is a long-running problem, because the environment is challenging and highly dynamic. Recent growth in machine learning technologies, along with improvements in computing power, has facilitated their application to prediction and classification problems in agriculture. Active agricultural research using machine learning approaches is currently being conducted on classifying weeds from crops, crop mapping, crop characteristics, crop species classification, disease classification, and so on [1,4,5]. Among these applications, the one that inspired the authors was the diagnosis of crop diseases. It is an important problem because diagnoses by human experts are prone to cognitive errors and other factors. In many cases, confirming a disease involves a laboratory analysis, which is a time-consuming and complicated process. The availability of plant pathologists and laboratory facilities is also limited considering the number of farmers and the amount of land cultivated. As precision agriculture has gained momentum, a real-time prediction tool is required for automating the targeted application of pesticides [6,7].

Traditionally, shallow machine learning algorithms such as naïve Bayes, support vector machine (SVM), linear discriminant analysis (LDA), and artificial neural networks (ANNs) have been very popular for many decades. These approaches involve extracting features that quantitatively capture the regions of the image that discriminate one class from the others. Identifying the best features for classification requires expert knowledge and a strong mathematical background [8]. In recent years, deep learning architectures have become popular because the features are learned automatically. Typically, an ANN consists of an input layer, hidden layers, and an output layer. Each layer has a number of neurons (the basic processing units) connected to each other, and the number of hidden layers is limited to one or two. In a deep learning algorithm, a larger number of layers of neurons are stacked in a specific architecture. There are different categories of deep learning algorithms, of which the convolutional neural network (CNN) is widely used for image classification [9]. The literature on applying CNNs to the classification of crop diseases is discussed in the following section.

9.1.1 Literature survey

Many standard CNN-based architectures, such as AlexNet, Visual Geometry Group16 (VGG16), VGG19, GoogLeNet, Residual Network (ResNet), and SqueezeNet, have been applied to disease detection, as shown in Table 9.1. According to Table 9.1, these standard architectures, which were developed, trained, and tested with the ImageNet data set (1000 categories of common objects), have demonstrated accuracies of greater than 95% in most studies. With the advent of these deep CNN (DCNN) models, it is possible to apply the architectures in various sectors; the key remaining issue is the requirement for a data set. To counteract this problem, Hughes and Salathé [20] created an open data set repository known as PlantVillage, with 38 different classes covering different diseases and consisting of images of isolated leaves acquired under controlled conditions. It is one of the largest data sets, with more than 50,000 images, and it has been the most widely explored with various architectures, as is evident in Table 9.1. The PlantVillage data set has since been enlarged by including more disease classes [1]. An important problem in creating such a large data set is the correct labeling of each image, which has to be done manually by an expert. The annotation of images by humans is strongly affected by their visual cues, experience, and psychological factors. Accumulated labeling errors significantly affect the learning process and reduce the classification accuracy [13]. Many authors have also created their own data sets and applied these standard architectures to evaluate their performance in discriminating diseases [11,14,16,21]. Two important methodologies have been followed when using these architectures. The first is transfer learning, in which a model trained on the ImageNet data set is reused for disease classification. Training is then repeated with the disease data set, which has been shown to improve accuracy. The second is to train the architecture solely on the disease data set,

Table 9.1 Literature on the application of standard deep learning models in disease diagnosis.

| S. No. | Authors | Architecture | Data set | Description |
|---|---|---|---|---|
| 1 | Picon et al. (Article in press) | ResNet50 | Own data set (8178 images) | Wheat diseases (Septoria, Tan Spot, and Rust) were classified using ResNet50 with an accuracy of 96%. |
| 2 | Too et al. [10] | VGG16, Inception V4, ResNet 50, 101, 152, and DenseNet121 | PlantVillage | Disease detection using various CNN architectures was explored; DenseNet121 gave the best accuracy of 99.75%. |
| 3 | Cruz et al. [11] | AlexNet, GoogLeNet, Inception v3, ResNet-50, ResNet-101, and SqueezeNet | Own data set | Sensitivity and specificity of 98.96% and 99.40% were obtained, respectively. 35.97% and 9.88% positive predictive value (PPV) for detection of grapevine yellow disease. |
| 4 | Barbedo [12,13] | GoogLeNet | Own data set | Analyzed the effectiveness of deep learning techniques based on variation in the size and variety of data sets. |
| 5 | Ferentinos [14] | AlexNet, GoogLeNet, and VGG | Open database (87,848 images) | Disease detection in 25 different plants using various CNN architectures; the best accuracy achieved was 99.53%. |
| 6 | Wang et al. [15] | VGG16, VGG19, Inception V3, and ResNet50 | PlantVillage | Disease severity estimation in diseased apple leaves using various CNNs; VGG16 obtained the best accuracy of 90.4%. |
| 7 | Amara et al. [16] | LeNet | Own data set (3700 images) | Disease detection using the LeNet architecture; compared the results for different combinations of training and test sets. |
| 8 | Shijie et al. [17] | VGG16 with SVM and fine-tuned VGG16 | PlantVillage | Tomato diseases and pests were identified using VGG16 with an accuracy of 89%. |
| 9 | Durmus et al. [18] | AlexNet and SqueezeNet | PlantVillage | Tomato disease detection using AlexNet and SqueezeNet; accuracy of 95.65% with AlexNet and 94.3% with SqueezeNet. |
| 10 | Brahimi et al. [8] | AlexNet and GoogLeNet | PlantVillage data set (14,828 images) | Tomato leaf disease detection using CNN architectures; the accuracy obtained was 99.18%. |
| 11 | Mohanty et al. [19] | AlexNet and GoogLeNet | PlantVillage | Disease detection using AlexNet and GoogLeNet; compared the results for different combinations of training and testing sets. The overall accuracy obtained was 99.35%. |


which has shown a marginal decrease in accuracy [19]. In a third method, features are extracted from the CNN and passed to a classifier such as an SVM [17]. These standard architectures have been shown to produce reasonable accuracy even without fine-tuning the hyperparameters, and with fine-tuning the accuracy increases significantly [22]. Developing new architectures is a significant challenge because determining the optimal structure of each layer is time-consuming, and the resulting network may not be suitable for discriminating other disease classes or for other classification problems [8]. In spite of these limitations, several authors have developed their own CNN-based architectures and fine-tuned them for disease classification, as shown in Table 9.2. In most cases, these custom CNNs have fewer convolution layers than the standard architectures. They have been fine-tuned and applied to the authors' own data sets or to a larger data set such as PlantVillage. These shallower CNN models raise the possibility of deployment on currently available commercial smartphones with lower processing power and random access memory (RAM). Deep architectures require more processing power and memory and are therefore difficult to deploy on smartphones. A few studies have considered these factors and have developed compact architectures such as SqueezeNet and MobileNet [29,30]. These architectures have not been widely explored for disease classification. Recently, SqueezeNet has been used in Refs. [11,18] with the authors' own data set and with the tomato subcategory of the PlantVillage data set, respectively. In our study, SqueezeNet is explored with the complete PlantVillage data set. In addition, this chapter provides an overview of the various architectures with a brief description of how the individual layers work. The results obtained using SqueezeNet are presented, and the shortcomings of the method are discussed, in the Results and Discussion section.

9.2 Overview of the convolutional neural network architectures

A typical CNN architecture consists of several processing layers with different mathematical functions for mapping the input to the output, as shown in Fig. 9.1 [31]. It consists predominantly of convolution layers, each followed immediately by a rectified linear unit (ReLU), which is a nonlinear activation function. Previously, the tan-sigmoid function was widely used, but it saturates for inputs of large magnitude [32]. Variants of the ReLU activation function, such as leaky ReLU and clipped ReLU, have also been considered recently [33,34].


Table 9.2 Literature on the application of created CNN for disease diagnosis.

| S. No. | Authors | Architecture | Data set | Description |
|---|---|---|---|---|
| 1 | Chen et al. [23] | CNN | Own data set (3810 images) | A CNN named LeafNet was developed to identify plant diseases and achieved an accuracy of 90.16%. |
| 2 | Singh et al. [24] | CNN | Own data set (1070 images) | Mango leaf disease was classified with 97.13% accuracy. |
| 3 | Sibiya and Sumbwanayambe [25] | CNN | Own data set | Maize crop disease was classified with 92.85% accuracy. |
| 4 | Ma et al. [26] | CNN | Own data set (14,208 images) | Cucumber leaf disease was detected using a deep CNN architecture with an accuracy of 93.4%. |
| 5 | Dechant et al. [27] | CNN | Own data set (1796 images) | A CNN architecture was developed to identify northern leaf blight in a maize crop and obtained an accuracy of 96.7%. |
| 6 | Lu et al. [28] | CNN | Own data set | Common rice diseases were identified using a CNN architecture with an accuracy of 95.48%. |

Figure 9.1 A typical CNN for classification.

In the convolution layer, a kernel of a specific dimension combines its weights with the corresponding pixel values according to Eq. (9.1):

$$x_l = b + \sum \left(k \cdot x_{l-1}\right) \tag{9.1}$$

The activation map $x_l$ is obtained by applying the kernel k over the activation map or image from the previous layer, $x_{l-1}$. The operation sums the products of the weight values and the corresponding pixel or activation values and then adds the bias value b. Each convolution layer has several such kernels, which recognize features of different frequencies. The movement of the kernel is controlled by the stride value, which specifies whether it skips positions between pixels or processes every location. The stride also determines the dimension of the output activation map. In some cases, a single layer of zeros is added around the outer margin of the map to maintain the required dimension; this is known as padding. Stride and padding are hyperparameters that form an integral part of the architecture. In many architectures, the dimension of the map has to be reduced to decrease the amount of computation and the number of parameters. This is done with a pooling layer, which can use max pooling or average pooling. In max pooling, the largest value of the activation map within the area covered by the filter is placed on the output activation map. In average pooling, the average of those values is placed on the output map. This process is repeated over all positions of the activation map. If an output map of size S × S is required and the input map is W × W, then the filter dimension f, stride s, and padding p can be determined using Eq. (9.2):

$$S = \frac{W - f + 2p}{s} + 1 \tag{9.2}$$
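The convolution, pooling, and output-size relations described above can be checked with a small NumPy sketch. It assumes a square single-channel input and a square kernel, uses explicit loops for readability rather than speed, and applies floor division when the stride does not divide the map evenly. The function names `output_size`, `conv2d`, and `max_pool2d` are illustrative and do not come from any particular framework.

```python
import numpy as np


def output_size(w, f, s=1, p=0):
    """Eq. (9.2): S = (W - f + 2p) / s + 1, with floor division."""
    return (w - f + 2 * p) // s + 1


def conv2d(x, k, b=0.0, stride=1, padding=0):
    """Eq. (9.1): bias plus the sum of kernel weights times the covered values."""
    f = k.shape[0]
    s_out = output_size(x.shape[0], f, stride, padding)
    x = np.pad(x, padding)  # zero padding on every side
    out = np.empty((s_out, s_out))
    for i in range(s_out):
        for j in range(s_out):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = b + np.sum(k * patch)
    return out


def max_pool2d(x, f=2, stride=2):
    """Keep the largest value inside each f x f window."""
    s_out = output_size(x.shape[0], f, stride)
    out = np.empty((s_out, s_out))
    for i in range(s_out):
        for j in range(s_out):
            out[i, j] = x[i * stride:i * stride + f, j * stride:j * stride + f].max()
    return out


x = np.arange(25.0).reshape(5, 5)   # toy 5 x 5 input map
k = np.ones((3, 3))                 # toy 3 x 3 kernel

same = conv2d(x, k, padding=1)      # padding keeps the 5 x 5 size
strided = conv2d(x, k, stride=2)    # stride 2 shrinks the map to 2 x 2
pooled = max_pool2d(x[:4, :4])      # 4 x 4 map pooled to 2 x 2

print(same.shape, strided.shape, pooled)
```

With W = 5, f = 3, s = 1, and p = 1, Eq. (9.2) gives S = 5, which is why a single layer of zero padding preserves the map size for a 3 × 3 kernel.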

The ReLU is an activation function that replaces all negative values with zero and introduces nonlinearity. When the outputs of these convolution and ReLU layers are visualized, they consist of bright and dark regions: the bright areas correspond to higher excitation of neurons, whereas the dark areas show suppressed activation. In a few architectures, such as AlexNet and ResNet, cross-channel or batch normalization is performed. In this operation, values are normalized within a specified filter size and across channels, which further excites the already excited neurons. This has been reported to improve classification accuracy [32,35]. Several layers of convolution, ReLU, optional pooling, and cross-channel normalization are stacked in series, with appropriate hyperparameter configurations, to improve classification accuracy. The features learned in each layer are combined in the fully connected layers, which are followed by the softmax and classification layers. In general, deeper architectures such as AlexNet and VGG16 have three fully connected layers, with 4096 neurons in each of the first two and a number of neurons in the last equal to the number of classes. The first two fully connected layers are each followed by a ReLU and a dropout layer. An important problem in developing a new architecture is choosing the depth, that is, the number of these layers. AlexNet has only five convolution layers, with filters of different dimensions. The size of the filters significantly affects the computing resources needed to train the network. With a lower number of convolution layers, the discrimination

Crop disease classification using deep learning approach: an overview and a case study

Figure 9.2 A sample ResNet block.

ability also decreases. This has been addressed in the later architectures of VGG16 and VGG19, where more convolution layers are stacked [36]. In these architectures, convolution filters of size 3 × 3, smaller than those of AlexNet, yielded fewer parameters and better learning, owing to the larger number of small filters and the additional ReLU layers between convolution layers. Because of this, more layers can be stacked, so these versions can learn more complex features than their predecessors. Stacking more layers, however, leads to the problem of vanishing or exploding gradients, which degrades the convergence of the learning process. An alternative to the simple architectures of this series is ResNet, where a shortcut connection skips a layer in the main architecture and is added to the result from the original path (as shown in Fig. 9.2) [35]. This results in a residual mapping, which is easier to optimize than the original, unreferenced mapping. The different versions of ResNet are 18, 50, 101, 152, and 201. ResNet has produced better results on the ImageNet data set. DenseNet is another architecture that is similar to ResNet but has a different kind of shortcut connection and is deeper, with more layers [37]. The main difference is that instead of adding features to those of the preceding layers, concatenation is performed, which increases the depth dimension of the activation map. In addition, the number of filters required per layer is lower, which decreases the number of activation maps and learnable parameters. In another approach, the width of the network was increased instead of the depth, which resulted in one of the popular architectures, known as GoogLeNet [38]. It consists of a series of inception modules stacked together in the architecture.
Each inception module has processing layers, such as convolution and maxpooling layers, arranged in parallel, as shown in Fig. 9.3. This chapter focuses exclusively on SqueezeNet, which is simple and uses fewer layers. The architecture of SqueezeNet is discussed in the following section.
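The difference between the ResNet and DenseNet shortcut connections described above can be sketched with NumPy arrays. This is only an illustration under stated assumptions: the function names are hypothetical, the activation map is laid out as (height, width, channels), and `fx` is a stand-in for the output of the convolution layers on the main path rather than a real convolution.

```python
import numpy as np

def residual_merge(x, fx):
    """ResNet-style shortcut: add the block output to its input."""
    return fx + x

def dense_merge(x, fx):
    """DenseNet-style shortcut: concatenate along the channel axis."""
    return np.concatenate([x, fx], axis=-1)

# Hypothetical activation map and a placeholder for F(x).
x = np.ones((8, 8, 16))
fx = np.full((8, 8, 16), 0.5)

print(residual_merge(x, fx).shape)  # (8, 8, 16): depth unchanged
print(dense_merge(x, fx).shape)     # (8, 8, 32): depth grows
```

Addition requires both paths to have the same shape, whereas concatenation only requires matching spatial dimensions and increases the channel depth with every block.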


Krishnaswamy Rangarajan Aravind et al.

Figure 9.3 A sample inception module V1.

9.3 Architecture of SqueezeNet

All of the networks discussed previously achieve an accuracy greater than 95% for classifying crop diseases using the PlantVillage data set. However, all of these networks have a large number of parameters and are not readily exportable to a mobile device for disease classification applications. Hence an efficient CNN architecture with fewer parameters and comparable accuracy was still needed. To fulfill this requirement, an efficient CNN architecture called SqueezeNet was developed by Iandola et al. [29]. SqueezeNet and AlexNet do not have the same architecture, but SqueezeNet achieves the accuracy of AlexNet on ImageNet with fewer parameters. The basic architecture of SqueezeNet comprises a module called a fire module, which has two layers: a squeeze layer and an expand layer. The structure of a fire module is shown in Fig. 9.4. In this module, the number of filters may vary as determined by the user. In this example, the squeeze layer consists of five 1 × 1 filters and the expand layer has four 1 × 1 filters and four 3 × 3 filters. The main idea behind SqueezeNet is to reduce the depth by replacing the 3 × 3 filters in the convolutional layers with 1 × 1 filters; downsampling is applied late in the network so that larger feature maps are maintained. The fire modules that form SqueezeNet are arranged in a stack as shown in Fig. 9.5. The initial layers of the architecture consist of a typical convolution layer with 64 filters followed by a maxpooling layer. A total of eight fire modules are arranged in a stack, and the number of filters or channels increases in the deeper fire modules. The outputs of the convolution operations performed in the expand layer are concatenated at the end of each fire module.
After the eighth fire module, a dropout layer with a probability of 0.5 is placed, so that at each run 50% of the neurons are disconnected randomly and learning continues with the resulting architecture. The last convolution layer should be reconfigured according to the


Figure 9.4 Fire module of SqueezeNet.

Figure 9.5 General architecture of SqueezeNet.

number of classes. It is followed by ReLU and an average pooling layer. An important advantage of SqueezeNet is the absence of fully connected layers, which consume large memory resources because they contain many learnable parameters. The architecture ends with softmax and classification layers, a pattern similar to the other architectures. The implementation of the architecture, along with the disease data set and the hyperparameter configuration, is discussed in the following section.
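To make the parameter saving of a fire module concrete, the sketch below counts learnable parameters (weights plus one bias per filter) for a fire module and for a plain 3 × 3 convolution that produces the same number of output channels. The helper names are hypothetical. The filter counts (16 squeeze, 64 + 64 expand) follow the first fire module listed in Table 9.4, and the 64 input channels are an assumption based on the 64 filters of the preceding convolution layer.

```python
def conv_params(c_in, n_filters, k):
    """Weights plus one bias per filter for a k x k convolution."""
    return k * k * c_in * n_filters + n_filters

def fire_params(c_in, squeeze, expand1x1, expand3x3):
    """Parameters of a fire module: squeeze layer feeding two expand branches."""
    s = conv_params(c_in, squeeze, 1)
    e1 = conv_params(squeeze, expand1x1, 1)   # expand branches see only
    e3 = conv_params(squeeze, expand3x3, 3)   # the squeezed channels
    return s + e1 + e3

# Assumed input depth of 64 channels; output depth is 64 + 64 = 128
# because the two expand outputs are concatenated.
fire = fire_params(c_in=64, squeeze=16, expand1x1=64, expand3x3=64)
plain = conv_params(c_in=64, n_filters=128, k=3)

print(fire)   # 11408
print(plain)  # 73856
```

Under these assumptions the fire module needs roughly one sixth of the parameters of the plain 3 × 3 layer, because the expensive 3 × 3 filters operate on 16 squeezed channels instead of 64.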


9.4 Implementation

The hardware used for the study was a commercially available laptop with an Intel i5 eighth-generation processor, 8 GB of RAM, and a 4 GB GPU. The software used for the study was MATLAB 2018a. The number of images for each class in the PlantVillage data set is shown in Table 9.3. The images in the data set are of dimension 256 × 256 and were preprocessed to a dimension of 227 × 227 according to the specification of the input layer. Table 9.4 shows the SqueezeNet architecture with its different layers, including the filter dimension, stride, and padding value. As the SqueezeNet architecture has been trained with the ImageNet data set, layer 10, which is a convolution layer, has a channel depth of 1000, corresponding to the 1000 object categories to be classified. In our case, the depth is reconfigured to 38 channels to correspond to the number of categories in the data set. The weight and bias parameters in the newly configured layer are initialized randomly, while the remaining layers keep their pretrained parameters. As the architecture has been pretrained, the learning rate of all the layers, including the newly configured convolution layer, was kept at 0.0001. The training process consists of a forward and a backward pass. In the forward pass, a group of images known as a minibatch passes through each layer and a prediction score in terms of probability is estimated. The loss function evaluates the error, and the derivative of the loss function is estimated. By applying the chain rule, the derivative is backpropagated to find the gradients. The weight parameters are updated based on these gradients using the gradient descent optimization algorithm. This modifies the learnable parameters so that the network maps each input to the correct output class. During the training process, validation was performed every 1000 iterations in order to track the changes in validation accuracy over the epochs.
The hyperparameter settings applied for SqueezeNet are shown in Table 9.5. In addition to classification accuracy, the performance of the architecture can be evaluated using precision, recall, specificity, miss rate, and fallout, computed from the confusion matrix of the validation data set. The results obtained and the possible reasons for misclassification are discussed in the following section.
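The five properties named above can be derived per class from a confusion matrix. The following is a minimal sketch, not the authors' MATLAB code: the function name is hypothetical, rows are assumed to hold the true classes and columns the predicted classes, and every class is assumed to occur at least once as both a true and a predicted label so that no division by zero arises.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest metrics for each class of a square confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp      # true class i, predicted as another class
    fp = cm.sum(axis=0) - tp      # another class, predicted as class i
    tn = cm.sum() - tp - fn - fp
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "miss_rate": fn / (tp + fn),   # 1 - recall
        "fallout": fp / (fp + tn),     # 1 - specificity
    }

# Toy two-class example (made-up counts, not taken from the study).
metrics = per_class_metrics([[8, 2],
                             [1, 9]])
for name, values in metrics.items():
    print(name, np.round(values, 3))
```

For a 38-class problem the same function returns arrays of length 38, one value per class.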

9.5 Results and discussion

The training and validation of SqueezeNet were performed separately with segmented and original images. The trials were conducted with various splitting ratios of


Table 9.3 PlantVillage data set.

| Class number | Crop common name | Class (various diseases and healthy) | Causal pathogen | Number of images |
|---|---|---|---|---|
| 1 | Apple | Scab | Venturia inaequalis | 630 |
| 2 | Apple | Black rot | Botryosphaeria obtusa | 621 |
| 3 | Apple | Rust | Gymnosporangium juniperi-virginianae | 275 |
| 4 | Apple | Healthy | - | 1645 |
| 5 | Blueberry | Healthy | - | 1502 |
| 6 | Cherry | Healthy | - | 854 |
| 7 | Cherry | Powdery mildew | Podosphaera clandestina | 1052 |
| 8 | Corn | Cercospora leaf spot | Cercospora | 513 |
| 9 | Corn | Common rust | Puccinia sorghi | 1192 |
| 10 | Corn | Healthy | - | 1162 |
| 11 | Corn | Northern leaf blight | Exserohilum turcicum | 985 |
| 12 | Grape | Black rot | Guignardia bidwellii | 1180 |
| 13 | Grape | Esca (black measles) | Togninia minima | 1384 |
| 14 | Grape | Healthy | - | 423 |
| 15 | Grape | Leaf blight | Plasmaopara viticola | 1076 |
| 16 | Orange | Huanglongbing | Huanglongbing | 5507 |
| 17 | Peach | Bacterial spot | Xanthomonas campestris | 2297 |
| 18 | Peach | Healthy | - | 360 |
| 19 | Pepper | Bacterial spot | Xanthomonas vesicatoria | 997 |
| 20 | Pepper | Healthy | - | 1478 |
| 21 | Potato | Early blight | Alternaria solani | 1000 |
| 22 | Potato | Healthy | - | 152 |
| 23 | Potato | Late blight | Phytophthora infestans | 1000 |
| 24 | Raspberry | Healthy | - | 371 |
| 25 | Soybean | Healthy | - | 5090 |
| 26 | Squash | Powdery mildew | Erysiphales | 1835 |
| 27 | Strawberry | Healthy | - | 456 |
| 28 | Strawberry | Leaf scorch | Diplocarpon earlianum | 1109 |
| 29 | Tomato | Bacterial spot | Xanthomonas vesicatoria | 2127 |
| 30 | Tomato | Early blight | Alternaria solani | 1000 |
| 31 | Tomato | Healthy | - | 1591 |
| 32 | Tomato | Late blight | Phytophthora infestans | 1909 |
| 33 | Tomato | Leaf mold | Mycovellosiella fulva | 952 |
| 34 | Tomato | Septoria leaf spot | Septoria cytisi | 1771 |
| 35 | Tomato | Spider mites | Arachnida acari | 1676 |
| 36 | Tomato | Target spot | Corynespora cassicola | 1404 |
| 37 | Tomato | Mosaic virus | Tobamovirus | 373 |
| 38 | Tomato | Yellow leaf curl virus | Begomovirus | 5357 |
| Total images | | | | 54,306 |


Table 9.4 Reconfigured SqueezeNet architecture used for the study.

| Layers | Type | Dimension | Stride | Padding |
|---|---|---|---|---|
| | Data | 227 × 227 × 3 | - | - |
| Layer 1 | Convolution1 and ReLU | 64, 3 × 3 × 3 | 2 | 0 |
| | Maximum pooling1 | 3 × 3 | 2 | 0 |
| Layer 2 | Fire module | Squeeze (1 × 1) 16, ReLU | 1 | 0 |
| | | Expand (1 × 1) 64, ReLU | 1 | 0 |
| | | Expand (3 × 3) 64, ReLU | 1 | 1 |
| Layer 3 | Fire module | Squeeze (1 × 1) 16, ReLU | 1 | 0 |
| | | Expand (1 × 1) 64, ReLU | 1 | 0 |
| | | Expand (3 × 3) 64, ReLU | 1 | 1 |
| | Maximum pooling3 | 3 × 3 | 2 | 0, 1 |
| Layer 4 | Fire module | Squeeze (1 × 1) 32, ReLU | 1 | 1 |
| | | Expand (1 × 1) 128, ReLU | 1 | 1 |
| | | Expand (3 × 3) 128, ReLU | 1 | 1 |
| Layer 5 | Fire module | Squeeze (1 × 1) 32, ReLU | 1 | 0 |
| | | Expand (1 × 1) 128, ReLU | 1 | 0 |
| | | Expand (3 × 3) 128, ReLU | 1 | 1 |
| | Maximum pooling5 | 3 × 3 | 2 | 0, 1 |
| Layer 6 | Fire module | Squeeze (1 × 1) 48, ReLU | 1 | 0 |
| | | Expand (1 × 1) 192, ReLU | 1 | 0 |
| | | Expand (3 × 3) 192, ReLU | 1 | 1 |
| Layer 7 | Fire module | Squeeze (1 × 1) 48, ReLU | 1 | 0 |
| | | Expand (1 × 1) 192, ReLU | 1 | 0 |
| | | Expand (3 × 3) 192, ReLU | 1 | 1 |
| Layer 8 | Fire module | Squeeze (1 × 1) 64, ReLU | 1 | 0 |
| | | Expand (1 × 1) 256, ReLU | 1 | 0 |
| | | Expand (3 × 3) 256, ReLU | 1 | 1 |
| Layer 9 | Fire module | Squeeze (1 × 1) 64, ReLU | 1 | 0 |
| | | Expand (1 × 1) 256, ReLU | 1 | 0 |
| | | Expand (3 × 3) 256, ReLU | 1 | 1 |
| | Dropout, 50% | - | - | - |
| Layer 10 | Convolution and ReLU | 38, 1 × 1 | 1 | 0 |
| | Average pooling10 | 14 × 14 | 1 | 0 |
| | Softmax | - | - | - |
| | Classification output | - | - | - |

training and validation set, namely 70:30, 80:20, and 90:10. The accuracy obtained with each splitting ratio was compared. Figs. 9.6-9.8 show the training and validation plots for all the splitting ratios with the two types of images. According to the literature, an 80:20 splitting ratio is commonly used for assessing accuracy. Hence, in this study, the accuracy obtained using the color and segmented images with the 80:20 splitting ratio was analyzed. Figs. 9.9 and 9.10 show the confusion matrices obtained for the segmented and color data sets.
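The three splitting ratios can be illustrated with a simple random split over image indices. This is a sketch under stated assumptions rather than the procedure used in the study: the function name and the fixed seed are hypothetical, and the split is drawn over the whole data set, whereas a per-class (stratified) split would be needed to keep the class proportions identical in both sets.

```python
import random

def split_indices(n_images, train_fraction, seed=0):
    """Shuffle image indices and split them into training and validation sets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = round(n_images * train_fraction)
    return indices[:n_train], indices[n_train:]

# 54,306 is the total number of images in Table 9.3.
for fraction in (0.7, 0.8, 0.9):
    train, val = split_indices(54306, fraction)
    print(f"{fraction:.0%} training: {len(train)} train, {len(val)} validation")
```

With a 90:10 ratio only about 5400 images remain for validation, which is one reason a per-class accuracy estimated on that split rests on fewer samples.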


Table 9.5 Hyperparameters used for training SqueezeNet.

| Hyperparameters | Values |
|---|---|
| Initial learning rate | 0.0001 |
| Minibatch size | 32 |
| Epoch | 10 |
| Validation frequency | 1000 iterations |
| Optimization method | Stochastic gradient descent algorithm |
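The forward pass, loss evaluation, backpropagated gradient, and parameter update of stochastic gradient descent can be sketched on a toy model. The example below is an assumption-laden illustration, not the SqueezeNet training code: it uses a single linear layer with softmax and cross-entropy loss, random 10-dimensional inputs in place of images, and hypothetical function names, while borrowing the learning rate (0.0001), minibatch size (32), and number of classes (38) from the study.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(weights, bias, x, y, lr=1e-4):
    """One minibatch update of a linear softmax classifier (updates in place)."""
    n = x.shape[0]
    probs = softmax(x @ weights + bias)                 # forward pass
    loss = -np.log(probs[np.arange(n), y]).mean()       # cross-entropy loss
    grad_logits = probs.copy()                          # dLoss/dLogits
    grad_logits[np.arange(n), y] -= 1.0
    grad_logits /= n
    weights -= lr * (x.T @ grad_logits)                 # chain rule to weights
    bias -= lr * grad_logits.sum(axis=0)
    return loss

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))          # one minibatch of 32 toy samples
y = rng.integers(0, 38, size=32)       # labels for 38 classes
w = np.zeros((10, 38))
b = np.zeros(38)

losses = [sgd_step(w, b, x, y) for _ in range(5)]
print(losses[0], losses[-1])
```

With zero-initialized weights the first loss equals ln(38), the loss of a uniform guess over 38 classes, and it decreases slowly because the learning rate is small.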

Figure 9.6 Training progress for (A) segmented image with 70:30 ratio and (B) color image with 70:30 ratio.


Figure 9.7 Training progress for (A) segmented image with 80:20 ratio and (B) color image with 80:20 ratio.

The graphs clearly show that convergence starts immediately at epoch 1 and that the oscillation stabilizes as the number of epochs increases. There was no significant difference in the training and loss curves for the different splitting ratios with the segmented and original images. The only difference was an increase in the oscillation of the curve in the case of the segmented images. This may be due to


Figure 9.8 Training progress for (A) segmented image with 90:10 ratio and (B) color image with 90:10 ratio.

the removal of essential pixels during the segmentation process, which increases the variation in minibatch accuracy and loss. The validation accuracies obtained with the segmented images using splitting ratios of 70:30, 80:20, and 90:10 were 97.01%, 97.08%, and 97.72%, respectively. There is no significant difference in accuracy between the splitting ratios, which suggests that overfitting does not influence the architecture. Surprisingly,


Figure 9.9 Confusion matrix for the segmented images of PlantVillage data set.


Figure 9.10 Confusion matrix for the color images of PlantVillage data set.


the validation accuracy using the original color images with the same splitting ratios was 98.04%, 98.24%, and 98.49%. These results show that SqueezeNet performs better with color images than with segmented images. Similar results have been reported for AlexNet and GoogLeNet by Mohanty et al. [19] with the PlantVillage data set. Although the validation accuracy was higher for color images, the training and validation losses were much closer in the case of the segmented images with the splitting ratio of 70:30. Training and validation took more than approximately 2 hours. As the number of images in the training set increases with the splitting ratio, the time taken also increases. The maximum time taken for training and validation with segmented images was 3.5 hours, with the splitting ratio of 90:10. Although the overall accuracy provides insight into the discrimination ability of SqueezeNet, the per-class variation gives a clearer picture of which classes reduce the classification accuracy. The confusion matrix for segmented images (Fig. 9.9) shows that many misclassifications occurred in the tomato crop. These misclassifications decreased for the tomato classes when color images were used. In the case of Cercospora leaf spot of corn, the misclassification increased with the color image data set. Table 9.6 shows the accuracy for each class with the different splitting ratios using segmented and color images. When the accuracy of individual classes is examined, poor discrimination can be found in a few of them. The poorest classification accuracy (65%) was observed for Cercospora leaf spot of corn, which is also evident from the confusion matrix. It was mainly misclassified as northern leaf blight of corn (33.3%), which significantly affected its accuracy. When the visual symptoms were compared, the pattern of brown spots was found to resemble the streaks of northern corn leaf blight.
This may cause significant difficulty in learning the features of the disease. Another possible reason is the number of images in the data set. There are 513 images of Cercospora leaf spot, whereas there are 985 images of northern leaf blight. As northern leaf blight has more images, the model tends to misclassify Cercospora leaf spot as northern corn leaf blight because the features of the former class are poorly learned. The combined effect of these factors affects the discrimination of the pattern for this class. The other class that showed misclassification was tomato early blight. Interestingly, both of these classes were misclassified as other classes of the same crop species. This implies that the network learns the features of the leaves in addition to the symptoms. This is also evident from the classification accuracy of the healthy classes of different crops. The analysis of classification accuracy shows that SqueezeNet performance was equivalent to AlexNet performance with fewer parameters. Fine-tuning of the architecture was not carried out in this study. Even without fine-tuning, the accuracy was greater than 96%.


Table 9.6 Accuracy (%) of each class for the two image sets and different splitting ratios.

| Class number | Disease common name | Segmented 70:30 | Segmented 80:20 | Segmented 90:10 | Color 70:30 | Color 80:20 | Color 90:10 |
|---|---|---|---|---|---|---|---|
| 1 | Apple scab | 94.7 | 96.8 | 96.8 | 99.4 | 99.2 | 100 |
| 2 | Apple black rot | 97.3 | 97.6 | 100 | 98.4 | 96.8 | 100 |
| 3 | Apple rust | 98.8 | 94.5 | 100 | 100 | 100 | 100 |
| 4 | Apple healthy | 99.8 | 98.8 | 99.4 | 99.6 | 98.5 | 97.6 |
| 5 | Blueberry healthy | 99.6 | 99.0 | 100 | 99.8 | 99.7 | 99.3 |
| 6 | Cherry healthy | 98.4 | 100 | 100 | 99.4 | 99.5 | 99.0 |
| 7 | Cherry powdery mildew | 98.4 | 100 | 97.6 | 99.6 | 99.4 | 98.8 |
| 8 | Corn Cercospora leaf spot | 79.9 | 82.5 | 84.3 | 79.9 | 65.0 | 96.0 |
| 9 | Corn common rust | 98.0 | 98.7 | 98.3 | 99.7 | 98.7 | 99.2 |
| 10 | Corn healthy | 96.2 | 91.9 | 95.9 | 97.9 | 100 | 91.8 |
| 11 | Corn northern leaf blight | 100 | 100 | 100 | 100 | 100 | 100 |
| 12 | Grape black rot | 91.2 | 99.2 | 97.6 | 97.1 | 98.3 | 97.5 |
| 13 | Grape esca (black measles) | 99.8 | 97.5 | 100 | 99.2 | 98.9 | 98.6 |
| 14 | Grape healthy | 99.0 | 100 | 99.0 | 98.8 | 99.5 | 100 |
| 15 | Grape leaf blight | 99.2 | 98.8 | 100 | 99.2 | 100 | 100 |
| 16 | Orange huanglongbing | 99.8 | 99.5 | 99.8 | 100 | 99.9 | 100 |
| 17 | Peach bacterial spot | 99.1 | 98.9 | 100 | 99.6 | 98.3 | 99.6 |
| 18 | Peach healthy | 99.0 | 94.4 | 100 | 100 | 95.8 | 97.2 |
| 19 | Pepper bacterial spot | 97.7 | 98.5 | 97.0 | 94.6 | 98.9 | 99.0 |
| 20 | Pepper healthy | 99.3 | 99.3 | 99.3 | 99.8 | 99.7 | 99.3 |
| 21 | Potato early blight | 99.7 | 95.0 | 100 | 99.7 | 100 | 98.0 |
| 22 | Potato healthy | 92.3 | 98.5 | 96.0 | 96.3 | 99.0 | 95.0 |
| 23 | Potato late blight | 95.7 | 100 | 100 | 97.8 | 100 | 100 |
| 24 | Raspberry healthy | 100 | 98.6 | 100 | 98.2 | 97.3 | 100 |
| 25 | Soybean healthy | 99.5 | 99.1 | 99.8 | 99.0 | 99.6 | 99.2 |
| 26 | Squash powdery mildew | 100 | 99.2 | 99.5 | 99.8 | 99.2 | 99.4 |
| 27 | Strawberry healthy | 97.9 | 99.5 | 100 | 99.0 | 100 | 100 |
| 28 | Strawberry leaf scorch | 100 | 100 | 100 | 100 | 100 | 100 |
| 29 | Tomato bacterial spot | 96.2 | 89.2 | 98.6 | 96.4 | 96.5 | 97.7 |
| 30 | Tomato early blight | 93.0 | 64.5 | 74.0 | 79.7 | 91.5 | 87.0 |
| 31 | Tomato healthy | 91.4 | 97.1 | 95.3 | 95.9 | 96.0 | 96.9 |
| 32 | Tomato late blight | 95.5 | 91.6 | 95.8 | 97.2 | 98.9 | 95.8 |
| 33 | Tomato leaf mold | 89.0 | 94.9 | 95.5 | 95.1 | 98.0 | 98.9 |
| 34 | Tomato septoria leaf spot | 93.2 | 95.8 | 98.8 | 99.0 | 97.6 | 97.6 |
| 35 | Tomato spider mites | 89.8 | 96.4 | 82.6 | 94.5 | 93.9 | 97.1 |
| 36 | Tomato target spot | 95.8 | 98.7 | 99.0 | 99.6 | 98.6 | 99.8 |
| 37 | Tomato mosaic virus | 96.4 | 98.7 | 94.6 | 97.3 | 94.7 | 89.2 |
| 38 | Tomato yellow leaf curl virus | 99.8 | 98.4 | 95.6 | 95.6 | 99.4 | 100 |


The confusion matrix was assessed using the following properties: precision, recall, specificity, miss rate, and fallout. In the case of segmented images, as many of the disease classes in the tomato crop were misclassified, a higher fallout value was obtained for those classes. Specifically, three classes, namely tomato early blight, healthy, and target spot, have lower precision values. The recall value for tomato early blight was 0.645, which shows poor performance. The fallout values of tomato target spot and Cercospora leaf spot were the highest, at 0.183 and 0.132, respectively. The miss rates for all the disease classes were significantly lower. In the case of color images, the performance improved for many of the tomato crop diseases. The fallout of tomato target spot, which was high for segmented images, dropped to 0.0538, which signifies that the number of false positives was reduced. Cruz et al. [11] reported an accuracy of 93.77% in predicting grapevine yellow in leaf images of the grape crop. Their model was trained and tested with a limited set of disease classes of the grape crop. In another study, Durmus et al. [18] performed the experiment with a tomato disease data set that is part of the PlantVillage data set and reported an accuracy of 94.3%. Both of these studies used a limited set of disease classes and reported a lower accuracy. In our study, an average accuracy greater than 97% was obtained using segmented and color leaf images of the entire PlantVillage data set. Previous studies have shown that although the performance of DCNNs with the PlantVillage data set is promising, the performance on real-world test images falls significantly [12,13,19]. As the images used for training do not have a complex background, the model fails to discriminate images with a complex background and varying illumination conditions.
However, the availability of data sets captured under different conditions for such a large number of diseases and crops is limited. A fragmented plant leaf may also change the visual information on the symptoms used for classification, which requires further evaluation. The other major issue is the amount of computer resources required for the training and testing phases. According to Iandola et al. [29], the memory requirements for AlexNet and SqueezeNet with a 32-bit data type were 240 MB and 4.8 MB, respectively. The much lower memory requirement of SqueezeNet raises the possibility of implementation on a commercially available smartphone. With a larger data set covering various diseases and complex backgrounds, SqueezeNet can be trained and implemented as an application on an Android-based smartphone. Further work is needed to evaluate the performance of SqueezeNet with a large data set and complex backgrounds.
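The relation between parameter count and memory footprint quoted above can be checked with simple arithmetic: each 32-bit parameter occupies 4 bytes. In the sketch below the function name is hypothetical, and the parameter counts (about 60 million for AlexNet and about 1.2 million for SqueezeNet) are rounded figures assumed here for illustration; they are not stated in this chapter.

```python
def model_size_mb(n_params, bits_per_param=32):
    """Approximate storage for the learnable parameters, in megabytes (10^6 bytes)."""
    return n_params * bits_per_param / 8 / 1e6

# Assumed, rounded parameter counts.
print(model_size_mb(60_000_000))  # 240.0
print(model_size_mb(1_200_000))   # 4.8
```

The same function shows how a narrower data type would shrink the model further, for example `model_size_mb(1_200_000, bits_per_param=8)` for 8-bit storage.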


9.6 Conclusion

In this study, we have given an overview of the deep learning approaches used for disease diagnosis. We have also explored a case study of one of the least explored models, SqueezeNet, with the PlantVillage data set. As SqueezeNet provides performance equivalent to AlexNet with fewer parameters, it opens the possibility of implementation on a commercially available, cost-effective smartphone for the real-time diagnosis of crop diseases. SqueezeNet achieved its best classification accuracy of 98.49% with a splitting ratio of 90:10; the detailed analysis, however, was done with the 80:20 ratio because the 90:10 split leaves fewer test images. In addition, 80:20 is the most commonly used splitting ratio and has shown consistent performance. We explored the classes that affected the classification accuracy and discussed possible reasons for the misclassification. According to the literature, classification accuracy falls significantly with a complex background; hence more evaluation is required with other data sets consisting of images with complex backgrounds and varying environmental conditions. In the future, the work can be extended by using SqueezeNet with a large data set of different crops and complex backgrounds. Further, the system can be implemented on smartphones so that it can identify crop diseases directly in the field.

Acknowledgment

This research work was funded by “Department of Biotechnology (DBT), Government of India, BT/IN/Indo-US/Foldscope/39/2015 dated 20.03.2018.”

References

[1] A. Kamilaris, F.X.P. Boldu, Deep learning in agriculture: a survey, Comput. Electron. Agric. 147 (2018) 70-90.
[2] D.K. Kim, T. Chen, Deep neural network for real-time autonomous indoor navigation. Computer Vision and Pattern Recognition. arXiv:1511.046682, 2015.
[3] Y. Zhu, R. Mottahgi, E. Kolve, J.J. Lim, A. Gupta, L.F. Fei, et al., Target-driven visual navigation in indoor scenes using deep reinforcement learning, IEEE International Conference on Robotics and Automation, IEEE, Singapore, 2017, pp. 3357-3364.
[4] A.M. Al-Shaharni, M.A. Al-Abadi, A.S. Al-Malki, Automated system for crops recognition and classification, in: A.S. Ashour, N. Dey (Eds.), Computer Vision: Concepts, Methodologies, Tools and Applications, IGI Global, 2018. Available from: http://doi.org/10.4018/978-1-5225-5204-8.ch050.
[5] J. Chaki, N. Dey, L. Moraru, S. Fuqian, Fragmented plant leaf recognition: bag-of-features, fuzzy-color and edge-texture histogram descriptors with multi-layer perceptron, Optik 181 (2019) 11168.
[6] J.G.A. Barbedo, A review on the main challenges in automatic plant disease identification based on visible range images, Biosyst. Eng. 144 (2016a) 52-60.
[7] J.G.A. Barbedo, L.V. Koenigkan, T.T. Santos, Identifying multiple plant diseases using digital image processing, Biosyst. Eng. 147 (2016b) 104-116.
[8] M. Brahimi, K. Boukhalfa, A. Moussaoui, Deep learning for tomato diseases: classification and symptoms visualization, Appl. Artif. Intell. 31 (2017) 299-315.
[9] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27-48.


[10] E.C. Too, L. Yujian, S. Njuki, L. Yingchun, A comparative study of fine-tuning deep learning models for plant disease identification, Comput. Electron. Agric. 161 (2019) 272-279.
[11] A. Cruz, Y. Ampatzidis, R. Pierro, A. Materazzi, A. Panattoni, L.D. Bellis, et al., Detection of grapevine yellow symptoms in vitis vinifera L. with artificial intelligence, Comput. Electron. Agric. 157 (2019) 63-76.
[12] J.G.A. Barbedo, Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification, Comput. Electron. Agric. 153 (2018a) 46-53.
[13] J.G.A. Barbedo, Factors influencing the use of deep learning for plant disease recognition, Biosyst. Eng. 172 (2018) 84-91.
[14] K.P. Ferentinos, Deep learning models for plant disease detection and diagnosis, Comput. Electron. Agric. 145 (2018b) 311-318.
[15] G. Wang, Y. Sun, J. Wang, Automatic image-based plant disease severity estimation using deep learning, Comput. Intell. Neurosci. 2017 (2017). Available from: https://doi.org/10.1155/2017/2917536.
[16] J. Amara, B. Bouaziz, A. Algergawy, A deep learning-based approach for banana leaf disease classification, in: Lecture Notes in Informatics, 2017, pp. 79-88.
[17] J. Shijie, J. Peiyi, Hu Siping, L. Haibo, Automatic detection of tomato diseases and pests based on leaf images, Chinese Automation Congress (CAC), IEEE, Jinan, 2017, pp. 2537-2540. Available from: http://doi.org/10.1109/CAC.2017.8243388.
[18] H. Durmus, E.O. Gunes, M. Korei, Disease detection on the leaves of the tomato plants by using deep learning, in: International Conference on Agro-Geoinformatics, Fairfax, VA, 2017, pp. 1-5.
[19] S.P. Mohanty, D.P. Hughes, M. Salathe, Using deep learning for image-based plant disease detection, Front. Plant Sci. 7 (2016). Article ID 1419.
[20] D.P. Hughes, M. Salathe, An open access repository of images on plant health to enable the development of mobile disease diagnostics. Computers and Society, in: arXiv: 1511.08060, 2015.
[21] A. Picon, A. Alvarez-Gila, M. Seitz, A. Ortiz-Barredo, J. Echazarra, Deep convolutional neural networks for mobile capture device-based crop disease classification in the wild, Comput. Electron. Agric. (2018). Available from: https://doi.org/10.1016/j.compag.2018.04.002.
[22] C.R. Rahman, P.S. Arko, M.E. Ali, M.A.I. Khan, S.H. Apon, F. Norwin, et al., Identification and recognition of rice disease and pests using convolutional neural networks (Unpublished). https://arxiv.org/abs/1812.01043, 2019.
[23] J. Chen, Q. Liu, L. Gao, Visual tea leaf disease recognition using a convolutional neural network model, Symmetry 11 (2019) 343. Available from: https://doi.org/10.3390/sym11030343.
[24] U.P. Singh, S.S. Chouhan, S. Jain, S. Jain, Multilayer convolution neural network for the classification of mango leaves infected by anthracnose disease, in: IEEE Access 7, 2019, pp. 43721-43729.
[25] M. Sibiya, M. Sumbwanyambe, A computational procedure for the recognition and classification of maize leaf diseases out of healthy leaves using convolutional neural networks, AgriEngineering 1 (2019) 119-131.
[26] J. Ma, K. Du, F. Zheng, L. Zhang, Z. Gong, Z. Sun, A recognition method for cucumber diseases using leaf symptom images based on deep convolutional neural networks, Comput. Electron. Agric. 154 (2018) 18-24.
[27] C. Dechant, T. Wiesner-Hanks, S. Chen, E.L. Stewart, J. Yosinski, M.A. Gore, et al., Automated identification of northern leaf blight-infected maize plants from field imagery using deep learning, Phytopathology 107 (2017) 1426-1432.
[28] Y. Lu, S. Yi, N. Zeng, Y. Liu, Y. Zhang, Identification of rice diseases using deep convolutional neural networks, Neurocomputing 267 (2017) 378-384.
[29] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. Computer Vision and Pattern Recognition, in: arXiv: 1602.07360, 2017.
[30] Z. Qin, Z. Zhang, X. Chen, C. Wang, Y. Peng, Fd-mobilenet: Improved mobilenet with a fast downsampling strategy, IEEE International Conference on Image Processing, IEEE, Greece, 2018, pp. 1363-1367.
[31] Y. Lecun, L. Bottou, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278-2324.


[32] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural network, in: Advances in Neural Information Processing Systems, USA, 2012, pp. 1097-1105.
[33] D. Wang, X. Wang, L. Shaohe, End-to-End mandarin speech recognition combining CNN and BLSTM, Symmetry 11 (5) (2019). Available from: https://doi.org/10.3390/sym11050644.
[34] B. Xu, N. Wang, T. Chen, M. Li, Empirical of rectified activation in convolutional network, in: arXiv preprint arXiv:1505.00853, 2015.
[35] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, USA, 2016, pp. 770-778.
[36] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, USA, 2015.
[37] G. Huang, Z. Liu, L. Maatern, K.Q. Weinberger, Densely connected convolutional networks, in: IEEE Conference on Computer Vision and Pattern Recognition, USA, 2017, pp. 2261-2269.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., Going deeper with convolution, in: IEEE International Conference on Computer Vision and Pattern Recognition, USA, 2015.


Index Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.

A AAMI. See Advancement of Medical Instrumentation (AAMI) Acclimatization period, 106 Accuracy (ACC), 15, 71, 91 Activation function (AF), 22 23 AD. See Alzheimer’s disease (AD) AdaRank model, 130 ADNI. See Alzheimer’s Disease Neuroimaging Initiative (ADNI) Advancement of Medical Instrumentation (AAMI), 26 AF. See Activation function (AF) AI. See Artificial intelligence (AI) AIBL. See Australian Imaging, Biomarker and Lifestyle Flagship Study of Aging (AIBL) AlexNet model, 12, 174 Alzheimer’s disease (AD), 63, 74t binary classification, 65 66 using deep learning approach, 66 67 classification results, 71 75 experimental settings, 70 71 experiments of AD vs. HC classification, 72t learning algorithms, 71t literature review, 65 67 methods, 67 70 CNN training and feature extraction, 68 70 data acquisition and preprocessing, 67 68, 68f describing general methodology, 67f training and classification with other algorithms, 70 Alzheimer’s Disease Neuroimaging Initiative (ADNI), 64 65, 82 aNMM model. See Attention based neural matching model (aNMM model) ANNs. See Artificial neural networks (ANNs) Approximations, 108 ARC-I model, 135 ARC-II model, 137 Area under the curve (AUC), 29, 71 Arrhythmia, 1 2, 22, 25 26, 28f, 29t

beat-segmented and preprocessed ECG signal, 26f category and annotation, 26t classification using support vector machine, 31 Artificial intelligence (AI), 22 Artificial neural networks (ANNs), 22 23, 156 157, 159, 173 174 Attention model, matching with, 138 139 Attention based neural matching model (aNMM model), 144 AUC. See Area under the curve (AUC) Australian Imaging, Biomarker and Lifestyle Flagship Study of Aging (AIBL), 64 65 Autoencoder, 39 41, 156 157, 162 autoencoder-based deep network, 164 network, 40f Automated ECG classification, 1 2

B
Backprojection, 161
  reconstruction, 161, 166f
Bidirectional Encoder Representations from Transformers (BERT), 126–127, 139–141, 148–149
BM25, 125–126
Breast cancer, 99
  risk factors, 99, 100t
Breast lesion growth, 101
Breast thermography, 100–108, 105f
  images acquisition protocol, 103–108, 107f
    breast thermal imaging protocol, 106f
    patient preparation, 105–106
    room preparation, 104–105
    temperature changes of breasts, 104f

C
CAD. See Coronary artery disease (CAD)
Canonical ELMs, 163
Cardiac abnormalities, 1
Cardiac cycle, 3
CDSSM. See Convolutional DSSM (CDSSM)


CEDR. See Contextualized Embeddings for Document Ranking (CEDR)
Cerebrospinal fluid data (CSF), 66
CG. See Conjugate gradient (CG)
CNN. See Convolutional neural network (CNN)
Compressive sensing (CS), 158
Computed tomography (CT), 155–156
Conductivity, 155–156
  distribution, 161
Confusion matrix, 27, 29f, 192
  for color images of PlantVillage data set, 189f
  for segmented images of PlantVillage data set, 188f
Conjugate gradient (CG), 158
Contextualized Embeddings for Document Ranking (CEDR), 140–141
Continuous wavelet transform (CWT), 2, 5–6
Conv-KNRM. See Convolutional Neural Networks for Soft-Matching N-Grams model (Conv-KNRM)
Convolutional DSSM (CDSSM), 134
Convolutional neural network (CNN), 2, 6–8, 6f, 12, 22–25, 69f, 101–102, 137, 139, 173–174. See also Recurrent neural network (RNN)
  architecture, 176–179
    for classification, 177f
    literature on application for disease diagnosis, 177t
    sample inception module, 180f
    sample ResNet block, 179f
  arrhythmia disease classification using, 27–30
  CNN-based architectures, 174
  CNN-based deep learning model, 2
  CNN-based methods, 134–135
  convolutional layer, 7, 7f
  fully connected layer, 8
  MI disease classification using, 31–33, 32f, 33t
  pooling layer, 7–8, 8f
  process of classification of extracted features, 70f
  training and feature extraction, 68–70
  training CNN model, 14
Convolutional Neural Networks for Soft-Matching N-Grams model (Conv-KNRM), 145
Coronary artery disease (CAD), 47, 49f
  classification accuracies, 55f
  comparison of related works, 58t

  deep analysis, 47–56
  deep ELM
    with HessELM kernel, 53t, 55t, 57t
    with Moore–Penrose kernel, 53t, 54t, 56t
  identification performances depending on model of DBN, 56t
  IMF-based data sets on DBN, 54t
  short-term ECG on DBN, 52t
Cosine similarity, 131
Crop disease classification using deep learning approach
  accuracy of each class for different sets and different splitting ratio, 191t
  CNN architecture, 176–179
  implementation, 182
  literature on application of standard deep learning models, 175t
  literature survey, 174–176
  PlantVillage data set, 183t
    confusion matrix for color images, 189f
    confusion matrix for segmented images, 188f
  results, 182–192
  SqueezeNet architecture, 180–181
    hyperparameters used for training, 185t
    reconfigured, 184t
Cross term, 130
Cross term retrieval (CRTER), 130
CS. See Compressive sensing (CS)
CSF. See Cerebrospinal fluid data (CSF)
CT. See Computed tomography (CT)
CWT. See Continuous wavelet transform (CWT)

D
Data acquisition in AD, 67–68, 68f
Database, 8–9
DBNs. See Deep belief networks (DBNs)
DCNN models. See Deep CNN models (DCNN models)
Decision fusion, 13–14, 14f
Decoder, 162
Decoding, 39–40
Decomposable model, 139
Deep autoencoder kernels, 42–47, 42f
  deep analysis of CAD, 47–56
  deep ELM autoencoder, 45–47
Deep belief networks (DBNs), 38–39
Deep CNN models (DCNN models), 174


Deep extreme learning machines (Deep ELMs), 38–39
  autoencoder, 45–47, 46f
  with HessELM kernel, 53t, 55t, 57t
  with Moore–Penrose kernel, 53t, 54t, 56t
Deep learning (DL), 22–23, 37, 66–67, 79, 101–102, 125. See also Extreme learning machine (ELM)
  algorithms, 37–39
  approach to IR, 131–146
    matching function learning methods, 136–142
    relevance learning methods, 142–146
    representation learning-based methods, 131–136, 132f
  DL-based models, 173
Deep learning methods. See Deep learning models (DLMs)
Deep learning models, 126–127
Deep neural network (DNN), 101–102
  DNN-based methods, 132–134
  models, 126–127
Deep semantic text matching, 126–127
Deep structured semantic model (DSSM), 132, 133f
DeepRank model, 145–146, 146f
Deep-wavelet neural networks (DWNN), 108–114, 109f
  breast thermography, 102–108
  classification, 114–118
    experimental results, 115–118
  downsampling, 112–113
  filter bank, 109–112
  parameters set for classifier, 114t
  related works, 101–102
  synthesis block, 113–114
DenseNet, 89–90, 89f, 179
Digital thermographic images, 103
Dirichlet prior model, 128
Discrete wavelet transform (DWT), 2, 4, 5f, 23–24
DL. See Deep learning (DL)
DNN. See Deep neural network (DNN)
Document length
  concepts, 125–126
  normalization, 127
Downsampling, 108, 112–113
DRMM model, 144

  deep matching model, 144
DSSM. See Deep structured semantic model (DSSM)
Dual-tree complex wavelet transform (DTCWT), 2
Duet model, 141–142, 143f
DWNN. See Deep-wavelet neural networks (DWNN)
DWT. See Discrete wavelet transform (DWT)
Dynamic protocol, 101–102

E
ECG. See Electrocardiogram (ECG)
EEMD. See Ensemble EMD (EEMD)
EIDORS. See Electrical Impedance and Diffuse Optical Tomography Reconstruction Software (EIDORS)
EIT. See Electrical impedance tomography (EIT)
Electrical Impedance and Diffuse Optical Tomography Reconstruction Software (EIDORS), 157–158
Electrical impedance tomography (EIT), 100, 155–156, 164–168
  autoencoders, 162
  experiments, 163–164
  extreme learning machines, 162–163
  image reconstruction techniques, 161
  materials and methods, 160–164
  operation, 156f
  problems and reconstruction, 160–161
  reconstruction method, 163, 164f
  related works, 157–160
Electrical potentials, 161, 163–164
Electrical resistivity tomography, 156
Electrocardiogram (ECG), 1, 3, 21–22, 38–39, 47–48, 51f
  cardiac cycle, 3
  methodology, 9–15
    classification based on deep learning, 12
    decision fusion, 13–14
    performance parameter, 15
    preprocessing, 10–12
    proposed ECG classification model, 9f
    training CNN model, 14
  QRS wave, 3–4
  results, 15–17
    accuracy plot of proposed CNN, 15f


    confusion matrix of heart abnormality classification, 16f
  theory related to electrocardiogram analysis
    CNN, 6–8
    CWT, 5–6
    database, 8–9
    DWT, 4, 5f
  train and test samples considered for, 27t
Electromagnetic waves, 102
ELM. See Extreme learning machine (ELM)
EMD. See Empirical mode decomposition (EMD)
Empirical mode decomposition (EMD), 48
Encoder, 162
Encoding, 39–40
Energy-time-frequency features, 38–39
Ensemble EMD (EEMD), 49–50
  random segment at, 50f
Extreme learning machine (ELM), 43, 43f, 114–115, 156–157, 162–163. See also Deep learning (DL)
  activation functions and kernels performance, 166f
  autoencoder, 43–45
  calculation of output weights, 163f
  deep ELM autoencoder, 45–47
  functions and kernels tested in, 165f
  percentage errors, 165f

F
F1-score, 91
False negative (FN), 91
False positive (FP), 91
Feature-based BERT models, 140
Features extraction, 108
FEM. See Finite element model (FEM)
Fiducial techniques, 48
Filter bank of DWNN, 108–112, 108f
Finite element model (FEM), 157
Fire module of SqueezeNet, 180, 181f
Fish School Search (FSS), 158–159
FLIR Systems thermographic camera, 108
FN. See False negative (FN)
FP. See False positive (FP)
FSS. See Fish School Search (FSS)

G
GA. See Genetic algorithm (GA)

Gated recurrent unit (GRU), 136
Gauss-Newton method, 157–158
GBRank model, 130
Genetic algorithm (GA), 158–159
GM. See Gray matter (GM)
GNU/Octave software, 114, 163–164
GoogLeNet, 174, 179
GPT-2, 126–127
Gray matter (GM), 65–66
GRU. See Gated recurrent unit (GRU)

H
Handcrafted ranking models, 125–126
Heart rate variability (HRV), 47–48
Heat generation, 101
Hessenberg decomposition-based ELM (HessELM), 46
  deep ELM with HessELM kernel, 53t, 55t, 57t
HFS. See Hybrid forward sequential selection (HFS)
HHT. See Hilbert–Huang transform (HHT)
Hierarchical Neural Matching model (HiNT), 146
High-pass filters, 108, 110
Hilbert spectral analysis, 50–51
Hilbert–Huang transform (HHT), 48
HiNT. See Hierarchical Neural Matching model (HiNT)
HRV. See Heart rate variability (HRV)
Human-designed matching functions, 125–126
Hybrid forward sequential selection (HFS), 65

I
ICA. See Independent component analysis (ICA)
IDF. See Inverse document frequency (IDF)
IGN method. See Improved Gauss-Newton method (IGN method)
ILSVRC. See ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Image augmentation, 85
ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 80–81
IMFs. See Intrinsic mode functions (IMFs)
Improved Gauss-Newton method (IGN method), 159–160
Incandescent lamp, 102
Independent component analysis (ICA), 2
Information retrieval (IR), 125
  analyses, 146–149


  deep learning approaches to, 131–146
  experimental results, 148t
  traditional approaches to, 127–130
    basic retrieval models, 127–128
    learning to rank based models, 130
    semantic-based models, 128–129
    term dependency-based models, 129–130
Infrared (IR)
  cameras, 103
  images, 101–103, 115
  SVM model, 130
Interaction matrix, 137
Intrinsic mode functions (IMFs), 49–50
Inverse document frequency (IDF), 125–127
IR. See Information retrieval (IR)
Isovolumetric contraction period, 3
Isovolumetric relaxation period, 3
Iterative algorithms, 161

K
K nearest neighbors (KNN), 1
Kappa index, 115
Kappa statistics, 116
Kelvin scale, 102
Kernel Pooling as Matching Function (K-NRM), 144–145
Kernel functions, 144
K-fold cross-validation methods, 115
k-nearest neighbors (k-NN), 70, 81
KNN. See K nearest neighbors (KNN)
K-NRM. See Kernel Pooling as Matching Function (K-NRM)

L
Lambda Rank model, 130
LambdaMART model, 130
Laplace equation, 156, 161
LDA. See Linear discriminant analysis (LDA)
Learning to rank based models, 130
Learning vector quantization (LVQ), 1
Lesion classification, 117–118
Lesion detection, 115–117
Linear discriminant analysis (LDA), 13–14, 173–174
  LDA-based document model, 129
ListMLE model, 130
ListNet model, 130
LM, 125–126

Logistic sigmoid function, 162
Long short-term memory (LSTM), 136
Low-pass filters, 108, 111f
LSTM. See Long short-term memory (LSTM)
LSTM-RNN model, 148
LVQ. See Learning vector quantization (LVQ)

M
Machine learning (ML), 79, 155
  algorithms, 173–174
  approaches, 173
Magnetic resonance imaging (MRI), 64
Mallat algorithm, 108
Mammography, 100–101
Markov random field retrieval model, 129–130, 147
Match-SRNN, 138
Matching functions, 126–127
  learning methods, 136–142, 137f
  matching with attention model, 138–139
  matching with transformer model, 139–141
  matching with word-level similarity matrix, 137–138
  representation learning and, 141–142
MatchPyramid, 138
MCI. See Moderate cognitive impairment (MCI)
McRank model, 130
Mechanical ventilation, monitoring of, 156
Medical imaging (MI), 155
mELM. See Morphological extreme learning machine (mELM)
MI. See Medical imaging (MI); Myocardial infarction (MI)
Minimal Interval Resonance Imaging in Alzheimer’s Disease (MIRIAD), 67, 68t
ML. See Machine learning (ML)
MLP. See Multilayer perceptron (MLP)
MobileNet, 90, 90f, 176
Moderate cognitive impairment (MCI), 64–65
Modified AlexNet model, 12, 13f
Moore–Penrose inversing, 44
Moore–Penrose kernel, 52–53
  deep ELM with, 53t, 54t, 56t
Morphological extreme learning machine (mELM), 114–115
Morlet wavelet, 5–6
MRI. See Magnetic resonance imaging (MRI)
Multilayer perceptron (MLP), 1, 66, 114


Musculoskeletal radiographs (MURA), 80–81
  challenges, 85
  description of data set, 83–84
    total radiographs in finger, wrist, and shoulder data set, 85t
    total train and valid images, 83f, 84f
  experimental results, 91–96
    comparison of performance metrics, 92t
    confusion matrix, 93f
    finger radiographic image classification, 92–93
    model of wrist images, 94t
    models of shoulder images, 95t
    models of shoulder test images, 95t
    shoulder radiographic image classification, 94–96, 96f
    training accuracies, 92t
    wrist radiographic image classification, 93–94, 95f
    wrist test images, 94t
  pretrained CNN architecture diagram, 80f
  proposed methodologies, 85–90
    data preprocessing, 85
    DenseNet, 89–90
    inception, 85–87
    MobileNet, 90
    V3 architecture diagram, 86f
    VGG-19, 87–89, 88f
    Xception, 87, 88f
  related works, 81–82
  statistical indicators, 91
Myocardial infarction (MI), 22, 26–27
  comparison of proposed work against literature, 33, 34t

N
Naïve Bayes approach (NB approach), 173–174
Network architecture, 25
Neural networks (NNs), 125
Neural tensor network, 135, 136f
n-gram embeddings, 145
NMRI. See Nuclear magnetic resonance imaging (NMRI)
NNs. See Neural networks (NNs)
Noniterative algorithms, 161
Nuclear magnetic resonance imaging (NMRI), 100, 155–156

O
Obstructive sleep apnea (OSA) detection, 2
OC SVM model, 130
Okapi model, 128
Optimization metrics, 162

P
PACRR. See Position-Aware Neural IR model (PACRR)
PageRank, 125–126
Particle swarm optimization (PSO), 159
Patch-based method, 158
Patterns recognition, 108
PCA. See Principal component analysis (PCA)
PCG. See Phonocardiogram (PCG)
PDIPM. See Primal/dual interior point method (PDIPM)
Peak-to-noise ratio (PSNR), 160, 167
  boxplots of PSNR between reconstructions, 167f
Permittivity, 155–156
PET. See Positron emission tomography (PET)
Phonocardiogram (PCG), 1
Physikalisch-Technische Bundesanstalt Diagnosis Database (PTBDB), 26–27
Pivoted normalization model, 128
PL. See Pooling layer (PL)
PlantVillage data set, 174, 176, 183t
  confusion matrix for color images, 189f
  confusion matrix for segmented images, 188f
PNN. See Probabilistic neural network (PNN)
Poisson equation, 160–161
Pooling layer (PL), 24–25
Pooling operation, 24–25
Position-Aware Neural IR model (PACRR), 146
Positron emission tomography (PET), 64
Prank model, 130
Precision (P), 91
Preprocessing, in AD, 67–68, 68f
Pretrained BERT model, 141
Pretrained network, 80–81
Primal/dual interior point method (PDIPM), 159
Principal component analysis (PCA), 65, 81
Probabilistic neural network (PNN), 2
PSNR. See Peak-to-noise ratio (PSNR)
PSO. See Particle swarm optimization (PSO)
PTBDB. See Physikalisch-Technische Bundesanstalt Diagnosis Database (PTBDB)


Q
QRS wave, 3–4
Query-document matching, 125–126, 132
Query-document similarity measurements, 147

R
Radial basis function (RBF), 72, 164
Random access memory (RAM), 176
Random forest constrained local models (RFCLMs), 81–82
Random forest regression voting (RFRV), 81–82
RankBoost model, 130
Ranking SVM model, 130
RankNet model, 130
RBF. See Radial basis function (RBF)
RBMs. See Restricted Boltzmann machines (RBMs)
Recall (R), 91
Rectified linear units (ReLU), 68–69, 176
Recurrent neural network (RNN). See also Convolutional neural network (CNN)
  RNNs based methods, 136
Region of interest (ROI), 158
Reinforcement learning (RL), 130
Relevance learning methods, 142–146
  based on global distribution of matching strengths, 142–145
  based on local context of matched terms, 145–146
ReLU. See Rectified linear units (ReLU)
Representation learning-based methods, 131–136, 132f, 147
  CNN based methods, 134–135
  DNN-based methods, 132–134
  matching function learning and, 141–142
  RNN based methods, 136
Residual Network (ResNet), 174
Restricted Boltzmann machines (RBMs), 38–39
RFCLMs. See Random forest constrained local models (RFCLMs)
RFRV. See Random forest regression voting (RFRV)
RL. See Reinforcement learning (RL)
RNN. See Recurrent neural network (RNN)
ROI. See Region of interest (ROI)

S
SA node. See Sinoatrial node (SA node)
Scalogram images, 2

SCG. See Sensitivity theorem-based conjugate gradient (SCG)
Scintigraphy, 100
Screening, 99–100
SDM. See Sequential dependence model (SDM)
Self-learning matching function, 125–126
Semantic-based models, 128–129
Semantic-based ranking models, 125–126
Sensitivity (SEN), 15, 71
Sensitivity theorem-based conjugate gradient (SCG), 158
Sequential dependence model (SDM), 129–130
Sequential forward feature selection (SFFS), 52
SID-Termo for feature extraction, 114
Single-layer CNN
  experimental result and analysis, 25–33
    arrhythmia disease classification, 27–30, 31t
    beat-segmented and preprocessed ECG signal, 26f
    data set description, 25–27
    ROC curve for arrhythmia classification, 30f
  methodology, 24–25
    arrhythmia and myocardial infarction classification, 25t
    CNN, 24–25
    network architecture, 25
  related works, 23–24
    for cardiac disease classification, 23t
Single-layer feed-forward neural network (SLFN), 43
Single-lead ECG recordings, 2
Sinoatrial node (SA node), 3
Sinograms, 161
SLFN. See Single-layer feed-forward neural network (SLFN)
Smoothing language modeling, 129–130
sMRI. See Structural magnetic resonance imaging (sMRI)
Soft Rank model, 130
Spatial RNN model (SRNN model), 138
Specificity (SPE), 15, 71
SqueezeNet, 174, 176, 179, 181f
  architecture, 180–181
  fire module of, 181f
  hyperparameters used for training, 185t
  reconfigured SqueezeNet architecture, 184t
SRNN model. See Spatial RNN model (SRNN model)
SSIM. See Structural similarity index (SSIM)


ST database (ST-DB), 48
Structural magnetic resonance imaging (sMRI), 64
Structural similarity index (SSIM), 160, 167
  between reconstructions, 168f
Subset Ranking model, 130
Support vector machine (SVM), 1, 22–23, 65–66, 81, 101–102, 114, 173–174
  MAP model, 130
Synthesis block, 113–114

T
Term dependence-based ranking models, 125–126
Term dependency-based models, 129–130
Term frequency (TF), 125–127
Text representations, 126–127
TF. See Term frequency (TF)
ThermaCAM™ S45 model, 108
Thermal drift, 103
Thermal radiation, 102–103
Thermograms. See Thermography
Thermography, 100, 103
TN. See True negative (TN)
TP. See True positive (TP)
Transfer learning approach, 82
Transformer model, matching with, 139–141
Translation language model, 129
Trigram, 133
True negative (TN), 91
True positive (TP), 91

U
Ultrasound, 100
University College London (UCL), 67

V
Visual Geometry Group16 (VGG16), 174
Visual Geometry Group19 (VGG19), 174
Volumes of interest (VOIs), 65–66
Voxels, 65–66

W
Wavelet transform, 4
Weka, 114
Word-level similarity matrix, matching with, 137–138
World Health Organization (WHO), 99

X
XLNet, 126–127, 141