Deep Learning in Healthcare: Paradigms and Applications [1st ed. 2020] 978-3-030-32605-0, 978-3-030-32606-7

This book provides a comprehensive overview of deep learning (DL) in medical and healthcare applications, including the fundamentals of DL in healthcare, advanced DL methods and their clinical applications.


English. Pages: XIV, 218 [225]. Year: 2020.



Table of contents:
Front Matter ....Pages i-xiv
Front Matter ....Pages 1-1
Medical Image Detection Using Deep Learning (María Inmaculada García Ocaña, Karen López-Linares Román, Nerea Lete Urzelai, Miguel Ángel González Ballester, Iván Macía Oliver)....Pages 3-16
Medical Image Segmentation Using Deep Learning (Karen López-Linares Román, María Inmaculada García Ocaña, Nerea Lete Urzelai, Miguel Ángel González Ballester, Iván Macía Oliver)....Pages 17-31
Medical Image Classification Using Deep Learning (Weibin Wang, Dong Liang, Qingqing Chen, Yutaro Iwamoto, Xian-Hua Han, Qiaowei Zhang et al.)....Pages 33-51
Medical Image Enhancement Using Deep Learning (Yinhao Li, Yutaro Iwamoto, Yen-Wei Chen)....Pages 53-76
Front Matter ....Pages 77-77
Improving the Performance of Deep CNNs in Medical Image Segmentation with Limited Resources (Saeed Mohagheghi, Amir Hossein Foruzan, Yen-Wei Chen)....Pages 79-94
Deep Active Self-paced Learning for Biomedical Image Analysis (Wenzhe Wang, Ruiwei Feng, Xuechen Liu, Yifei Lu, Yanjie Wang, Ruoqian Guo et al.)....Pages 95-110
Deep Learning in Textural Medical Image Analysis (Aiga Suzuki, Hidenori Sakanashi, Shoji Kido, Hayaru Shouno)....Pages 111-126
Anatomical-Landmark-Based Deep Learning for Alzheimer’s Disease Diagnosis with Structural Magnetic Resonance Imaging (Mingxia Liu, Chunfeng Lian, Dinggang Shen)....Pages 127-147
Multi-scale Deep Convolutional Neural Networks for Emphysema Classification and Quantification (Liying Peng, Lanfen Lin, Hongjie Hu, Qiaowei Zhang, Huali Li, Qingqing Chen et al.)....Pages 149-164
Opacity Labeling of Diffuse Lung Diseases in CT Images Using Unsupervised and Semi-supervised Learning (Shingo Mabu, Shoji Kido, Yasuhi Hirano, Takashi Kuremoto)....Pages 165-179
Residual Sparse Autoencoders for Unsupervised Feature Learning and Its Application to HEp-2 Cell Staining Pattern Recognition (Xian-Hua Han, Yen-Wei Chen)....Pages 181-199
Front Matter ....Pages 201-201
Dr. Pecker: A Deep Learning-Based Computer-Aided Diagnosis System in Medical Imaging (Guohua Cheng, Linyang He)....Pages 203-216
Back Matter ....Pages 217-218


Intelligent Systems Reference Library 171

Yen-Wei Chen Lakhmi C. Jain   Editors

Deep Learning in Healthcare Paradigms and Applications

Intelligent Systems Reference Library Volume 171

Series Editors Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland Lakhmi C. Jain, Faculty of Engineering and Information Technology, Centre for Artificial Intelligence, University of Technology, Sydney, NSW, Australia; Faculty of Science, Technology and Mathematics, University of Canberra, Canberra, ACT, Australia; KES International, Shoreham-by-Sea, UK; Liverpool Hope University, Liverpool, UK

The aim of this series is to publish a Reference Library, including novel advances and developments in all aspects of Intelligent Systems in an easily accessible and well structured form. The series includes reference works, handbooks, compendia, textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains well integrated knowledge and current information in the field of Intelligent Systems. The series covers the theory, applications, and design methods of Intelligent Systems. Virtually all disciplines such as engineering, computer science, avionics, business, e-commerce, environment, healthcare, physics and life science are included. The list of topics spans all the areas of modern intelligent systems such as: Ambient intelligence, Computational intelligence, Social intelligence, Computational neuroscience, Artificial life, Virtual society, Cognitive systems, DNA and immunity-based systems, e-Learning and teaching, Human-centred computing and Machine ethics, Intelligent control, Intelligent data analysis, Knowledge-based paradigms, Knowledge management, Intelligent agents, Intelligent decision making, Intelligent network security, Interactive entertainment, Learning paradigms, Recommender systems, Robotics and Mechatronics including human-machine teaming, Self-organizing and adaptive systems, Soft computing including Neural systems, Fuzzy systems, Evolutionary computing and the Fusion of these paradigms, Perception and Vision, Web intelligence and Multimedia. ** Indexing: The books of this series are submitted to ISI Web of Science, SCOPUS, DBLP and Springerlink.

More information about this series at http://www.springer.com/series/8578

Yen-Wei Chen · Lakhmi C. Jain

Editors

Deep Learning in Healthcare Paradigms and Applications


Editors Yen-Wei Chen College of Information Science and Engineering Ritsumeikan University Kusatsu, Japan Zhejiang Laboratory Hangzhou, China

Lakhmi C. Jain Faculty of Engineering and Information Technology, Centre for Artificial Intelligence University of Technology Sydney, NSW, Australia Faculty of Science, Technology and Mathematics University of Canberra Canberra, ACT, Australia KES International Shoreham-by-Sea, UK Liverpool Hope University Liverpool, UK

ISSN 1868-4394 ISSN 1868-4408 (electronic) Intelligent Systems Reference Library ISBN 978-3-030-32605-0 ISBN 978-3-030-32606-7 (eBook) https://doi.org/10.1007/978-3-030-32606-7 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

With the rapid development of computer technologies and the explosive growth of big data, there has been growing interest in Deep Learning (DL). DL can be considered a subset of Artificial Intelligence (AI). Early AI consisted mainly of rule-based systems, whereas current AI is based on machine learning. The model (network) for a specific task (e.g., classification of focal liver lesions) is first trained using training inputs and their corresponding labels. After training, the output (label) of an unlabeled test input can be estimated by the trained model. In conventional (non-deep-learning) machine learning approaches, hand-crafted low-level or mid-level features are first extracted and then used as input to the model (classifier) for classification or other tasks. DL instead uses a neural network with many (hidden) layers between input and output, i.e., a deep structure. The main advantage of DL is that it can automatically learn data-driven (or task-specific), highly representative and hierarchical features, and that it performs feature extraction and classification within a single network. Deep learning techniques have achieved great success in numerous computer vision tasks, including image classification, image detection and image segmentation, and today play an important role in many academic and industrial areas. Recently, DL has also been widely used in medical applications, such as anatomic modelling (segmentation of anatomical structures), tumor detection, disease classification, computer-aided diagnosis and surgical planning. The aim of this book is to report the recent progress and potential futures of DL in the field of medicine and healthcare. The book consists of three parts: fundamentals of DL in healthcare, advanced DL in healthcare, and applications of DL in healthcare. The first part (fundamentals of DL) comprises four chapters. Fundamental and theoretical descriptions and current progress on DL-based medical image detection, medical image segmentation, medical image classification and medical image enhancement are summarized in Chaps. 1–4, respectively. The second part (advanced DL) comprises seven chapters, which present various new approaches aimed at solving important problems and challenges in medical and healthcare applications.


Chapter 5 focuses on improving DL-based semantic segmentation methods for 3D medical image segmentation with limited training samples. Chapter 6 proposes a novel Deep Active Self-paced Learning (DASL) strategy to reduce annotation effort and also make use of unannotated samples, based on a combination of Active Learning (AL) and Self-paced Learning (SPL) strategies. Chapter 7 proposes a novel transfer learning approach called "two-stage feature transfer learning" to enable DCNNs to extract good feature representations for visual textures; the proposed method has been applied to lung HRCT analysis. Chapter 8 proposes an anatomical-landmark-based deep learning method for Alzheimer's disease diagnosis with structural magnetic resonance imaging. Chapter 9 presents a multi-scale deep convolutional neural network for accurate classification and quantification of pulmonary emphysema in CT images. Chapter 10 focuses on opacity labeling of diffuse lung diseases in CT images using unsupervised and semi-supervised learning. Chapter 11 reports residual sparse autoencoders for unsupervised feature learning and their application to HEp-2 cell staining pattern recognition. Part III (applications of DL) comprises Chap. 12, which presents a deep learning-based computer-aided diagnosis system called Dr. Pecker, an award-winning medical image analysis software product made in China, to showcase clinical applications in reading medical imaging scans. Although the above chapters do not provide complete coverage of DL techniques for medical and healthcare applications, they give a flavor of the important issues and of the benefits of using DL techniques in medicine and healthcare. We are grateful to the authors and reviewers for their contributions. We would also like to thank Springer for their assistance during the evolution phase of the book.

Kusatsu, Japan/Hangzhou, China
Sydney, Australia/Canberra, Australia/Shoreham-by-Sea, UK/Liverpool, UK
August 12, 2019

Prof. Yen-Wei Chen Prof. Lakhmi C. Jain

Contents

Part I  Fundamentals of Deep Learning in Healthcare

1  Medical Image Detection Using Deep Learning  3
   María Inmaculada García Ocaña, Karen López-Linares Román, Nerea Lete Urzelai, Miguel Ángel González Ballester and Iván Macía Oliver
   1.1  Introduction  3
   1.2  Deep Learning Architectures for Image Detection  5
        1.2.1  Scanning-Based Systems  6
        1.2.2  End-to-End Systems  7
   1.3  Detection and Localization in Medical Applications  10
        1.3.1  Anatomical Landmark Localization  10
        1.3.2  Image Plane Detection  11
        1.3.3  Pathology Detection  11
   1.4  Conclusion  13
   References  13

2  Medical Image Segmentation Using Deep Learning  17
   Karen López-Linares Román, María Inmaculada García Ocaña, Nerea Lete Urzelai, Miguel Ángel González Ballester and Iván Macía Oliver
   2.1  Introduction  17
   2.2  Challenges and Limitations When Applying Deep Learning to Medical Image Segmentation  18
   2.3  Deep Learning Architectures for Medical Image Segmentation  20
        2.3.1  Supervised Deep Learning Architectures  20
        2.3.2  Semi-supervised Deep Learning Architectures  25
        2.3.3  From 2D to 3D Segmentation Networks  25
   2.4  Loss Functions for Medical Image Segmentation  26
   2.5  Conclusions and Future Directions  27
   References  27

3  Medical Image Classification Using Deep Learning  33
   Weibin Wang, Dong Liang, Qingqing Chen, Yutaro Iwamoto, Xian-Hua Han, Qiaowei Zhang, Hongjie Hu, Lanfen Lin and Yen-Wei Chen
   3.1  Introduction  34
        3.1.1  What Is Image Classification  34
        3.1.2  What Has Been Achieved in Image Classification Using Deep Learning  35
   3.2  Network Architecture  36
        3.2.1  Convolution Layer  36
        3.2.2  Pooling Layer  37
        3.2.3  Fully Connected Layer  38
        3.2.4  Loss Function  38
        3.2.5  AlexNet  39
        3.2.6  ResNet  40
   3.3  Training  42
        3.3.1  Training from Scratch  42
        3.3.2  Transfer Learning from a Pre-trained Network  42
        3.3.3  Fine-Tuning  42
   3.4  Application to Classification of Focal Liver Lesions  43
        3.4.1  Focal Liver Lesions and Multi-phase CT Images  43
        3.4.2  Multi-channel CNN for Classification of Focal Liver Lesions on Multi-phase CT Images  44
        3.4.3  Experimental Results  47
   3.5  Conclusion  49
   References  50

4  Medical Image Enhancement Using Deep Learning  53
   Yinhao Li, Yutaro Iwamoto and Yen-Wei Chen
   4.1  Introduction  53
   4.2  Network Architecture  54
        4.2.1  Convolution Layer  55
        4.2.2  Deconvolution Layer  56
        4.2.3  Loss Layer  56
        4.2.4  Evaluation Functions  57
   4.3  Medical Image Enhancement by 2D Super-Resolution  58
        4.3.1  Super-Resolution Convolutional Neural Network (SRCNN)  59
        4.3.2  Very Deep Super-Resolution Network (VDSR)  61
        4.3.3  Efficient Sub-pixel Convolutional Neural Network (ESPCN)  62
        4.3.4  Dense Skip Connections Based Convolutional Neural Network for Super-Resolution (SRDenseNet)  63
        4.3.5  Residual Dense Network for Image Super-Resolution (RDN)  65
        4.3.6  Experimental Results of 2D Image Super-Resolution  66
   4.4  Medical Image Enhancement by 3D Super-Resolution  67
        4.4.1  3D Convolutional Neural Network for Super-Resolution (3D-SRCNN)  67
        4.4.2  3D Deeply Connected Super-Resolution Network (3D-DCSRN)  69
        4.4.3  Super-Resolution Using a Generative Adversarial Network and 3D Multi-level Densely Connected Network (mDCSRN-GAN)  70
        4.4.4  Experimental Results of 3D Image Super-Resolution  72
   4.5  Conclusion  74
   References  74

Part II  Advanced Deep Learning in Healthcare

5  Improving the Performance of Deep CNNs in Medical Image Segmentation with Limited Resources  79
   Saeed Mohagheghi, Amir Hossein Foruzan and Yen-Wei Chen
   5.1  Introduction  80
   5.2  Deep Convolutional Neural Networks  80
   5.3  Medical Image Segmentation Using CNNs  82
        5.3.1  CNN Training  83
        5.3.2  Challenges of CNNs in Medical Image Segmentation  83
   5.4  Materials and Methods  84
        5.4.1  Experimental Setup  84
        5.4.2  Basic Model  84
   5.5  Model Optimization for 3D Liver Segmentation  86
        5.5.1  Model's Architecture  87
        5.5.2  Optimization Algorithm  87
        5.5.3  Activation Function  88
        5.5.4  Complexity of the Model  89
        5.5.5  Parameter Tuning and Final Results  90
   5.6  Discussion and Conclusion  90
   References  93

6  Deep Active Self-paced Learning for Biomedical Image Analysis  95
   Wenzhe Wang, Ruiwei Feng, Xuechen Liu, Yifei Lu, Yanjie Wang, Ruoqian Guo, Zhiwen Lin, Tingting Chen, Danny Z. Chen and Jian Wu
   6.1  Introduction  96
   6.2  The Deep Active Self-paced Learning Strategy  98
   6.3  DASL for Pulmonary Nodule Segmentation  100
        6.3.1  Nodule R-CNN  101
        6.3.2  Experiments of Pulmonary Nodule Segmentation  102
   6.4  DASL for Diabetic Retinopathy Identification  103
        6.4.1  Center-Sample Detector  104
        6.4.2  Attention Fusion Network  106
   6.5  Conclusions  108
   References  109

7  Deep Learning in Textural Medical Image Analysis  111
   Aiga Suzuki, Hidenori Sakanashi, Shoji Kido and Hayaru Shouno
   7.1  Introduction  112
   7.2  Method  113
        7.2.1  Deep Convolutional Neural Networks (DCNNs)  113
        7.2.2  Transfer Learning in DCNNs  114
        7.2.3  Two-Stage Transfer Learning  115
   7.3  Application of Two-Stage Feature Transfer for Lung HRCT Analysis  116
        7.3.1  Materials  116
        7.3.2  Experimental Details  117
        7.3.3  Experimental Results  119
   7.4  How Does Transfer Learning Work?  119
        7.4.1  Qualitative Analysis by Visualizing Feature Representations  120
        7.4.2  Numerical Analysis: Frequency Response of Feature Extraction Part  122
   7.5  Conclusion  125
   References  125

8  Anatomical-Landmark-Based Deep Learning for Alzheimer's Disease Diagnosis with Structural Magnetic Resonance Imaging  127
   Mingxia Liu, Chunfeng Lian and Dinggang Shen
   8.1  Introduction  127
   8.2  Materials and Image Pre-Processing  129
   8.3  Anatomical Landmark Discovery for Brain sMRIs  130
        8.3.1  Generation of Voxel-Wise Correspondence  131
        8.3.2  Voxel-Wise Comparison Between Different Groups  132
        8.3.3  Definition of AD-Related Anatomical Landmarks  132
        8.3.4  Landmark Detection for Unseen Testing Subject  133
   8.4  Landmark-Based Deep Network for Disease Diagnosis  134
        8.4.1  Landmark-Based Patch Extraction  135
        8.4.2  Multi-instance Convolutional Neural Network  136
   8.5  Experiments  138
        8.5.1  Methods for Comparison  138
        8.5.2  Experimental Settings  139
        8.5.3  Results of AD Diagnosis  139
        8.5.4  Results of MCI Conversion Prediction  141
   8.6  Discussion  142
        8.6.1  Influence of Parameters  142
        8.6.2  Limitations and Future Research Direction  143
   8.7  Conclusion  144
   References  144

9  Multi-scale Deep Convolutional Neural Networks for Emphysema Classification and Quantification  149
   Liying Peng, Lanfen Lin, Hongjie Hu, Qiaowei Zhang, Huali Li, Qingqing Chen, Dan Wang, Xian-Hua Han, Yutaro Iwamoto, Yen-Wei Chen, Ruofeng Tong and Jian Wu
   9.1  Introduction  150
   9.2  Methods  152
        9.2.1  Patch Preparation  153
        9.2.2  Single Scale Architecture  154
        9.2.3  Multi-scale Architecture  154
   9.3  Experiments  155
        9.3.1  Dataset  155
        9.3.2  Evaluation of Classification Accuracy  157
        9.3.3  Emphysema Quantification  159
   9.4  Conclusion  161
   References  162

10  Opacity Labeling of Diffuse Lung Diseases in CT Images Using Unsupervised and Semi-supervised Learning  165
    Shingo Mabu, Shoji Kido, Yasuhi Hirano and Takashi Kuremoto
    10.1  Introduction  165
    10.2  Materials and Methods  167
          10.2.1  Unsupervised Learning  167
          10.2.2  Semi-supervised Learning  171
    10.3  Results  172
          10.3.1  Results of Unsupervised Learning  172
          10.3.2  Results of Iterative Semi-supervised Learning  174
    10.4  Conclusions  178
    References  179

11  Residual Sparse Autoencoders for Unsupervised Feature Learning and Its Application to HEp-2 Cell Staining Pattern Recognition  181
    Xian-Hua Han and Yen-Wei Chen
    11.1  Introduction  182
    11.2  Related Work  183
    11.3  Autoencoder and Its Extension: Sparse Autoencoder  186
    11.4  Residual SAE for Self-taught Learning  188
    11.5  The Aggregated Activation of the Residual SAE for Image Representation  189
    11.6  Experiments  192
          11.6.1  Medical Context  192
          11.6.2  Experimental Results  194
    11.7  Conclusion  196
    References  197

Part III  Application of Deep Learning in Healthcare

12  Dr. Pecker: A Deep Learning-Based Computer-Aided Diagnosis System in Medical Imaging  203
    Guohua Cheng and Linyang He
    12.1  Introduction  203
    12.2  System Overview  204
          12.2.1  Seamless Integration with Hospital IT  205
          12.2.2  Learning on Users' Feedback  205
          12.2.3  Deep Learning Cluster  205
          12.2.4  Dr.  206
    12.3  Clinical Applications  208
          12.3.1  Disease Screening: Ophthalmic Screening as an Example  208
          12.3.2  Lesion Detection and Segmentation  211
          12.3.3  Diagnosis and Risk Prediction  214
    12.4  Conclusion  214
    References  216

Author Index  217

About the Editors

Prof. Yen-Wei Chen received the B.E. degree in 1985 from Kobe Univ., Kobe, Japan, the M.E. degree in 1987, and the D.E. degree in 1990, both from Osaka Univ., Osaka, Japan. He was a research fellow with the Institute for Laser Technology, Osaka, from 1991 to 1994. From Oct. 1994 to Mar. 2004, he was an associate Professor and a professor with the Department of Electrical and Electronic Engineering, Univ. of the Ryukyus, Okinawa, Japan. He is currently a professor with the college of Information Science and Engineering, Ritsumeikan University, Japan. He is also a visiting professor with the College of Computer Science, Zhejiang University and Zhejiang Lab, Hangzhou, China. His research interests include medical image analysis, computer vision and computational intelligence. He has published more than 300 research papers in a number of leading journals and leading conferences including IEEE Trans. Image Processing, IEEE Trans. Cybernetics, Pattern Recognition. He has received many distinguished awards including ICPR2012 Best Scientific Paper Award, 2014 JAMIT Best Paper Award, Outstanding Chinese Oversea Scholar Fund of Chinese Academy of Science. He is/was a leader of numerous national and industrial research projects.


Prof. Lakhmi C. Jain, Ph.D., M.E., B.E. (Hons), Fellow (Engineers Australia), is with the University of Technology Sydney, Australia, and Liverpool Hope University, UK. Professor Jain founded KES International to provide the professional community with opportunities for publication, knowledge exchange, cooperation and teaming. Involving around 5,000 researchers drawn from universities and companies worldwide, KES facilitates international cooperation and generates synergy in teaching and research. KES regularly provides networking opportunities for the professional community through one of the largest conferences of its kind. www.kesinternational.org

Part I

Fundamentals of Deep Learning in Healthcare

Chapter 1

Medical Image Detection Using Deep Learning María Inmaculada García Ocaña, Karen López-Linares Román, Nerea Lete Urzelai, Miguel Ángel González Ballester and Iván Macía Oliver

Abstract This chapter provides an introduction to deep learning-based systems for object detection and their applications in medical image analysis. First, common deep learning architectures for image detection are briefly explained, including scanningbased methods and end-to-end detection systems. Some considerations about the training scheme and loss functions are also included. Then, an overview of relevant publications in anatomical and pathological structure detection and landmark detection using convolutional neural networks is provided. Finally, some concluding remarks and future directions are presented.

1.1 Introduction

Accurate and fast detection of anatomical or pathological structures or landmarks is essential in the medical field for a wide variety of tasks. For instance, the localization of anatomical landmarks is necessary to guide image registration or to initialize the volumetric segmentation of organs. Lesion detection is a crucial step towards the development of Computer Aided Detection and Diagnosis (CAD) systems, which have become increasingly popular in the last decades.


Fig. 1.1 Differences between classification, localization, detection and segmentation

Besides, detection algorithms are also valuable to facilitate structure tracking during interventions or to localize relevant image planes within whole medical image volumes. Object detection algorithms differ from classification algorithms in that they not only recognize which objects or structures are present in the image, but also provide their exact location within the image by outputting their bounding boxes. Localization and detection are similar tasks: localization algorithms usually identify one single object in the image, while object detection algorithms are able to find the presence and location of several objects present in the image (as shown in Fig. 1.1). Therefore, a detection algorithm will output a bounding box for each object present in the image and, associated with each bounding box, the type of object that it contains (often a value indicating the probability that the object belongs to that class). This chapter will focus mainly on the object detection task, though some examples of landmark localization in the medical domain are also provided. Generally, algorithms for object detection in computer vision follow two steps: (1) region proposal, which refers to the extraction of several patches from the image to find potential patches containing the object of interest; the whole image can be scanned and divided into patches with a sliding-window approach, or specialized region proposal algorithms can be used to find the regions that are most likely to contain certain objects; and (2) classification of the extracted patches to output bounding boxes with a certain probability of containing an object. These types of algorithms are known as scanning-based systems. Traditionally, feature extraction algorithms (for instance, SIFT [1], HOG [2], LBP [3], Haar wavelets [4] or the Hough transform [5]) are used to characterize image patches, and these features are fed to classifiers such as support vector machines (SVMs) or random forests. This approach has been used in several object detection problems in the medical domain [6–8]. The success of convolutional neural networks (CNNs) for image classification tasks [9] motivated the use of deep learning for image detection, exploiting features extracted by CNNs instead of sets of hand-crafted features.
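To make the classical two-step pipeline concrete, the sketch below generates sliding-window patches and scores each one with a generic feature extractor and classifier. Here, feature_fn and classifier stand in for whatever hand-crafted descriptor (e.g., HOG) and trained model (e.g., an SVM) a particular system uses, and the window size, stride and threshold are arbitrary illustrative choices, not values taken from any of the cited works.

```python
import numpy as np

def sliding_window_patches(image, patch_size=64, stride=32):
    """Yield (x, y, patch) for every window position; a naive region-proposal step."""
    h, w = image.shape[:2]
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            yield x, y, image[y:y + patch_size, x:x + patch_size]

def detect(image, feature_fn, classifier, threshold=0.5):
    """Score every patch and keep the windows the classifier considers positive."""
    detections = []
    for x, y, patch in sliding_window_patches(image):
        score = classifier(feature_fn(patch))  # e.g., HOG descriptor fed to an SVM score
        if score > threshold:
            detections.append((x, y, score))
    return detections
```

In practice the kept windows would still be merged with a non-maximum suppression step, discussed later in this chapter.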


In 2014, Girshick et al. [10] proposed the R-CNN (Regions with CNN features). For region proposal they used a popular algorithm known as selective search [11]. The extracted image patches are then fed to a CNN, AlexNet [9], to extract features, and finally an SVM is used for classification. Since then, many other CNN-based models for image detection have been designed and evaluated. State-of-the-art deep learning-based methods for object detection eliminate the region proposal step or extract region proposals directly from the feature map instead of the image, improving the speed and outperforming the results of traditional object detection algorithms. However, compared to the computer vision domain, the detection task in medical imaging needs to deal with some domain-specific challenges, such as the lack of large databases with annotated data. This requires researchers working in the medical field to modify or develop detection algorithms particularly adapted to this field. Deep learning-based object detectors have been used for a wide range of pathologies, for instance, breast cancer, prostate cancer and retinopathy, as well as for the localization of landmarks and anatomical structures, which can be used as a guide for image registration or segmentation. In this chapter, an overview of the different strategies for object and landmark localization and detection in medical images is provided. Before focusing on the medical applications, common architectures for image detection using deep learning are summarized in Sect. 1.2. Section 1.3 covers the medical applications: anatomical landmark localization (Sect. 1.3.1), image plane detection (Sect. 1.3.2) and pathology detection (Sect. 1.3.3). Finally, Sect. 1.4 summarizes the concluding remarks.

1.2 Deep Learning Architectures for Image Detection

Scanning-based systems are the most common approach to object detection. They consist of a region proposal phase followed by a classification step, tackling the detection task as a patch-wise classification problem. The first CNN-based approaches for object detection were based on this scheme, introducing the CNN either for feature extraction or for patch classification (see Fig. 1.2).

Fig. 1.2 R-CNN system overview [10]. Region proposals are extracted from the input image using selective search, then features are computed by a CNN and used to classify regions with an SVM classifier


In some cases, a CNN is used both for region proposal and classification, while other authors use different computer vision techniques to generate region proposals and CNNs are only employed for the classification step. In these systems, each module has to be trained separately (one for region generation and one for classification). On the other hand, recent object detection works propose end-to-end systems with a direct mapping between the input image and the output predictions. These architectures can be optimized end-to-end and use features from the whole image to predict the bounding boxes, instead of only considering independent patches. In order to predict the bounding boxes, CNNs not only have to perform a classification task but also a regression of the bounding box coordinates. Usually, this is incorporated in a multi-task loss that combines a classification loss and a regression loss, as sketched below.
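The following is a minimal NumPy sketch of such a multi-task objective, with a two-class (object/background) cross-entropy term and a smooth L1 box-regression term. The weighting factor and the convention that the regression term only applies to positive samples follow the general recipe described in the text, but the exact formulation differs from network to network.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """logits: (N, 2) objectness scores; labels: (N,) integers in {0, 1}."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def smooth_l1(pred, target):
    """Smooth L1 (Huber-like) loss, averaged over all box coordinates."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

def detection_loss(cls_logits, labels, box_pred, box_target, lam=1.0):
    """Multi-task loss: classification over all samples, regression only on positives."""
    cls_loss = softmax_cross_entropy(cls_logits, labels)
    pos = labels == 1
    reg_loss = smooth_l1(box_pred[pos], box_target[pos]) if pos.any() else 0.0
    return cls_loss + lam * reg_loss
```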

1.2.1 Scanning-Based Systems

These systems rely on a region proposal step to extract image patches and then classify those patches according to the object that they contain. CNNs can be used for both region extraction and classification (as in Faster R-CNN, which will be explained later), or only for one of the steps involved. Motivated by the success of the AlexNet network [9] for classification tasks, Girshick et al. proposed the R-CNN [10] object detection system. They keep the traditional scheme of region proposal, feature extraction and classification and only introduce the CNN for the feature extraction step. They used a popular region proposal algorithm, selective search [11], to generate different regions of interest, and fed these patches to the CNN to extract features. The features computed by the network are then used to classify the image patches with a linear SVM, and a greedy non-maximum suppression algorithm is applied to select the final bounding boxes. While R-CNN obtained good results, its main drawback is the computation time: it performs a ConvNet forward pass for each object proposal, without sharing computation. Thus, in [12] Fast R-CNN was designed to reduce this computational burden. Fast R-CNN only requires one main CNN to process the entire image, but it still relies on the selective search algorithm to generate region proposals. These region proposals are then fed into the network, which in this case takes the whole image together with the patches as input, and directly outputs the probability estimates. Faster R-CNN [13] included a Region Proposal Network (RPN) to directly generate region proposals and predict bounding boxes, without the need to use selective search or other algorithms. The RPN takes an image as input and outputs a set of rectangular objects with an objectness score. This is done by sliding a small network over the convolutional feature map and generating a lower-dimensional vector, which is fed into two fully connected layers: a box-regression layer and a box-classification layer. The proposals are parametrized relative to k reference boxes, such that each is centered at the sliding window and associated with a scale and aspect ratio. At each sliding-window location, k region proposals are predicted simultaneously, so the regression layer encodes the coordinates of the k boxes and the classification layer outputs 2k scores, i.e., the probability of object or non-object for each proposal (as depicted in Fig. 1.3).


Fig. 1.3 Generation of region proposals in Faster R-CNN [13]. At each sliding-window location, k region proposals are generated, so the regression layer has 4k outputs encoding the coordinates of the k boxes and the classification layer outputs 2k scores (it is a two-class softmax layer)
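The anchor parametrization can be illustrated with the following sketch, which enumerates k = len(scales) x len(ratios) reference boxes centred at every feature-map location. The stride, scales and aspect ratios used here are illustrative assumptions, not the configuration of the original Faster R-CNN.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return a (feat_h * feat_w * k, 4) array of boxes as (x1, y1, x2, y2)."""
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            cx, cy = (fx + 0.5) * stride, (fy + 0.5) * stride  # centre in image coordinates
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)      # same area, varying aspect ratio
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# For each anchor the RPN predicts 4 box offsets and 2 objectness scores,
# hence the 4k regression outputs and 2k scores per location mentioned above.
anchors = generate_anchors(feat_h=32, feat_w=32)
print(anchors.shape)  # (32 * 32 * 9, 4)
```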

The resulting model is a combination of Fast R-CNN and the RPN, where the RPN is trained end-to-end sharing convolutional features with Fast R-CNN, therefore reducing the computational cost at test time. The training scheme has to alternate between fine-tuning for the region proposal task and fine-tuning for object detection while keeping the proposals fixed. A similar approach is found in the Region-based Fully Convolutional Network (R-FCN) [14], which keeps the two-stage object detection strategy but uses ResNet [15] as a backbone architecture instead of VGG [16], with all learnable layers being convolutional and shared for the entire image. In the medical domain, having a first and separate region proposal step makes it possible to use region proposal algorithms specifically designed for a particular task or image modality, and then train the CNN only for the classification step. For instance, Savardi et al. [17] used a region proposal algorithm that exploited their knowledge about the physical effects that hemolysis produces on the blood film, causing light variations, to extract patches that are more likely to correspond to hemolytic regions. Setio et al. [18] used candidate detectors specifically designed for solid, subsolid and large pulmonary nodules, and Teramoto et al. [19] detected pulmonary nodules combining PET and CT by using region proposal algorithms specific to each image type.

1.2.2 End-to-End Systems

End-to-end learning refers to training a possibly complex learning system by applying gradient-based learning to the system as a whole. These approaches use a single model that allows complete back-propagation for training and inference, resulting in systems that are trained end-to-end and directly map the input image to the output.


Fig. 1.4 YOLO model [20]. The image is divided into an S × S grid. Each cell predicts B bounding boxes, the confidence score and class probabilities

Contrary to scanning-based methods, end-to-end systems do not rely on a previous object proposal. Redmon et al. [20] reframed object detection as a single regression problem, directly predicting bounding boxes and class probabilities from images. They unify the components of object detection into a single neural network, called YOLO (You Only Look Once). The input image is divided into a grid, and each grid cell predicts bounding boxes and confidence scores for these boxes (see Fig. 1.4). A large number of bounding boxes is predicted, therefore a non-maximum suppression step has to be applied at the end of the network to merge highly overlapping bounding boxes of the same object. This system is fast and allows real-time detection. Liu et al. [21] proposed another one-shot detector, the Single Shot MultiBox Detector (SSD), which also eliminates the need for proposal generation and encapsulates all computation in a single network. Regarding landmark detection, encoder-decoder architectures, which are popular for image segmentation, can also be used with small modifications. In encoder-decoder systems, the encoder network maps the input image to a feature representation, and the decoder network takes this feature representation as input, produces an output and maps it back into pixels. To adapt this approach to landmark detection and localization, landmark localization is treated as a pixel-wise heat-map regression problem, where heat maps for training are created by applying Gaussian functions at the true key-point positions. A popular network for medical image segmentation, the U-Net [22], has been widely applied with small modifications for landmark localization in medical images [23–25], as well as other encoder-decoder architectures [26].
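Since both R-CNN-style detectors and YOLO end with a non-maximum suppression step, a minimal greedy version of it is sketched below, assuming boxes are given as (x1, y1, x2, y2) arrays with one confidence score each; the IoU threshold is an arbitrary example value.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box[None]) + area(boxes) - inter)

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box and drop boxes overlapping it too much."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        remaining = order[1:]
        overlaps = iou(boxes[best], boxes[remaining])
        order = remaining[overlaps < iou_threshold]
    return keep
```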


Fig. 1.5 Example of how to compute the IoU, and IoU values for different bounding boxes (bottom)

Generating annotated datasets is costly and especially difficult in the medical domain, as expert knowledge is usually required. This has motivated some authors to try weakly-supervised approaches [27–29] for object detection. They only need image-level labels, and find the location of the landmarks or objects using the attention maps generated by the network. When the network learns to distinguish images with different labels, the discriminating patterns of lesions are automatically learned and these features can be used to estimate the location of the lesion.
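One common way to turn such image-level training into coarse localization is the class-activation-map idea: weight the last convolutional feature maps by the classifier weights of the predicted class. The sketch below assumes a network that ends in global average pooling followed by a single fully connected layer, which is not necessarily the exact architecture used in the cited works.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) activations of the last conv layer for one image.
    fc_weights:   (num_classes, C) weights of the final fully connected layer.
    Returns an (H, W) map highlighting regions that drove the class prediction."""
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                 # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                  # normalize to [0, 1] for thresholding
    return cam

# A coarse lesion location can then be estimated, e.g., as the peak of the map:
# y, x = np.unravel_index(np.argmax(cam), cam.shape)
```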

Training Convolutional Network Detectors

The choice of a proper loss function to be optimized and the selection of the best training strategy are very important factors for the network to converge, and they can have a great impact on performance and on training time. Some of the previously described approaches require a complex training strategy; for example, Faster R-CNN [13] has to alternate fine-tuning for the RPN and fine-tuning for object detection, while other networks can be trained directly end-to-end. The error metrics or loss functions to be optimized are different for each approach as well, but in general they have to be designed to quantify both the classification error and the localization error. Intersection over Union (IoU) is the metric typically used to assess the results of object detection. It is a value between 0 and 1 that measures the overlap between the predicted box and the ground-truth box as the area of their intersection divided by the area of their union (Fig. 1.5). Detection networks not only have to predict the bounding box but also classify the object contained in that box. Therefore, multi-task losses that combine metrics for classification with metrics for localization, taking into account the IoU, are proposed. For the RPN [13], the classification loss is the logarithmic loss over two classes (object or not) and the regression loss is a smooth L1 loss over the parametrized coordinates of the bounding box.


The regression loss is only activated for positive region proposals, that is, regions with the highest IoU with a ground-truth box or with an IoU higher than 0.7. In YOLO [20] the authors optimized a sum-squared error, weighting the localization error and the classification error. Each grid cell predicts several bounding boxes (see Fig. 1.4), and a bounding box is considered responsible for detecting a ground-truth object if it has the highest IoU of any predictor in that grid cell. Other hybrid loss functions can be defined; e.g., [30] used a logarithmic loss for classification and an L1 loss to take the location information into account. Another important consideration for training object detectors is dealing with strong class imbalance. When patches or region proposals are generated, the number of proposed regions is always much higher than the number of objects; therefore, there are many more negative matches than positives. Strategies for hard negative sampling have been incorporated into the training scheme of CNNs to deal with this problem. R-CNN [10] incorporates the traditional hard negative sampling used for SVMs [31], where the model is trained with an initial subset of negative samples and then the negative examples incorrectly classified by the initial model are used to form a new set of negative examples. An efficient way to do hard negative mining in region-based CNN detectors is online hard example mining (OHEM) [32]. It consists of computing the loss for all region proposals and selecting the ones with the highest loss; only this small number of RoIs is used to update the model. This is done for each mini-batch (each SGD iteration). The OHEM training scheme can be used with different architectures, such as Faster R-CNN, R-FCN and SSD. Finally, data augmentation strategies are also employed in detection to tackle the lack of large image databases for training. Different transformations, such as intensity scaling, elastic deformations, rotations or translations [18, 23, 25, 33], are applied to the images in order to generate new samples to feed the network.
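A minimal sketch of the OHEM-style selection step, assuming the per-proposal losses of the current mini-batch have already been computed, is shown below; the number of kept RoIs is an arbitrary example value.

```python
import numpy as np

def select_hard_examples(per_roi_loss, num_keep=128):
    """Return the indices of the RoIs with the highest loss in this mini-batch;
    only these are used when back-propagating, as in OHEM-style training."""
    order = np.argsort(per_roi_loss)[::-1]   # hardest (largest loss) first
    return order[:num_keep]

# Example: losses for 2000 region proposals, keep only the 128 hardest ones.
losses = np.random.rand(2000)
hard_idx = select_hard_examples(losses, num_keep=128)
```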

1.3 Detection and Localization in Medical Applications

1.3.1 Anatomical Landmark Localization

While most of the literature focuses on detecting pathologies, anatomical landmark detection is also important for many medical image analysis tasks such as landmark-based registration, initialization of image segmentation algorithms and the extraction of clinically relevant planes from 3D volumes. Landmark detection is usually done under the end-to-end scheme, making use of well-known encoder-decoder type segmentation architectures. Payer et al. [23] used a CNN framework (four different architectures were compared, including the U-Net and a newly proposed Spatial-Configuration Net) to extract anatomical landmarks from hand X-rays and hand MRIs. They directly trained the CNNs in an end-to-end manner to regress heat maps for the landmarks; 37 anatomical landmarks were detected in X-ray images and 28 in MRI. Mader et al. [24] followed a similar approach using a U-Net and a conditional random field (CRF).


They labeled 16 points for each rib in chest X-ray images; the U-Net was used to generate localization hypotheses that were later refined using the CRF to assess spatial information. Meyer et al. [25] also used an encoder-decoder architecture, based on the U-Net, to regress the distance from each image location to the landmarks of interest, the retinal optic disc and the fovea. This way they were able to jointly detect both structures. Another approach to landmark detection consists of applying patch-based methods. Cai et al. [34] fused image features from different modalities, namely MR and CT, to improve the recognition and localization of vertebrae. They used a CNN to combine the different modalities and extract features from image patches, and fed them to an SVM classifier. Li et al. [35] proposed the Patch-based Iterative Network (PIN) for the detection of 10 anatomical landmarks from fetal head ultrasound. Zheng et al. [36] followed a two-step approach with a shallow network to generate region candidates and a CNN for the classification, in the context of carotid artery bifurcation localization in neck CT.
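The heat-map regression formulation used in several of these works can be sketched as follows: each annotated landmark is turned into a training target by placing a Gaussian at its position, and the network regresses one such map per landmark. The image size, landmark coordinates and standard deviation below are illustrative choices.

```python
import numpy as np

def landmark_heatmap(shape, landmark_xy, sigma=5.0):
    """Create an (H, W) target map with a Gaussian centred on one landmark."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    lx, ly = landmark_xy
    return np.exp(-((xs - lx) ** 2 + (ys - ly) ** 2) / (2.0 * sigma ** 2))

# One channel per landmark; at test time the predicted landmark is taken as
# the arg-max of the corresponding regressed channel.
targets = np.stack([landmark_heatmap((256, 256), (x, y)) for (x, y) in [(40, 60), (200, 128)]])
print(targets.shape)  # (2, 256, 256)
```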

1.3.2 Image Plane Detection

Detecting a certain image plane within a whole medical volume is an important task, which can save clinicians long searching times, and for which several solutions have been proposed. Chen et al. used CNNs to localize the standard plane in fetal ultrasound [37]. They used transfer learning to reduce the over-fitting caused by the small amount of training samples. Baumgartner et al. [38] also proposed a system for fetal standard scan plane detection. They used VGG16 [39] as the backbone architecture to design SonoNet, a network that can detect 13 fetal standard views and provide a localization of the fetal structures via a bounding box, using weak supervision based only on image-level labels [38]. Kumar et al. addressed the same problem using saliency maps and CNNs [40].

1.3.3 Pathology Detection

Medical images are commonly acquired for diagnostic procedures, so identifying the presence of a pathology is a very important task in medical image analysis. The localization and classification of cancer lesions, which is usually challenging as benign and malignant tumours can have a similar appearance, is one of the key applications of object detection in the medical domain. The creation of a challenge for pulmonary nodule detection in CT images, LUNA16 [41], facilitated work in this field. The detection of these nodules is essential to diagnose pulmonary cancer; however, it can be challenging due to the high variability of their shape, size and texture.


Setio et al. [18] proposed a 2D approach that consists of density thresholding, followed by morphological opening, to get the candidates; they extracted a set of 2D patches from differently oriented planes for each candidate and fed them into a 2D convolutional network. Ding et al. [42] used a 2D Faster R-CNN [13] and a subsequent 3D CNN for false-positive reduction. Dou et al. [30] used a 3D fully convolutional network and adopted an online sample filtering strategy to increase the proportion of hard training samples, in order to improve the accuracy and deal with the imbalance between hard and easy samples. Zhu et al. [43] used a Faster R-CNN, as Ding et al. [42] did, but their approach is fully 3D, with a 3D Faster R-CNN to generate candidate nodules and a U-Net-like 3D encoder-decoder architecture to learn features. Teramoto et al. [19] incorporated information from PET in addition to CT, identifying candidates separately on the PET and CT images, and then combining the candidate regions obtained from the two modalities. The combination of several imaging modalities or sequences is relevant for the detection of many pathologies. Multiparametric MRI findings have shown a high correlation with histopathological examinations in prostate cancer, and the information provided by the different MRI sequences can be crucial for assessing the malignancy of a detected lesion. Kiraly et al. [26] used multi-channel image-to-image encoder-decoders with Gaussian kernels located at the key points and different output channels to represent the different tumour classes. Yang et al. [33] used a network trained in a weakly-supervised manner, reducing the cost of generating annotations. They modified GoogLeNet to generate cancer response maps, model multiple classes and fuse multimodal information from ADC and T2w images. CNNs have been applied to breast cancer detection as well. Platania et al. [44] adapted the YOLO model for mammographic images. Similarly, Al-masni et al. [45] proposed a CAD system, also based on YOLO, for simultaneous breast mass detection and classification in digital mammography. Kooi et al. [46] also worked with mammograms, but they used a scaled-down version of the VGG model instead. Li et al. [35] worked with histological images to diagnose breast cancer. They used a model based on Faster R-CNN to detect mitosis, incorporating a deep verification model based on ResNet [15] to improve the accuracy. Cancer is not the only pathology where CNNs have been used. Several authors have addressed the problem of diabetic retinopathy. Dai et al. [14] used an innovative approach, the combination of clinical reports and images, to identify potential microaneurysms in retinal images. Wang et al. [28] designed Zoom-Net, an architecture that tries to mimic the zoom-in process of clinicians when they examine retinal images. Yang et al. [29] used a two-stage approach with two CNNs to not only detect the lesions but also grade the level of diabetic retinopathy. An example of using detection networks to later initialize a segmentation can be found in [47]. They used a detection network to find the region of interest that contains the thrombus in CTA images, to later perform the segmentation within the extracted region. Other clinical applications include brain lesion detection from MRI [27, 48], beta-hemolysis from histological images [17], detection of multiple sclerosis lesions in MRI [49], myocardial infarction areas [50] and intracranial hemorrhages in brain CT [51].
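The multi-view idea behind such candidate classification, i.e., sampling 2D patches in differently oriented planes around each candidate, can be sketched for the three orthogonal planes of a CT volume. The patch size is an arbitrary choice, boundary handling is omitted, and the cited work actually uses a richer set of oriented planes than the three shown here.

```python
import numpy as np

def orthogonal_patches(volume, center, size=32):
    """Extract axial, coronal and sagittal (size x size) patches around a candidate voxel."""
    z, y, x = center
    r = size // 2
    axial    = volume[z, y - r:y + r, x - r:x + r]
    coronal  = volume[z - r:z + r, y, x - r:x + r]
    sagittal = volume[z - r:z + r, y - r:y + r, x]
    return np.stack([axial, coronal, sagittal])  # (3, size, size), one channel per view

# Each view can then be fed to a 2D CNN and the per-view outputs fused.
ct = np.zeros((128, 256, 256), dtype=np.float32)
views = orthogonal_patches(ct, center=(64, 128, 128))
print(views.shape)  # (3, 32, 32)
```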


1.4 Conclusion

Object detection is an important processing task for many medical applications, especially for lesion detection. Deep learning allows for an automatic localization of suspicious masses in several imaging modalities such as CT, MRI or US, and sometimes even for the classification of the lesions as benign or malignant, helping radiologists and providing valuable input to computer-aided detection systems. Another relevant application of convolutional neural network detection systems is the automatic localization of the plane of interest, which can save practitioners a lot of time when trying to find significant structures within whole volumes. Furthermore, localization and detection of anatomical landmarks can assist the initialization of other image processing algorithms, such as registration or segmentation. There are different approaches to object detection that can be applied to medical image processing. Scanning-based systems rely on a region proposal step to generate patches that are later classified according to the object they contain, whereas more recent systems directly generate bounding boxes from the whole input image, improving the accuracy and allowing real-time detection. However, training convolutional neural networks for medical image detection still faces important challenges. The main limitation is the lack of large public databases that can be used for training or for transfer learning. Furthermore, when the objective is to detect pathological structures, there is a class imbalance problem, as there is usually more data from healthy patients than from a specific pathology. Data augmentation strategies are commonly used to alleviate this problem, as well as hard example mining strategies. Some authors have tried weakly supervised approaches, reducing the cost of generating annotated databases. More effort needs to be put into the creation of accessible databases and into developing training strategies that allow for the use of weakly annotated data, noisy annotations and unsupervised learning.

References
1. Lowe, David G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004) 2. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection, vol. 1, pp. 886–893. IEEE, USA (2005) 3. Ojala, T., Pietikainen, M., Harwood, D.: Performance Evaluation of Texture Measures with Classification Based on Kullback Discrimination of Distributions, vol. 1, pp. 582–585. IEEE Comput. Soc. Press, Jerusalem, Israel (1994) 4. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features, vol. 1, pp. I–511; I–518. IEEE Comput. Soc., USA (2001) 5. Duda, Richard O., Hart, Peter E.: Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM 15(1), 11–15 (1972) 6. Zuluaga, M.A., Magnin, I.E., Hoyos, M.H., Delgado Leyton, E.J.F., Lozano, F., Orkisz, M.: Automatic detection of abnormal vascular cross-sections based on density level detection and support vector machines. Int. J. Comput. Assist. Radiol. Surg. 6(2), 163–174 (2011)


7. Donner, R., Birngruber, E., Steiner, H., Bischof, H., Langs, G.: Localization of 3d Anatomical Structures Using Random Forests and Discrete Optimization, vol. 6533, pp. 86–95. Springer, Berlin (2011) 8. Zuluaga, M.A., Delgado Leyton, E.J.F., Hoyos, M.H., Orkisz, M.: Feature Selection for SVMBased Vascular Anomaly Detection, vol. 6533, pp. 141–152. Springer, Berlin (2011) 9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet Classification with Deep Convolutional Neural Networks, pp. 1097–1105 (2012) 10. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, pp. 580–587. IEEE, USA (2014) 11. Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013) 12. Girshick, R.: Fast R-CNN, pp. 1440–1448. IEEE, Santiago, Chile (2015) 13. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6), 1137–1149 (2017) 14. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object Detection via Region-based Fully Convolutional Networks, pp. 379–387. Curran Associates, Inc. (2016) 15. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition, pp. 770– 778. IEEE, Las Vegas, USA (2016) 16. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition (2014). arXiv:1409.1556 [cs] 17. Savardi, M., Benini, S., Signoroni, A.: β-Hemolysis Detection on Cultured Blood Agar Plates by Convolutional Neural Networks. Lecture Notes in Computer Science, pp. 30–38. Springer International Publishing (2018) 18. Setio, A.A.A., Ciompi, F., Litjens, G., Gerke, P., Jacobs, C., van Riel, S.J., Wille, M.M.W., Naqibullah, M., Sanchez, C.I., van Ginneken, B.: Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 35(5), 1160–1169 (2016) 19. Teramoto, A., Fujita, H., Yamamuro, O., Tamaki, T.: Automated detection of pulmonary nodules in PET/CT images: Ensemble false-positive reduction using a convolutional neural network technique. Med. Phys. 43(6Part1), 2821–2827 (2016) 20. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection (2015). arXiv:1506.02640 [cs] 21. Liu, W., Anguelov, D., Erhan, D., Szegedy, S., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: Single Shot MultiBox Detector, vol. 9905, pp. 21–37. Springer International Publishing, Cham (2016) 22. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation (2015). arXiv:1505.04597 [cs] 23. Payer, C., Tern, D., Bischof, H., Urschler, M.: Regressing Heatmaps for Multiple Landmark Localization Using CNNs. Lecture Notes in Computer Science, pp. 230–238. Springer International Publishing (2016) 24. Mader, A.O., von Berg, J., Fabritz, A., Lorenz, C., Meyer, C.: Localization and Labeling of Posterior Ribs in Chest Radiographs Using a CRF-regularized FCN with Local Refinement. Lecture Notes in Computer Science, pp. 562–570. Springer International Publishing (2018) 25. Meyer, M.I., Galdran, A., Mendona, A.M., Campilho, A.: A Pixel-Wise Distance Regression Approach for Joint Retinal Optical Disc and Fovea Detection, vol. 11071, pp. 39–47. Springer International Publishing, Cham (2018) 26. 
Kiraly, A.P., Nader, C.A., Tuysuzoglu, A., Grimm, R., Kiefer, B., El-Zehiry, N., Kamen, A.: Deep Convolutional Encoder-Decoders for Prostate Cancer Detection and Classification. Lecture Notes in Computer Science, pp. 489–497. Springer International Publishing (2017) 27. Dubost, F., Bortsova, G., Adams, H., Ikram, A., Niessen, W.J., Vernooij, M., De Bruijne, M.: GP-Unet: Lesion Detection from Weak Labels with a 3d Regression Network. Lecture Notes in Computer Science, pp. 214–221. Springer International Publishing (2017) 28. Wang, Z., Yin, Y., Shi, J., Fang, W., Li, H., Wang, X.: Zoom-in-Net: Deep Mining Lesions for Diabetic Retinopathy Detection. Lecture Notes in Computer Science, pp. 267–275. Springer International Publishing (2017)


29. Yang, X., Wang, Z., Liu, C., Le, H.M., Chen, J., Cheng, K.-T. (Tim), Wang, L.: Joint Detection and Diagnosis of Prostate Cancer in Multi-parametric MRI Based on Multimodal Convolutional Neural Networks. Lecture Notes in Computer Science, pp. 426–434. Springer International Publishing (2017) 30. Dou, Q., Chen, H., Jin, Y., Lin, H., Qin, J., Heng, P.-A.: Automated Pulmonary Nodule Detection via 3d ConvNets with Online Sample Filtering and Hybrid-Loss Residual Learning. Lecture Notes in Computer Science, pp. 630–638. Springer International Publishing (2017) 31. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627– 1645 (2010) 32. Shrivastava, A., Gupta, A., Girshick, R.: Training Region-Based Object Detectors with Online Hard Example Mining. pp. 761–769. IEEE, USA (2016) 33. Yang, Y., Li, T., Li, W., Wu, H., Fan, W., Zhang, W.: Lesion Detection and Grading of Diabetic Retinopathy via Two-Stages Deep Convolutional Neural Networks. Lecture Notes in Computer Science, pp. 533–540. Springer International Publishing (2017) 34. Cai, Y., Landis, M., Laidley, D.T., Kornecki, A., Lum, A., Li, S.: Multi-modal vertebrae recognition using transformed deep convolution network. Comput. Med. Imaging Graph. 51, 11–19 (2016) 35. Li, Y., Alansary, A., Cerrolaza, J.J., Khanal, B., Sinclair, M., Matthew, J., Gupta, C., Knight, C., Kainz, B., Rueckert, D.: Fast Multiple Landmark Localisation Using a Patch-Based Iterative Network. Lecture Notes in Computer Science, pp. 563–571. Springer International Publishing (2018) 36. Zheng, Y., Liu, D., Georgescu, B., Nguyen, H., Comaniciu, D.: 3d Deep Learning for Efficient and Robust Landmark Detection in Volumetric Data, vol. 9349, pp. 565–572. Springer International Publishing, Cham (2015) 37. Chen, H., Ni, D., Qin, J., Li, S., Yang, X., Wang, T., Heng, P.-A.: Standard plane localization in fetal ultrasound via domain transferred deep neural networks. IEEE J. Biomed. Health Inform. 19(5), 1627–1636 (2015) 38. Baumgartner, C.F., Kamnitsas, K., Matthew, J., Fletcher, T.P., Smith, S., Koch, L.M., Kainz, B., Rueckert, D.: SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand Ultrasound. arXiv:1612.05601 [cs] (2016) 39. Ma, C., Huang, J-B., Yang, X., Yang, M.-H.: Hierarchical Convolutional Features for Visual Tracking, pp. 3074–3082. IEEE, Santiago, Chile (2015) 40. Kumar, A., Sridar, P., Quinton, A., Kumar, R.K., Feng, D., Nanan, R., Kim, J.: Plane Identification in Fetal Ultrasound Images Using Saliency Maps and Convolutional Neural Networks, pp. 791–794 (2016) 41. Setio, A.A.A., Traverso, A., de Bel, T., Berens, M.S.N., van den Bogaard, C., Cerello, P., Chen, H., Dou, Q., Fantacci, M.E., Geurts, B., van der Gugten, R., Heng, P.A., Jansen, B., de Kaste, M.M.J., Kotov, V., Yu-Hung Lin, J., Manders, J.T.M.C., Sora-Mengana, A., GarcaNaranjo, J.C., Papavasileiou, E., Prokop, M., Saletta, M., Schaefer-Prokop, C.M., Scholten, E.T., Scholten, L., Snoeren, M.M., Torres, E.L., Vandemeulebroucke, J., Walasek, N., Zuidhof, G.C.A., van Ginneken, B., Jacobs, C.: Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med Image Anal 42, 1–13 (2017) 42. Ding, J., Li, A., Hu, Z., Wang, L.: Accurate Pulmonary Nodule Detection in Computed Tomography Images Using Deep Convolutional Neural Networks. 
Lecture Notes in Computer Science, pp. 559–567. Springer International Publishing (2017) 43. Zhu, W., Liu, C., Fan, W., Xie, X.: DeepLung: Deep 3d Dual Path Nets for Automated Pulmonary Nodule Detection and Classification, pp. 673–681. IEEE, Lake Tahoe, NV (2018) 44. Platania, R., Shams, S., Yang, S., Zhang, J., Lee, K., Park, S.-J.: Automated Breast Cancer Diagnosis Using Deep Learning and Region of Interest Detection (BC-DROID), pp. 536–543. ACM Press, USA (2017) 45. Al-masni, M.A., Al-antari, M.A., Park, J.-M., Gi, G., Kim, T.-Y., Rivera, P., Valarezo, E., Choi, M.-T., Han, S.-M., Kim, T.-S.: Simultaneous detection and classification of breast masses in digital mammograms via a deep learning YOLO-based CAD system. Comput. Methods Programs Biomed. 157, 85–94 (2018) 46. Kooi, T., van Ginneken, B., Karssemeijer, N., den Heeten, A.: Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network. Med. Phys. 44(3), 1017–1027 (2017) 47. López-Linares, K., Aranjuelo, N., Kabongo, L., Maclair, G., Lete, N., Ceresa, M., García-Familiar, A., Macía, I., González Ballester, M.A.: Fully automatic detection and segmentation of abdominal aortic thrombus in post-operative CTA images using deep convolutional neural networks. Med. Image Anal. 46, 202–214 (2018) 48. Dou, Q., Chen, H., Yu, L., Zhao, L., Qin, J., Wang, D., Mok, V.C., Shi, L., Heng, P.: Automatic detection of cerebral microbleeds from MR images via 3d convolutional neural networks. IEEE Trans. Med. Imaging 35(5), 1182–1195 (2016) 49. Nair, T., Precup, D., Arnold, D.L., Arbel, T.: Exploring Uncertainty Measures in Deep Networks for Multiple Sclerosis Lesion Detection and Segmentation. Lecture Notes in Computer Science, pp. 655–663. Springer International Publishing (2018) 50. Xu, C., Xu, L., Gao, Z., Zhao, S., Zhang, H., Zhang, Y., Du, X., Zhao, S., Ghista, D., Li, S.: Direct Detection of Pixel-Level Myocardial Infarction Areas via a Deep-Learning Algorithm. Lecture Notes in Computer Science, pp. 240–249. Springer International Publishing (2017) 51. Kuo, W., Häne, C., Yuh, E., Mukherjee, P., Malik, J.: Cost-Sensitive Active Learning for Intracranial Hemorrhage Detection. Lecture Notes in Computer Science, pp. 715–723. Springer International Publishing (2018)

Chapter 2

Medical Image Segmentation Using Deep Learning
Karen López-Linares Román, María Inmaculada García Ocaña, Nerea Lete Urzelai, Miguel Ángel González Ballester and Iván Macía Oliver

Abstract This chapter aims at providing an introduction to deep learning-based medical image segmentation. First, the reader is guided through the inherent challenges of medical image segmentation, and current approaches to overcome these limitations are discussed. Second, supervised and semi-supervised architectures are described, among which encoder-decoder type networks are the most widely employed; nonetheless, generative adversarial network-based semi-supervised approaches have recently gained the attention of the scientific community. The shift from traditional 2D to 3D architectures is also discussed, as well as the most common loss functions used to improve the performance of medical image segmentation approaches. Finally, future trends and conclusions are presented.

2.1 Introduction
Semantic image segmentation refers to the task of clustering together or isolating parts of an image that belong to the same object [34]. It is also called pixel-wise classification. In medical imaging, semantic segmentation is employed to isolate parts of body systems, from cells to tissues and organs, to enable a complex analysis of the region of interest.


This automated segmentation is usually challenging because of the large variations in anatomical shape and size between patients and the low contrast with surrounding tissues. Traditional approaches to medical image segmentation include techniques usually designed by human experts based on their knowledge of the target domain. In this sense, general-purpose algorithms, such as intensity-based methods, shape and appearance models or hybrid methods, have been widely adapted to and employed in medical image segmentation. Machine learning approaches, based on the concept of feature extraction and the use of statistical classifiers, are also very popular in medical image analysis, including segmentation. Again, these systems rely on the definition of task-specific feature vectors designed by human experts [50]. These handcrafted features, which are supposed to be discriminative for a certain application, are then used to train a computer algorithm that determines the optimal decision boundary in a high-dimensional feature space. Hence, the performance and success of these methods is highly influenced by the correct extraction of the most meaningful characteristics. In the last decade, deep learning-based methods have given a boost to medical image analysis [33, 50], allowing features to be learned efficiently and directly from the imaging data and converting the feature engineering step into a learning step. Instead of relying on human-designed features, deep learning techniques require only sets of data, from which informative representations are inferred directly in a self-taught manner. Hence, deep learning-based applications for medical imaging have clearly exceeded the performance of traditional methods in complex tasks. Specifically, segmentation is the most common application of deep learning to medical imaging, where shape, appearance and context information are jointly exploited directly from the image to provide the best segmentation results. Nonetheless, applying deep learning to medical image segmentation requires dealing with domain-specific challenges, such as the segmentation of very small structures, and overcoming limitations mostly related to the amount and quality of data and annotations.

2.2 Challenges and Limitations When Applying Deep Learning to Medical Image Segmentation
Applying deep learning to medical image segmentation has some inherent limitations compared with the computer vision domain. While large databases of natural, general-purpose images are easily available and accessible to computer vision researchers, sometimes even publicly, acquiring and utilizing medical images is a significant restricting factor in the development of new deep learning-based technologies [33]. Medical image databases are usually small and private, even though PACS systems are filled with millions of images acquired routinely in almost every hospital. There are two main reasons why this large amount of stored data is not directly exploitable for medical image analysis:


• Ethical, privacy and legal issues: the usage of medical data is subject to specific regulations to guarantee its correct management. In order to use an image in a certain study, it is necessary to have the informed consent of the patient, as well as to follow data anonymization procedures to ensure the privacy of the patient.
• Lack of annotations for the images: training a medical image segmentation algorithm requires that each pixel in the image is labelled according to its class, i.e. object or background, which is extremely time-consuming and requires expert knowledge.
Thus, learning efficiently from limited annotated data is an important area of research on which engineers and researchers have to focus. The following strategies are usually employed to increase the size of the database when developing deep learning-based segmentation approaches:
• Data augmentation refers to expanding the database by generating new images, either with simple operations such as translations and rotations, or with more advanced techniques to create synthetic images, such as elastic deformations [21, 40, 46], principal component analysis [44] or histogram matching [40] (a minimal sketch of this idea, combined with selective oversampling, is given after the next list).
• In many applications, rather than using the full-sized images, 3D medical volumes are converted to stacks of independent 2D images to train the network with more data [13, 19, 36]. Similarly, sub-volumes or image patches are also commonly extracted from the images to increase the amount of data [16, 31, 39, 41]. Yet, the obvious disadvantage is that anatomic context in the directions orthogonal to the slice plane is entirely discarded.
• Recently, another approach to create synthetic images to augment the database has been proposed, based on Generative Adversarial Networks [51, 52] (see Sect. 2.3).
However, the limitations not only come from the imaging data itself, but also from the quality of the associated annotations. Obtaining precise labels for each pixel in the image is time-consuming and requires expert knowledge. Thus, researchers try to reduce the burden by developing semi-automatic annotation tools, sparse annotations [12] or leveraging non-expert labels via crowd-sourcing [1]. In any of these cases, dealing with label noise is challenging when developing segmentation algorithms. Labels are always human-dependent, where factors like knowledge, resolution of the images, visual perception and fatigue play an important role in the quality of the output annotations, which can be considered fuzzy. Thus, training a deep learning system on such data requires careful attention to how noise and uncertainty in the ground truth are handled, which is still an open challenge. Finally, designing deep learning systems that can handle the challenge of class imbalance is another active area of research. In the simplest binary segmentation problem, this refers to the large difference between the number of pixels corresponding to the structure or tissue to be segmented and the number of background pixels, which make up the majority of the image. This sometimes leads to systems with extremely good accuracy when segmenting the background, but that fail to delimit the object of interest. Several solutions have been proposed to tackle this issue:


• Pose the segmentation as a two-step problem, first detecting a region of interest, then segmenting the structure of interest within that smaller region [36].
• Apply selective data oversampling, augmenting only those images or patches where the minority class is visible.
• Redesign the loss function and the evaluation metric in order to favor accuracy on the minority class, also called cost-sensitive learning, which will be further explained in Sect. 2.4.
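To make these strategies concrete, the following is a minimal sketch of simple geometric data augmentation applied selectively to slices that contain the minority (foreground) class, assuming NumPy and SciPy are available; the function names, parameter ranges and number of copies are illustrative choices, not taken from any of the cited works.

```python
import numpy as np
from scipy import ndimage


def augment_pair(image, mask, rng):
    """Random rotation, flip and shift applied identically to an image/mask pair."""
    angle = rng.uniform(-10, 10)
    image = ndimage.rotate(image, angle, reshape=False, order=1)
    mask = ndimage.rotate(mask, angle, reshape=False, order=0)  # nearest for labels
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    shift = rng.uniform(-5, 5, size=2)
    return ndimage.shift(image, shift, order=1), ndimage.shift(mask, shift, order=0)


def oversample_minority(images, masks, copies=4, seed=0):
    """Selective oversampling: augment only slices where the minority
    (foreground) class is visible, leaving background-only slices untouched.

    images, masks: lists of 2D NumPy arrays (slices and their label maps).
    """
    rng = np.random.default_rng(seed)
    extra_images, extra_masks = [], []
    for img, msk in zip(images, masks):
        if msk.any():                          # slice contains foreground
            for _ in range(copies):
                a_img, a_msk = augment_pair(img, msk, rng)
                extra_images.append(a_img)
                extra_masks.append(a_msk)
    return images + extra_images, masks + extra_masks
```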

2.3 Deep Learning Architectures for Medical Image Segmentation
Approaches to semantic image segmentation are mostly supervised methods. Semi-supervised segmentation has also been addressed, mostly using unsupervised Generative Adversarial Networks (GANs) [61] to improve or constrain the segmentation attained in a supervised manner.

2.3.1 Supervised Deep Learning Architectures
Most of the progress in semantic image segmentation has been made under the supervised scheme. Supervised methods learn directly from labeled training samples, extracting features and context information in order to perform a dense pixel- (or voxel-) wise classification.

2.3.1.1 Fully Convolutional Network (FCN)

One of the first fully convolutional neural networks for semantic segmentation, known as the Fully Convolutional Network (FCN), was introduced in [35], and it was the basis for the subsequent development of segmentation architectures in both the computer vision and the medical imaging domains. The FCN was developed by adapting classifiers for dense prediction, replacing the final fully connected layers typical of classification networks with convolutional layers to preserve local image relations. Trained end-to-end on whole images, this network exceeded the state of the art in semantic segmentation, taking an input of arbitrary size and producing a correspondingly-sized output prediction. FCN networks are composed of basic components, i.e. convolution and pooling layers, and activation functions. They also introduced skip connections to combine semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. A skip-net architecture centers on a primary stream, and skip or side connections are added to incorporate feature responses from different scales in a shared output layer, as shown in Fig. 2.1.


Fig. 2.1 Fully convolutional networks combining coarse, high layer information with fine, low level information, using skip connections
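As a concrete illustration of this idea, the following is a minimal sketch of an FCN-style head with a single skip connection, assuming PyTorch as the framework; the layer sizes and class count are placeholders rather than the configuration used in [35].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFCN(nn.Module):
    """Minimal FCN-style network: a small encoder, 1x1 'classifier' convolutions
    instead of fully connected layers, and one skip connection."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))          # 1/2 resolution
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))          # 1/4 resolution
        self.score_coarse = nn.Conv2d(32, num_classes, 1)      # dense prediction
        self.score_skip = nn.Conv2d(16, num_classes, 1)        # side output (skip)

    def forward(self, x):
        f1 = self.block1(x)                                    # fine features
        f2 = self.block2(f1)                                   # coarse features
        coarse = self.score_coarse(f2)
        # Upsample coarse scores and fuse them with the finer skip scores
        coarse_up = F.interpolate(coarse, size=f1.shape[2:], mode="bilinear",
                                  align_corners=False)
        fused = coarse_up + self.score_skip(f1)
        # Upsample to the input resolution for pixel-wise classification
        return F.interpolate(fused, size=x.shape[2:], mode="bilinear",
                             align_corners=False)


print(TinyFCN()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```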

Early applications of CNNs to medical imaging made use of this architecture, mostly in 2D. Examples of CT image segmentation approaches using early FCNs include liver and lesion segmentation [6], multi-organ segmentation [64] or pancreas segmentation [66]. In MRI, FCNs have been successfully applied to tissue differentiation [55] or cardiovascular structure segmentation [5], among others.

2.3.1.2 Encoder-Decoder Architectures

In 2015, a new CNN architecture specifically designed for biomedical image segmentation, known as the u-net, was proposed [46]. The network was built upon the FCN, modifying and extending it such that it could work with very few training images and still yield very precise segmentations. Compared with the FCN, the u-net architecture has a larger number of feature channels in the upsampling part of the network, which allows context information to be propagated to higher-resolution layers. Thus, the expansive path is more or less symmetric to the contracting path, yielding a u-shaped or encoder-decoder architecture, as shown in Fig. 2.2. Additionally, the skip connections between the down-sampling path and the up-sampling path apply a concatenation operator instead of a sum. This encoder-decoder architecture is the basis for most of the medical image segmentation networks designed to date, and it has been widely employed to segment organs and tissues in different imaging modalities [10, 29, 31, 43, 46]. Some of the main differences between encoder-decoder architectures in the literature correspond to the replacement of some building blocks with more recently designed, more complex ones. Examples include substituting typical convolutional blocks with dilated convolutions or inception blocks.
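The following sketch illustrates the concatenation-based skip connection that distinguishes a u-net-style decoder from the additive fusion used in the FCN; it assumes PyTorch and uses arbitrary channel counts, so it should be read as a schematic rather than the original u-net configuration.

```python
import torch
import torch.nn as nn


class UpBlock(nn.Module):
    """One u-net-style decoder step: upsample, concatenate the encoder
    feature map (skip connection), then convolve the merged features."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # recover spatial resolution
        x = torch.cat([x, skip], dim=1)     # concatenation, not summation
        return self.conv(x)


# Example: fuse coarse 64-channel features with a 32-channel encoder map
up = UpBlock(in_ch=64, skip_ch=32, out_ch=32)
decoded = up(torch.randn(1, 64, 16, 16), torch.randn(1, 32, 32, 32))
print(decoded.shape)  # torch.Size([1, 32, 32, 32])
```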


Fig. 2.2 U-net encoder-decoder architecture specifically designed for biomedical image segmentation
Fig. 2.3 Visual representation of dilated convolutions

Dilated convolutions were first introduced in [62] with the goal of implementing convolutional networks specifically adapted to the segmentation task, since most semantic segmentation networks were based on adaptations of networks originally designed for image classification. These networks are mainly composed of convolution and pooling layers, where the latter operation is employed to reduce the spatial dimensions of the image along the network. By applying these subsampling layers, the resolution of the features is gradually lost, resulting in coarse features that can miss details of small objects, which are difficult to recover even using skip connections. Hence, the goal of using dilated convolutions is to gain a wide field of view by aggregating multi-scale contextual information through expanded receptive fields, while preserving the full spatial dimension and without losing resolution. A visual representation of these dilated convolutions is depicted in Fig. 2.3. Using dilated convolutions has been shown to improve segmentation accuracy in several applications [3, 17, 22, 45, 56]. Inception modules were first introduced by Google [54] with the aim of improving the utilization of the computing resources inside the network. The idea consists in applying pooling operations and convolutions with different kernel sizes in parallel and concatenating the resulting feature maps before going to the next layer, as shown in Fig. 2.4.


Fig. 2.4 Simple scheme of an inception block as proposed by [54]

Even though inception modules are mostly applied to image classification, some segmentation methods also leverage their advantages [11, 24, 32]. Another modification relates to the way in which information is passed through the network layers, mostly when the network is very deep, for which residual connections or dense connections have been introduced, leading to new backbone architectures that are further explained in Sects. 2.3.1.3 and 2.3.1.4.
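As a hedged illustration of these two building blocks, the sketch below shows a dilated convolution and a simplified inception-style branch concatenation in PyTorch; the channel counts, dilation rate and kernel sizes are arbitrary choices for demonstration, not those of [54] or [62].

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation 2 covers a 5x5 receptive field with no extra
# parameters and no loss of spatial resolution (padding preserves the size).
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)


class MiniInception(nn.Module):
    """Simplified inception-style block: parallel branches with different
    kernel sizes, concatenated along the channel dimension."""

    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 8, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 8, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 8, kernel_size=5, padding=2)
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 8, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.pool(x)], dim=1)


x = torch.randn(1, 16, 64, 64)
print(dilated(x).shape, MiniInception(16)(x).shape)
# torch.Size([1, 16, 64, 64]) torch.Size([1, 32, 64, 64])
```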

2.3.1.3 Residual Networks

Network depth, i.e. the number of stacked layers in a network, is of crucial importance to achieve good segmentation results, as the quality of the features can be enriched by the number of stacked layers. However, as a CNN becomes increasingly deep and information about the input and gradient passes through many layers, the gradients can vanish when they propagate back to the initial layers of the network. This problem is known as vanishing gradients. In addition, when deeper networks start converging they sometimes suffer from a degradation problem, in which the accuracy of the network gets saturated and starts degrading [20]. To tackle this issue, in 2015 a new architecture called the residual network (ResNet) was proposed [20], which allows much deeper architectures with a similar number of parameters. The core idea of ResNet consists in introducing an identity shortcut connection that skips one or more layers in a so-called "residual unit" (RU), shown in Fig. 2.5a. For each RU, the shortcut connection performs identity mapping and its output is added to the output of the stack of layers. The authors hypothesize that letting the stacked layers fit a residual mapping is easier than letting them directly fit the desired underlying mapping, mitigating the problem of vanishing gradients. These residual units have been successfully applied in segmentation networks to improve, among others, mass segmentation in mammograms [30], intracranial carotid artery calcification extraction in CT scans [7] and the segmentation of retinal layer structures from OCT images [4].
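A minimal sketch of a residual unit, assuming PyTorch; the normalization choice and channel count are illustrative, not the exact block used in [20].

```python
import torch
import torch.nn as nn


class ResidualUnit(nn.Module):
    """Basic residual unit: two convolutions whose output is added to the
    identity shortcut, so the stacked layers only have to learn a residual."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # F(x) + x


x = torch.randn(1, 32, 64, 64)
print(ResidualUnit(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```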


Fig. 2.5 Residual unit and densely connected block

2.3.1.4 Densely Connected Convolutional Networks

In order to improve the flow of information and gradients in deep networks, residual architectures pass information through the network by creating paths from early layers to later layers. Differently, densely connected network architectures (DenseNet) were proposed to ensure maximum information flow between the layers of the network by connecting all layers directly with each other [23]. Each layer obtains additional inputs from all previous layers and passes its own feature maps on to all subsequent layers, which are combined by concatenation. This allows fewer parameters than a traditional CNN, as there is no need to relearn redundant feature maps. A scheme of a densely connected block is depicted in Fig. 2.5b. Thus, instead of taking advantage of the power of extremely deep architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and very efficient. Concatenating feature maps learned by different layers increases the variability in the input of subsequent layers, which is a major difference between DenseNets and ResNets. Examples of the use of dense blocks in medical image segmentation include whole heart and vessel segmentation from MRI [63], pulmonary artery segmentation from CT scans [44], multi-organ segmentation from CT scans [17] and brain segmentation from MRI [8, 14].
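A minimal dense block sketch, assuming PyTorch; the growth rate and layer count are arbitrary demonstration values rather than the settings of [23].

```python
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """Minimal dense block: every layer receives the concatenation of all
    preceding feature maps and contributes 'growth' new channels."""

    def __init__(self, in_ch, growth=8, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier maps
            features.append(out)
        return torch.cat(features, dim=1)


x = torch.randn(1, 16, 32, 32)
print(DenseBlock(16)(x).shape)  # torch.Size([1, 40, 32, 32]) = 16 + 3 * 8 channels
```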

2.3.1.5 Recurrent Neural Networks

Recurrent neural networks (RNNs) are designed to recognize patterns in sequences of data, such as text or genomes. These algorithms take time and sequence into account, so they have a temporal dimension: their input is not just the current example they see, but also what they have perceived previously in time. Thus, it is often said that recurrent networks have memory. As previously explained, in medical image segmentation it is quite common to decompose 3D medical volumes into 2D slices or image patches to train a network. The final 3D segmentation is then reconstructed from the predictions of each 2D slice or image patch in a post-processing step, but 3D consistency is not always ensured, due to the loss of contextual information when segmenting each slice and patch independently. RNNs have been utilized to address this issue by retrieving global spatial dependencies thanks to their memory capabilities [2, 9, 42, 53, 60].
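A hedged toy sketch of this idea in PyTorch: per-slice features (here reduced to globally pooled vectors for brevity) are fed through an LSTM along the slice axis so each per-slice prediction can use inter-slice context. The encoder, feature size and per-slice output head are placeholders, not any of the architectures cited above.

```python
import torch
import torch.nn as nn


class SliceContextRNN(nn.Module):
    """Toy model: encode each 2D slice independently, then let an LSTM
    propagate context along the slice (z) axis of the volume."""

    def __init__(self, feat_dim=32, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, feat_dim, 3, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)   # per-slice prediction

    def forward(self, volume):                  # volume: (B, Z, 1, H, W)
        b, z = volume.shape[:2]
        slices = volume.reshape(b * z, *volume.shape[2:])
        feats = self.encoder(slices).flatten(1).reshape(b, z, -1)
        context, _ = self.rnn(feats)            # inter-slice memory
        return self.head(context)               # (B, Z, num_classes)


vol = torch.randn(2, 16, 1, 64, 64)
print(SliceContextRNN()(vol).shape)  # torch.Size([2, 16, 2])
```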


2.3.2 Semi-supervised Deep Learning Architectures
Fully unsupervised medical image segmentation has scarcely been addressed in the literature, but there are some semi-supervised approaches that make use of a combination of supervised and unsupervised architectures, mostly relying on Generative Adversarial Networks (GANs), within a single segmentation framework. While generative methods have been widely employed in unsupervised and semi-supervised learning for visual classification tasks, very little has been done for semantic segmentation.

2.3.2.1 Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) [57, 61] are a special type of neural network model in which two networks are trained simultaneously: one is the generator (G), which is focused on generating synthetic images from noise vectors; the other is the discriminator, which is focused on distinguishing between real images and the fake images generated by G. GANs are trained by solving an optimization problem that the discriminator tries to maximize and the generator tries to minimize. In medical image segmentation, GANs have mostly been employed within a semi-supervised segmentation approach with different purposes. Some approaches [37, 58] employ GANs to refine, constrain or impose a structurally correct segmentation by assessing whether a segmentation, or a combination of segmentation and input image, is plausible. Another application refers to the use of GANs to generate new segmentations or annotated images (image synthesis) [25, 51], which may help alleviate the burden of obtaining time-consuming pixel-wise annotations.
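The following sketch outlines one adversarial training step under the standard minimax formulation, assuming PyTorch, externally defined generator/discriminator modules and a discriminator that returns one logit per image; it is a schematic of the optimization game rather than any of the cited segmentation GANs.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()


def gan_step(generator, discriminator, g_opt, d_opt, real_images, noise):
    """One adversarial update: the discriminator learns to separate real from
    fake samples, the generator learns to make its samples look real."""
    # --- discriminator update: maximize log D(x) + log(1 - D(G(z))) ---
    d_opt.zero_grad()
    fake = generator(noise).detach()          # do not backprop into G here
    real_labels = torch.ones(real_images.size(0), 1)
    fake_labels = torch.zeros(fake.size(0), 1)
    d_loss = bce(discriminator(real_images), real_labels) \
        + bce(discriminator(fake), fake_labels)
    d_loss.backward()
    d_opt.step()

    # --- generator update: try to fool the discriminator ---
    g_opt.zero_grad()
    fake = generator(noise)
    g_loss = bce(discriminator(fake), torch.ones(fake.size(0), 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```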

2.3.3 From 2D to 3D Segmentation Networks
Efficient implementations of 3D convolution and pooling operations, as well as growing GPU memory capacity, have made it possible to extend architectures developed for 2D segmentation to 3D medical imaging. Training directly in 3D has some advantages over its 2D counterparts: for 3D medical image modalities, such as MRI or CT, training in 3D allows all the context information to be leveraged and better preserves the consistency of the final segmentation. 3D segmentation networks have also shown reasonably good generalization and convergence when trained on few volumes. On the other hand, the advantages of working in 2D are higher speed, lower memory consumption, fewer parameters, and the ability to utilize pre-trained networks, either directly or via transfer learning. The first 3D network implementations for medical image segmentation were mostly trained on small 3D patches extracted from the whole image volume [39, 47], which was then reconstructed in a subsequent post-processing step.


As with 2D patches, this training approach suffers in accuracy: there is a high number of redundant computations, the runtime is high, and local as well as global context is not adequately preserved. Most recent works reimplement previous 2D architectures in 3D and train them directly with the whole input volume instead of patch-wise. In [15], the FCN network proposed in [35] was extended to 3D to automatically segment the liver from CT volumes. In [12], the original 2D u-net [46] was extended to 3D and trained end-to-end with the complete volumetric data. Interestingly, the authors showed that the network can be satisfactorily trained from sparsely annotated training data, requiring manual annotations for only a few slices instead of the whole volume. In [40], a 3D encoder-decoder architecture similar to the u-net was also presented, but incorporating the concept of residual functions [20]. The network, known as V-Net, was trained end-to-end to segment the prostate from 3D MRI volumes. The large number of recent conference and journal papers presenting 3D architectures [7, 8, 15, 28, 38, 44, 59, 63] shows that with the new computational capabilities there is a trend to move segmentation tasks from the two-dimensional space to 3D.
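Extending a 2D block to 3D mostly amounts to swapping the layer types; the snippet below shows the change, assuming PyTorch, with arbitrary channel counts chosen only for illustration.

```python
import torch
import torch.nn as nn

# A 2D block (per-slice) and its direct 3D counterpart (whole volume):
block_2d = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.MaxPool2d(2))
block_3d = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.MaxPool3d(2))

slice_2d = torch.randn(1, 1, 128, 128)          # (B, C, H, W)
volume_3d = torch.randn(1, 1, 64, 128, 128)     # (B, C, D, H, W)
print(block_2d(slice_2d).shape)    # torch.Size([1, 16, 64, 64])
print(block_3d(volume_3d).shape)   # torch.Size([1, 16, 32, 64, 64])
# The 3D version sees inter-slice context, at a much higher memory cost.
```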

2.4 Loss Functions for Medical Image Segmentation
One of the most relevant design aspects to consider when implementing a deep learning system is the definition of an appropriate loss function for each specific task. The loss function guides the learning process, determining how the error between the output of a neural network and the labeled ground truth is computed. The most commonly used loss function is the pixel-wise cross-entropy loss, which evaluates the prediction for each pixel individually and then averages over all pixels, giving equal importance to each pixel in the image. One of the main challenges when training segmentation architectures in the medical field is data imbalance, as training can be dominated by the most prevalent class. For example, in applications such as lesion segmentation the number of lesion pixels is often much lower than the number of non-lesion pixels, which may result in a network with very high accuracy but incapable of segmenting lesions. To tackle this issue, weighted cross-entropy loss functions have been widely applied [35, 38, 41, 44, 46], where the assigned weights are inversely proportional to the probability of each class appearing, i.e. higher appearance probabilities lead to lower weights. The most popular loss function for medical image segmentation tasks is the Dice loss, which measures the overlap between two samples and is a harmonic mean of precision and recall. It was first introduced in [40] for binary segmentation, yielding better results than a weighted cross-entropy loss function. It is now one of the preferred loss functions in the literature, both for binary and multi-class segmentation [26, 65]. Weighted Dice loss functions, where a per-class Dice coefficient is computed [49], and a generalization of the Dice coefficient, the Tversky index, have also been proposed [48].
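A minimal soft Dice loss for binary segmentation, assuming PyTorch; the smoothing constant is an arbitrary choice and the formulation is a common variant rather than the exact one in [40].

```python
import torch


def soft_dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss for binary segmentation.

    logits: raw network output, shape (B, 1, H, W)
    target: binary ground truth of the same shape
    """
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)                               # sum over channel and space
    intersection = (probs * target).sum(dims)
    denom = probs.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (denom + eps)
    return 1 - dice.mean()                         # minimize 1 - Dice


pred = torch.randn(2, 1, 8, 8)
gt = (torch.rand(2, 1, 8, 8) > 0.8).float()        # sparse foreground
print(soft_dice_loss(pred, gt).item())
```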


Besides, the performance of a network critically depends on the quality of the annotations used to train it, which are sometimes noisy, ambiguous and inaccurate, as explained in Sect. 2.2. Thus, researchers have tried to account for errors in the ground truth labels during the training process by using bootstrapped loss functions, such as the bootstrapped cross-entropy. The main idea is that the loss attempts to focus the learning process on hard-to-segment parts of an image by dropping the loss contribution of voxels that have already been classified with a good degree of certainty [18, 27].
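One common way to realize this idea is a "hard" bootstrapped cross-entropy that keeps only the largest per-pixel losses; the sketch below assumes PyTorch and an arbitrary keep-fraction, and is only one of several bootstrapping variants described in [18, 27].

```python
import torch
import torch.nn.functional as F


def bootstrapped_cross_entropy(logits, target, keep_fraction=0.25):
    """Cross-entropy averaged only over the hardest pixels.

    logits: (B, C, H, W) raw scores; target: (B, H, W) integer labels.
    Confidently classified pixels contribute little or nothing, so learning
    focuses on the difficult regions.
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    per_pixel = per_pixel.flatten(1)                               # (B, N)
    k = max(1, int(keep_fraction * per_pixel.shape[1]))
    hardest, _ = per_pixel.topk(k, dim=1)                          # largest losses
    return hardest.mean()


logits = torch.randn(2, 3, 16, 16)
labels = torch.randint(0, 3, (2, 16, 16))
print(bootstrapped_cross_entropy(logits, labels).item())
```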

2.5 Conclusions and Future Directions
Medical image segmentation is an essential step to obtain meaningful information from organs, tissues, pathologies or other body structures. It enables the complex analysis of the segmented regions and the extraction of quantitative information that could be relevant to the development of computer-assisted diagnosis, surgery systems or predictive models, for instance. The unprecedented success of deep learning techniques has boosted the accuracy of complex segmentation tasks that could not be solved with traditional image processing algorithms. However, some challenges inherent to the medical imaging domain have to be dealt with to ensure the applicability of deep learning systems in clinical practice. These refer mostly to the generalization of the segmentation approaches, which requires large amounts of annotated medical image data, obtained with different protocols and in different settings, that cover most of the anatomical and pathological variations among patients. Thus, researchers try to compensate for the lack of data and annotations with intelligent data augmentation techniques and specifically designed loss functions. According to the literature, supervised deep learning segmentation networks, specifically variants of encoder-decoder architectures, are the most widely employed by the medical imaging community. The design of new building blocks to improve the efficiency and accuracy of these networks is an active area of research. There is, however, a trend towards semi-supervised and unsupervised architectures in the form of generative adversarial networks, which could reduce the need for time-consuming expert annotations that are always a limitation for the development of segmentation systems. These new approaches, together with advances in computational capabilities, will open up a new era in medical image segmentation.

References
1. Albarqouni, Shadi, Baur, Christoph, Achilles, Felix, Belagiannis, Vasileios, Demirci, Stefanie, Navab, Nassir: AggNet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans. Med. Imaging 35(5), 1313–1321 (2016) 2. Alom, Z., Taha, T.M., Asari, V.K.: Recurrent residual convolutional neural network based on u-net (R2U-Net) for medical image segmentation, p. 12


3. Anthimopoulos, M., Christodoulidis, S., Ebner, L., Geiser, T., Christe, A., Mougiakakou, S.: Semantic segmentation of pathological lung tissue with dilated fully convolutional networks. IEEE J. Biomed. Health Inform. pp. 1 (2018). arXiv:1803.06167 4. Apostolopoulos, S., De Zanet, S., Ciller, C., Wolf, S., Sznitman, R.: Pathological OCT retinal layer segmentation using branch residual u-shape networks. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2017. Lecture Notes in Computer Science, pp. 294– 301. Springer International Publishing (2017) 5. Bai, W., Sinclair, M., Tarroni, G., Oktay, O., Rajchl, M., Vaillant, G., Lee, A.M., Aung, N., Lukaschuk, E., Sanghvi, M.M., Zemrak, F., Fung, K., Paiva, J.M., Carapella, V., Kim, Y.J., Suzuki, H., Kainz, B., Matthews, P.M., Petersen, S.E., Piechnik, S.K., Neubauer, S., Glocker, B., Rueckert, D.: Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. J. Cardiovasc. Magn. Reson. 20 (2018) 6. Ben-Cohen, A., Diamant, I., Klang, E., Amitai, M., Greenspan, H.: fully convolutional network for liver segmentation and lesions detection. In: Carneiro, G., Mateus, D., Peter, L., Bradley, A., Tavares, J.M.R.S., Belagiannis, V., Papa, J.P., Nascimento, J.C., Loog, M., Lu, Z., Cardoso, J.S., Cornebise, J. (eds.) Deep Learning and Data Labeling for Medical Applications, vol. 10008, pp. 77–85. Springer International Publishing, Cham (2016) 7. Bortsova, G., van Tulder, G., Dubost, F., Peng, T., Navab, N., van der Lugt, A., Bos, D., De Bruijne, M.: Segmentation of intracranial arterial calcification with deeply supervised residual dropout networks. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L. and Duchesne, S. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 356–364. Springer International Publishing (2017) 8. Bui, T.D., Shin, J., Moon, T.: 3D densely convolutional networks for volumetric segmentation (2017) 9. Cai, J., Lu, L., Zhang, Z., Xing, F., Yang, L., Yin, Q.: Pancreas segmentation in MRI using graph-based decision fusion on convolutional neural networks. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G. and Wells, W. (eds.) Medical Image Computing and ComputerAssisted Intervention MICCAI 2016, Lecture Notes in Computer Science, pp. 442–450. Springer International Publishing (2016) 10. Carneiro, G., Zheng, Y., Xing, F., Yang, L.: Review of deep learning methods in mammography, cardiovascular, and microscopy image analysis. In: Lu, L., Zheng, Y., Carneiro, G., Yang, L. (eds.) Deep Learning and Convolutional Neural Networks for Medical Image Computing: Precision Medicine. High Performance and Large-Scale Datasets, Advances in Computer Vision and Pattern Recognition, pp. 11–32. Springer International Publishing, Cham (2017) 11. Chudzik, P., Majumdar, S., Caliva, F., Al-Diri, B., Hunter, A.: Exudate segmentation using fully convolutional neural networks and inception modules. In: Medical Imaging 2018: Image Processing, vol. 10574, pp. 1057430. International Society for Optics and Photonics (2018) 12. Çiçed, Ö., Abdulkadir, A., Lienkamp, S.S. Brox, T., Ronneberger, O.: 3D U-net: learning dense volumetric segmentation from sparse annotation (2016). arXiv:1606.06650 13. Ciresan, D., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. 
In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2843–2851. Curran Associates, Inc. (2012) 14. Dolz, J., Gopinath, K., Yuan, J., Lombaert, H., Desrosiers, C., Ayed, I.B.: HyperDense-Net: a hyper-densely connected CNN for multi-modal image segmentation (2018) 15. Dou, Q., Chen, H., Jin, Y., Yu, L., Qin, J., Heng, P.A.: 3D deeply supervised network for automatic liver segmentation from CT volumes. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2016, Lecture Notes in Computer Science, pp. 149–157. Springer International Publishing (2016) 16. Fritscher, K., Raudaschl, P., Zaffino, P., Spadea, M.F., Sharp, G.C., Schubert, R.: Deep neural networks for fast segmentation of 3D medical images. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2016, Lecture Notes in Computer Science, pp. 158–165. Springer International Publishing (2016) 17. Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B.R., Pereira, S.P., Clarkson, M.J., Barratt, D.C.: Towards image-guided pancreas and biliary endoscopy: automatic multi-organ segmentation on abdominal CT with dense dilated networks. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 728–736. Springer International Publishing (2017) 18. Guerrero, R., Qin, C., Oktay, O., Bowles, C., Chen, L., Joules, R., Wolz, R., Valdés-Hernández, M.C., Dickie, D.A., Wardlaw, J., Rueckert, D.: White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks. NeuroImage: Clinical 17, 918–934 (2017) 19. Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P.M., Larochelle, H.: Brain tumor segmentation with deep neural networks. Med. Image Anal. 35, 18–31 (2017). arXiv:1505.03540 20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015) 21. Heinrich, L., Funke, J., Pape, C., Nunez-Iglesias, J., Saalfeld, S.: Synaptic cleft segmentation in non-isotropic volume electron microscopy of the complete drosophila brain. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Computer Assisted Intervention MICCAI 2018, Lecture Notes in Computer Science, pp. 317–325. Springer International Publishing (2018) 22. Heinrich, M.P., Oktay, O.: BRIEFnet: deep pancreas segmentation using binary sparse convolutions. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 329–337. Springer International Publishing (2017) 23. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks (2016). arXiv:1608.06993 24. Hussain, S., Anwar, S.M., Majid, M.: Segmentation of glioma tumors in brain using deep convolutional neural network. Neurocomputing 248–261 (2018). arXiv:1708.00377 25. Iqbal, T., Ali, H.: Generative adversarial network for medical images (MI-GAN). J. Med. Syst. 42(11), 231 (2018) 26. Jog, A., Fischl, B.: Pulse sequence resilient fast brain segmentation. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Computer Assisted Intervention MICCAI 2018, Lecture Notes in Computer Science, pp. 654–662. Springer International Publishing (2018) 27. Keshwani, D., Kitamura, Y., Li, Y.: Computation of total kidney volume from CT images in autosomal dominant polycystic kidney disease using multi-task 3D convolutional neural networks, p. 8 28. Koziński, M., Mosinska, A., Salzmann, M., Fua, P.: Learning to segment 3D linear structures using only 2D annotations. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Computer Assisted Intervention MICCAI 2018, Lecture Notes in Computer Science, pp. 283–291. Springer International Publishing (2018) 29. Kumar, A., Agarwala, S., Dhara, A.K., Nandi, D., Thakur, S.B., Bhadra, A.K., Sadhu, A.: Segmentation of lung field in HRCT images using u-net based fully convolutional networks, p. 10 30. Li, H., Chen, D., Nailon, W.H., Davies, M.E., Laurenson, D.: Improved breast mass segmentation in mammograms with conditional residual u-net (2018). arXiv:1808.08885 31. Li, J., Sarma, K.V., Ho, K.C., Gertych, A., Knudsen, B.S., Arnold, C.W.: A multi-scale u-net for semantic segmentation of histological images from radical prostatectomies. In: AMIA Annual Symposium Proceedings, 2017, pp. 1140–1148 (2018) 32. Li, R., Zeng, T., Peng, H., Ji, S.: Deep learning segmentation of optical microscopy images improves 3-D neuron reconstruction. IEEE Trans. Med. Imaging 36(7), 1533–1541 (2017)


33. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sanchez, C.I.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017). arXiv:1702.05747 34. Liu, X., Deng, Z., Yang, Y.: Recent progress in semantic image segmentation. Artif. Intell. Rev. (2018). arXiv:1809.10198 35. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation, p. 10 36. Lopez-Linares, K., Aranjuelo, N., Kabongo, L., Maclair, G., Lete, N., Ceresa, M., GarciaFamiliar, A., Macia, I., Ballester, M.A.G.: Fully automatic detection and segmentation of abdominal aortic thrombus in post-operative CTA images using deep convolutional neural networks. Med. Image Anal. 46, 202–214 (2018) 37. Luc, P., Couprie, C., Chintala, S. and Verbeek, J.: Semantic segmentation using adversarial networks (2016). arXiv:1611.08408 38. Meng, Q., Roth, H.R., Kitasaka, T., Oda, M., Ueno, J., Mori, K.: Tracking and segmentation of the airways in chest CT using a fully convolutional network. In: Descoteaux, M., MaierHein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 198–207. Springer International Publishing (2017) 39. Milletari, F., Ahmadi, S.A., Kroll, C., Plate, A., Rozanski, V., Maiostre, J., Levin, J., Dietrich, O., Ertl-Wagner, B., Botzel, K., Navab, N.: Hough-CNN: deep learning for segmentation of deep brain regions in MRI and ultrasound (2016). arXiv:1601.07014 40. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation (2016). arXiv:1606.04797 41. Moeskops, P., Wolterink, J.M., van der Velden, B.H., Gilhuijs, K.G., Leiner, T., Viergever, M.A., Isgum, I.: Deep learning for multi-task medical image segmentation in multiple modalities. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2016, Lecture Notes in Computer Science, pp. 478–486. Springer International Publishing (2016) 42. Novikov, A.A., Major, D., Wimmer, M., Lenis, D., Buhler, K.: Deep sequential segmentation of organs in volumetric medical scans (2018). arXiv:1807.02437 43. Novikov, A.A., Lenis, D., Major, D., Hladuvka, J., Wimmer, M. and Buhler, K.: Fully convolutional architectures for multi-class segmentation in chest radiographs (2017). arXiv:1701.08816 44. Onieva, J., Andresen, L., Holsting, J.Q., Rahaghi, F.N., Ballester, M.A.G., Estepar, R.S.J., Roman, K.L.L., de La Bruere, I.: 3D pulmonary artery segmentation from CTA scans using deep learning with realistic data augmentation 45. Perone, Christian S., Calabrese, Evan, Cohen-Adad, Julien: Spinal cord gray matter segmentation using deep dilated convolutions. Sci. Rep. 8(1), 5966 (2018) 46. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351 of LNCS, pp. 234–241. Springer, Berlin (2015). arXiv:1505.04597 47. Roth, H.R., Shen, C., Oda, H., Oda, M., Hayashi, Y., Misawa, K., Mori, K.: Deep learning and its application to medical image segmentation (2018). arXiv:1803.08691 48. Salehi, S.S.M., Erdogmus, D., Gholipour, A.: Tversky loss function for image segmentation using 3D fully convolutional deep networks (2017). arXiv:1706.05721 49. 
Shen, C., Roth, H.R., Oda, H., Oda, M., Hayashi, Y., Misawa, K., Mori, K.: On the influence of Dice loss function in multi-class organ segmentation of abdominal CT using 3D fully convolutional networks (2018). arXiv:1801.05912 50. Shen, Dinggang, Guorong, Wu, Suk, Heung-Il: Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017) 51. Shin, H.C., Tenenholtz, N.A., Rogers, J.K., Schwarz, C.G., Senjem, M.L., Gunter, J.L., Andriole, K.P., Michalski, M.: Medical image synthesis for data augmentation and anonymization using generative adversarial networks (2018). arXiv:1807.10225 52. Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training (2016). arXiv:1612.07828


53. Stollenga, M.F., Byeon, W., Liwicki, M., Schmidhuber, J.: Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation (2015). arXiv:1506.07452 54. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going Deeper with Convolutions (2014). arXiv:1409.4842 55. Tai, L., Ye, H., Ye, Q., Liu, M.: PCA-aided fully convolutional networks for semantic segmentation of multi-channel fMRI (2016). arXiv:1610.01732 56. Vesal, S., Ravikumar, N., Maier, A.: Dilated convolutions in neural networks for left atrial segmentation in 3D gadolinium enhanced-MRI (2018). arXiv:1808.01673 57. Wolterink, J.M., Kamnitsas, K., Ledig, C. and Isgum, I.: Generative adversarial networks and adversarial methods in biomedical image analysis (2018). arXiv:1810.10352 58. Yang, D., Xu, D., Zhou, S.K., Georgescu, B., Chen, M., Grbic, S., Metaxas, D., Comaniciu, D.: Automatic liver segmentation using an adversarial image-to-image network. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L. Duchesne, S. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 507–515. Springer International Publishing (2017) 59. Yang, L., Zhang, Y., Guldner, I.H., Zhang, S., Chen, D.Z.: 3D segmentation of glial cells using fully convolutional networks and k-terminal cut. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2016, Lecture Notes in Computer Science, pp. 658–666. Springer International Publishing (2016) 60. Yang, X., Yu, L., Li, S., Wang, X., Wang, N., Qin, J., Ni, D., Heng, P.A.: Towards automatic semantic segmentation in volumetric ultrasound. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 711–719. Springer International Publishing (2017) 61. Yi, X., Walia, E., Babyn, P.: Generative adversarial network in medical imaging: a review (2018). arXiv:1809.07294 62. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions (2015). arXiv:1511.07122 63. Yu, L., Cheng, J.Z., Dou, Q., Yang, X., Chen, H., Qin, J., Heng, P.A.: Automatic 3D cardiovascular MR segmentation with densely-connected volumetric convnets. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 287–295. Springer International Publishing (2017) 64. Zhou, X., Ito, T., Takayama, R., Wang, S., Hara, T., Fujita, H.: Three-dimensional CT image segmentation by combining 2D fully convolutional network with 3D majority voting. In Carneiro, G., Mateus, D., Peter, L., Bradley, A., Tavares, J.M.R.S, Belagiannis, V., Papa, J.P., Nascimento, J.C., Loog, M., Lu, Z., Cardoso, J.S., Cornebise, J. (eds.) Deep Learning and Data Labeling for Medical Applications, Lecture Notes in Computer Science, pp. 111–120. Springer International Publishing (2016) 65. Zhou, Y., Xie, L., Fishman, E.K., Yuille, A.L.: Deep supervision for pancreatic cyst segmentation in abdominal CT scans. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer-Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 222–230. 
Springer International Publishing (2017) 66. Zhou, Y., Xie, L., Shen, W., Wang, Y., Fishman, E.K., Yuille, A.L.: A fixed-point model for pancreas segmentation in abdominal CT scans. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer Assisted Intervention MICCAI 2017, Lecture Notes in Computer Science, pp. 693–701. Springer International Publishing (2017)

Chapter 3

Medical Image Classification Using Deep Learning
Weibin Wang, Dong Liang, Qingqing Chen, Yutaro Iwamoto, Xian-Hua Han, Qiaowei Zhang, Hongjie Hu, Lanfen Lin and Yen-Wei Chen

Abstract Image classification is to assign one or more labels to an image, which is one of the most fundamental tasks in computer vision and pattern recognition. In traditional image classification, low-level or mid-level features are extracted to represent the image and a trainable classifier is then used for label assignments.

33

34

W. Wang et al.

networks has proven to be superior to hand-crafted low-level and mid-level features. In the deep convolutional neural network, both feature extraction and classification networks are combined together and trained end-to-end. Deep learning techniques have also been applied to medical image classification and computer-aided diagnosis. In this chapter, we first introduce fundamentals of deep convolutional neural networks for image classification and then introduce an application of deep learning to classification of focal liver lesions on multi-phase CT images. The main challenge in deep-learning-based medical image classification is the lack of annotated training samples. We demonstrate that fine-tuning can significantly improve the accuracy of liver lesion classification, especially for small training samples.

3.1 Introduction

3.1.1 What Is Image Classification

Image classification assigns one or more labels to an image. It is one of the most fundamental problems in computer vision and pattern recognition [1] and has a wide range of applications, for example, image and video retrieval [2], video surveillance [3], web content analysis [4], human-computer interaction [5], and biometrics [6]. Feature coding is a crucial component of image classification; it has been studied for many years and many coding algorithms have been proposed. In general, image classification first extracts features from the image and then classifies the extracted features, so how the image features are extracted and analyzed is the key point of image classification. Traditional classification methods use low-level or mid-level features to represent an image. Low-level features are usually based on grayscale density, color, texture, shape, and position information and are defined by humans (hence also known as hand-crafted features). Mid-level, or learning-based, features are commonly obtained with bag-of-visual-words (BoVW) algorithms [7, 8], which have been effective and popular in image classification and retrieval frameworks over the past few years. After the features are extracted, a classifier (e.g., SVM [9], random forest [10]) is usually used to assign a label to each type of object. The traditional image classification pipeline is shown in Fig. 3.1a. In contrast, the deep learning method combines feature extraction and classification in one network; this classification process is shown in Fig. 3.1b. The high-level feature representations of deep learning have proven to be superior to hand-crafted low-level and mid-level features and have achieved good results in image recognition and classification. This concept lies at the basis of the deep learning model (network), which is composed of many layers (such as convolutional layers and fully connected layers) that transform the input data (e.g., images) into outputs (e.g., classification results) while learning increasingly high-level features [11]. The main advantage


Fig. 3.1 Image classification frameworks. a Traditional classification method. b Deep learning method

of deep learning is that it can automatically learn data-driven (or task-specific), highly representative, hierarchical features and perform feature extraction and classification in one network, trained in an end-to-end manner. The details of the common deep learning architectures will be described in Sect. 3.1.2.

3.1.2 What Has Been Achieved in Image Classification Using Deep Learning

Before we detail the achievements of deep learning in image classification, we first introduce one of the most important datasets in image classification: ImageNet [12]. ImageNet is a dataset of over 15 million categorized high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human annotators using Amazon's Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images. In ILSVRC-2012, Alex Krizhevsky developed a convolutional neural network (CNN) with five convolution layers and three fully connected layers, known as AlexNet, which achieved the top result with a top-5 error of 15.3%, while the top-5 error rate of the runner-up using a non-deep-learning method was 26.2% [13]. Details of the architecture of AlexNet will be described in Sect. 3.2.5. AlexNet is a milestone that marks the shift of image classification methods from high-dimensional shallow feature encoding [14] (ILSVRC-2011) to deep convolutional neural networks (ILSVRC-2012 to 2017) [15–19].


In ILSVRC-2014, Simonyan [20] proposed a very deep convolutional neural network, known as the VGG network, for large-scale image recognition. Their team secured the first and second places in the localization and classification tracks of the ImageNet Challenge 2014, respectively; the top-1 and top-5 error rates were 23.7% and 6.8%, respectively. At the same time, Szegedy [17] proposed GoogLeNet, a 22-layer deep network, which achieved first place with a top-5 error rate of 6.67%. In ILSVRC-2015, He et al. [18] proposed a residual learning framework, known as ResNet. ResNet won first place with a top-5 error rate of 3.57% on the ImageNet test set. Their 152-layer residual net also showed excellent generalization performance on other recognition tasks, which led them to win first place in ImageNet detection, ImageNet localization, COCO detection and COCO segmentation in the ILSVRC and COCO 2015 competitions. The proposed residual mapping is easier to optimize than the original architecture. As mentioned above, deep learning techniques have achieved state-of-the-art classification accuracy on ImageNet. They are also widely used in medical applications, including medical image classification. In 2017, Esteva [19] applied deep neural networks to the classification of skin cancers. They first built a large dataset containing 129,450 clinical images, two orders of magnitude larger than previous datasets, and then focused on two critical binary classification cases: keratinocyte carcinomas versus benign seborrheic keratoses, and malignant melanomas versus benign nevi. The CNN (Google Inception V3) achieved dermatologist-level performance across both tasks.

3.2 Network Architecture The convolutional neural network structure is an improvement of the traditional artificial neural network (ANN), generally including convolution layers, pooling layers, and fully connected layers. In fact, the convolutional neural network is still a hierarchical network as the ANN, but the function and form of the layer have changed. It can be divided into two parts: feature extraction part (convolution layers and pooling layers) and classification part (fully connected layers). The image is first passed through a series of convolution, pooling layers for feature extraction and then is passed through fully connected layers for classification.

3.2.1 Convolution Layer The image through the convolution layer can be seen as a process of extracting features of the image. Before understanding the convolution layer, let’s compare the difference between images in human vision and computer vision. For example, an apple’s grayscale image is visually identified by brightness, size, and contour. In computer vision, this apple image is a matrix with only numbers, as shown in Fig. 3.2.


Fig. 3.2 Images in human vision (left) and computer vision (right)

Fig. 3.3 Convolution layer

When a computer learns from an image, it needs to extract the features of the image from this matrix. Convolution of the image is such a process. Taking a 5 × 5 image as an example, we choose a 3 × 3 matrix called a filter (or convolution kernel) that slides along the image with a step size of 1. At each position, the filter multiplies its values element-wise by the underlying image values and all of these products are summed up. The resulting value is one element of the feature matrix. After the filter has passed over the entire image, the feature matrix of the image is obtained. The process of convolution is shown in Fig. 3.3.
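To make the sliding-window computation concrete, the following is a minimal sketch in Python/NumPy (an illustrative implementation, not code from this chapter) that convolves a 5 × 5 image with a 3 × 3 filter using a stride of 1, exactly as described above.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, multiply element-wise and sum ("valid" padding).
    Note: like most CNN frameworks, this actually computes cross-correlation."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # one element of the feature matrix
    return out

image = np.arange(25, dtype=float).reshape(5, 5)                      # toy 5 x 5 "image"
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])       # toy 3 x 3 filter
print(conv2d(image, kernel).shape)                                     # (3, 3) feature map
```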

3.2.2 Pooling Layer In convolutional neural networks, a pooling layer is often added between convolution layers. The pooling layer can very effectively reduce the size of the parameter matrix and reduce the number of parameters in the last fully connected layer. Using the pooling layer can speed up the calculation and prevent over-fitting. In the field of image recognition, sometimes the size of the trained image is too large, we need to add a pooling layer between the convolution layers to reduce the


Fig. 3.4 Pooling layer

number of training parameters. Pooling is done in every depth dimension, so the depth of the image remains the same. The most common pooling form is the max pooling. The process of max pooling is as follows: We do the max pooling of a 4 × 4 matrix. The filter size is 2 × 2, the step size is set to 2, and the filter slides along the matrix. For each step, the maximum value in the filter region is used as an element of the pooled matrix. Repeat this process until the filter goes through the entire matrix. The pooling process is shown in Fig. 3.4.
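As a companion to the description above, here is a minimal NumPy sketch (illustrative only, not code from this chapter) of 2 × 2 max pooling with a stride of 2 applied to a 4 × 4 matrix.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Take the maximum of each size x size window, moving the window by `stride`."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool2d(x))   # [[6. 8.]
                       #  [3. 4.]]
```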

3.2.3 Fully Connected Layer

The fully connected layer is typically used for the classification task and forms the final part of a convolutional neural network: it takes the outputs of the preceding layers as inputs and maps them to the targets of the classification task. For instance, as shown in Fig. 3.5, suppose we obtain 5 outputs from the preceding convolution and pooling layers and want to map them to three categories. The 5 outputs are the key features that help determine which category the input image belongs to, and the three categories are the targets of the classification task, i.e., the outputs of the fully connected layer. The weights and bias of the fully connected layer form linear combinations of these key features to produce the 3 category scores and thus complete the classification task.
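The mapping from the five extracted features to three class scores is just an affine transformation. A minimal sketch (PyTorch assumed; the shapes are hypothetical, mirroring the 5-feature, 3-class example above):

```python
import torch
import torch.nn as nn

features = torch.randn(1, 5)                     # 5 outputs from the conv/pooling layers (batch of 1)
fc = nn.Linear(in_features=5, out_features=3)    # weight matrix W (3 x 5) and bias b (3)

scores = fc(features)                            # linear combination: features @ W^T + b
probs = torch.softmax(scores, dim=1)             # optionally turn scores into class probabilities
print(scores.shape, float(probs.sum()))          # torch.Size([1, 3]) 1.0
```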

3.2.4 Loss Function

Before we begin to discuss the training of neural networks, the last thing we need to define is the loss function. The loss function reflects the quality of the model prediction; it tells us how well the neural network performs on a particular task. During training, the network produces predicted values through each layer


Fig. 3.5 Fully connected layer

of operations, and then the loss function is used to calculate the difference between the predicted value and the true value. Training the neural network means reducing this difference (loss). Common loss functions in deep learning are the mean squared error, the cross-entropy loss and the hinge loss. If the loss is large, the neural network is performing poorly; the loss should be as small as possible.
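The three losses mentioned above are available off the shelf in most frameworks. A small PyTorch sketch with illustrative values (not code from this chapter):

```python
import torch
import torch.nn as nn

pred = torch.tensor([[2.0, 0.5, -1.0]])   # raw network outputs (logits) for one sample
target_class = torch.tensor([0])          # ground-truth class index

# Cross-entropy loss: combines softmax and negative log-likelihood.
ce = nn.CrossEntropyLoss()(pred, target_class)

# Mean squared error against a one-hot teaching signal.
one_hot = torch.tensor([[1.0, 0.0, 0.0]])
mse = nn.MSELoss()(torch.softmax(pred, dim=1), one_hot)

# Hinge-style (multi-margin) loss.
hinge = nn.MultiMarginLoss()(pred, target_class)

print(ce.item(), mse.item(), hinge.item())  # training tries to drive these values down
```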

3.2.5 AlexNet

In 2012, Alex et al. proposed a deep convolutional neural network (DCNN) and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [21]. The high-level feature representation of the DCNN has been demonstrated to be superior for image classification. With the development of neural networks, researchers have started applying DCNNs to medical fields. The network structure of AlexNet is shown in Fig. 3.6. AlexNet is an 8-layer structure, in which the first 5 layers are convolution layers and the last 3 layers are fully connected layers. The network has 60 million learnable parameters and 650,000 neurons. In AlexNet, the kernels of layers 2, 4, and 5 are connected only to the kernel maps residing on the same GPU, whereas layer 3 is fully connected to the kernel maps of the previous layer on both GPUs. The kernels within one convolution layer all have the same size; for example, AlexNet's first convolution layer contains 96 kernels of size 11 × 11 × 3. The first two convolution layers are each followed by an overlapping pooling layer, and the third, fourth, and fifth convolution layers are connected directly to one another. The fifth convolution layer is followed by an overlapping max pooling layer whose output enters two fully connected layers. The final fully connected layer feeds a 1000-way softmax that produces the class labels. The max pooling layer is typically used to downsample the width and height of the tensor while maintaining the depth. The overlapping pooling layer is the same as the max pooling layer, except that adjacent pooling windows overlap each other.


Fig. 3.6 The network structure of AlexNet

The pooling window used by AlexNet has a size of 3 × 3 with a stride of 2 between adjacent windows. For the same output size, compared with non-overlapping pooling windows of size 2 × 2 and stride 2, the overlapping pooling windows reduce the top-1 and top-5 error rates by roughly 0.4% and 0.3%, respectively.

3.2.6 ResNet

The depth of a deep learning network has a large influence on the final classification and recognition performance, so the conventional approach is to make the network as deep as possible; in practice, however, this does not always help. A conventional network has the form of a plain network: when such a network is deep and the training samples are few, the classification and recognition performance degrades. In particular, the training accuracy of a traditional CNN decreases as the network deepens, while a shallow network cannot significantly improve recognition performance either. The challenge is therefore to avoid vanishing gradients as the network becomes deeper. In 2015, He et al. proposed the deep residual network (ResNet) to solve this problem [18]; it is a state-of-the-art network with good classification performance on natural images. In a traditional CNN, the output of one convolution layer is the input of the next convolution layer, as shown in Fig. 3.7. Conversely, ResNet uses a shortcut connection, as shown in Fig. 3.8. Here, x represents the input feature and F(x) represents the residual knowledge. If we denote the desired output by H(x), then the residual is F(x) = H(x) − x. It is simpler for a network to learn this residual than to learn the original mapping directly. In addition, new feature knowledge can be learned via the shortcut connection. This effectively alleviates the problem of network degradation and improves network performance.
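A minimal sketch of the shortcut connection in Fig. 3.8 (PyTorch assumed; the layer sizes are illustrative, not the exact ResNet-50 bottleneck design):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x : the block only has to learn the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.conv1(x))
        residual = self.conv2(residual)      # F(x), the residual knowledge
        return self.relu(residual + x)       # shortcut connection adds the input back

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)            # torch.Size([1, 64, 56, 56])
```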


Fig. 3.7 Traditional CNN connection

Fig. 3.8 Shortcut connection

ResNet has also recently been applied to the field of medical image analysis. Bi et al. applied ResNet to liver lesion segmentation and achieved fourth place in the ISBI 2017 Liver Tumor Segmentation Challenge by the submission deadline [22]. Liang et al. proposed a ResNet with global and local pathways for FLL classification [23, 24]. Peng et al. proposed a multi-scale ResNet for the classification of emphysema [25]. The effectiveness of ResNet for medical image analysis has therefore been demonstrated.


3.3 Training

3.3.1 Training from Scratch

Once the network model is built, the next step is to choose a strategy for training it. For general classification problems, learning from scratch is a commonly used strategy. Before training the network, we need to initialize its internal parameters (generally with random values). Obviously, the initialized parameters will not give good results: training starts from a very poor neural network and gradually turns it into a network with high accuracy, which requires a large number of training samples and a lot of time.

3.3.2 Transfer Learning from a Pre-trained Network

As the number of layers in the network model grows and the training sample size becomes smaller, the parameters of the later layers in the model become difficult to train effectively. In our experiments, using a small dataset to train a deep network easily over-fits the data. To solve this problem, transfer learning is commonly used. The basic idea of transfer learning is to first train the network with a large dataset (e.g., ImageNet) and then use a target dataset to retrain the pre-trained network. It has been shown that transfer learning often performs better when the training dataset is small. The CNNs used for image classification consist of two parts: convolution layers, which are used for feature extraction, and fully connected layers, which are used for classification. Therefore, we can directly use the pre-trained network to extract features from the image, and the extracted feature vector is then used as input to train a new fully connected layer for the target classification problem. When retraining with a small amount of data, only the parameters of the last fully connected layer are updated; the parameters of the other layers remain identical to those of the pre-trained model.
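A minimal sketch of this strategy (assuming PyTorch and torchvision's ImageNet-pre-trained ResNet-50; the data pipeline and training loop are omitted): the convolutional layers are frozen and only a new fully connected layer is trained.

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)        # backbone pre-trained on ImageNet

for param in model.parameters():                # freeze all pre-trained layers
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 4)   # new classifier head (e.g., 4 lesion classes)

# Only the new fully connected layer is updated during retraining.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```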

3.3.3 Fine-Tuning

Fine-tuning differs from transfer learning, even though both first pre-train the network with a large dataset, such as ImageNet, and then retrain the pre-trained network with a small target image dataset. The core idea of fine-tuning is to use the layer parameters of the pre-trained model (except the last fully connected layer) as the initialization for retraining.


During retraining, the new training data update the parameters of every layer of the model. In general, transfer learning applies to situations where the pre-training data are similar to the new data. Fine-tuning, conversely, is suitable for situations where the pre-training data and the new data are not very similar. Even though both medical images and natural images are images, they differ significantly. Tajbakhsh et al. found in their experiments that knowledge learned from natural images can be transferred to medical images [26]: even when there are significant differences between natural and medical images, this knowledge transfer is still effective. Wang et al. applied a fine-tuning method to the segmentation of liver tumors and demonstrated its good performance [27]. Likewise, in this study, we applied fine-tuning to the classification of liver tumors and achieved good performance in our experiments.
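In contrast to the frozen-backbone strategy sketched in Sect. 3.3.2, fine-tuning initializes from the pre-trained weights and lets the new data update every layer; a common refinement is to use a smaller learning rate for the pre-trained layers than for the new head. A sketch (PyTorch/torchvision assumed; learning rates are illustrative):

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)         # initialize every layer from ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 4)    # replace the 1000-way head with 4 classes

# All parameters stay trainable; the pre-trained layers get a smaller learning
# rate than the freshly initialized classifier.
backbone_params = [p for name, p in model.named_parameters() if not name.startswith("fc.")]
optimizer = torch.optim.SGD(
    [{"params": backbone_params, "lr": 1e-4},
     {"params": model.fc.parameters(), "lr": 1e-3}],
    momentum=0.9)
```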

3.4 Application to Classification of Focal Liver Lesions

3.4.1 Focal Liver Lesions and Multi-phase CT Images

Liver cancer is one of the leading causes of death worldwide. Contrast-enhanced computed tomography (CT) is the most important imaging modality employed to detect and characterize focal liver lesions (FLLs). Contrast-enhanced CT scans comprise four phases acquired before and after the injection of contrast agent. A non-contrast-enhanced (NC) scan is performed before contrast injection. The after-injection phases include the arterial (ART) phase (30–40 s after contrast injection), the portal venous (PV) phase (70–80 s after contrast injection) and the delay (DL) phase (3–5 min after contrast injection). In this chapter, we focus our study on the classification of four common types of FLLs: cysts, focal nodular hyperplasia (FNH), hepatocellular carcinoma (HCC), and hemangioma (HEM). Typical images of these FLLs over three phases (NC, ART, PV) are shown in Fig. 3.9. The traditional method of diagnosing liver cancer relies on the experience and expertise of the doctor observing the CT images of the patient. This requires doctors to have sufficient experience and expertise, in addition to a large amount of time to diagnose each patient. Many groups are therefore studying how to use computers to aid diagnosis. In general, computer-aided diagnosis consists of three parts: feature extraction, feature analysis, and classification. Originally, in the feature extraction stage, low-level features such as tumor contours and tumor sizes were manually extracted from CT images by a radiologist, which is time-consuming. Currently, bag-of-visual-words methods are used to extract features from FLLs [28–30]. Features extracted using this method are called mid-level features, which have been proven to be effective and feasible for the retrieval and classification of FLLs in CT images.


Fig. 3.9 Typical images of FLLs over three-phases

Fig. 3.10 Overview of our method

3.4.2 Multi-channel CNN for Classification of Focal Liver Lesions on Multi-phase CT Images We used a ResNet model with 50 layers [18, 31] as our baseline network for FLL classification using multi-phase CT images. A multi-phase CT scan includes three phase images: the non-contrast (NC) phase scanned prior to contrast injection and the arterial (ART) and portal venous (PV) phases scanned at different times after the contrast injection. The three phase images are used as the input images of three channels in ResNet, in a manner similar to red, green, and blue images. Our proposed framework is shown in Fig. 3.10. For the preprocessing step, we extracted the regions of interest (ROIs) from these three phases according to the contours of the liver tumors labeled by experienced radiologists. After this extraction, we performed the registration according to their center points. Each ROI in the phase images was resized to 227 × 227 pixels using linear interpolation. We merged the three resized images into a new three-channel image (227 × 227 × 3 pixels) as the input for our CNN model. Each channel in ResNet corresponds to a CT phase image. After feature extraction using a ResNet block with 49 convolutional layers, we obtained the high-level features of the input multi-phase CT image and performed FLL classification via a fully connected layer.
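A sketch of the pre-processing step described above, assuming the three phase ROIs have already been cropped from the radiologist-labeled tumor contours (NumPy and OpenCV are assumed; the function name is illustrative):

```python
import numpy as np
import cv2

def make_multiphase_input(roi_nc, roi_art, roi_pv, size=227):
    """Resize each phase ROI and stack the three phases as the three input channels."""
    channels = []
    for roi in (roi_nc, roi_art, roi_pv):
        roi = cv2.resize(roi.astype(np.float32), (size, size),
                         interpolation=cv2.INTER_LINEAR)   # linear interpolation to 227 x 227
        channels.append(roi)
    return np.stack(channels, axis=-1)                      # (227, 227, 3), like an RGB image

# Example with random arrays standing in for the cropped phase ROIs.
x = make_multiphase_input(np.random.rand(60, 64),
                          np.random.rand(60, 64),
                          np.random.rand(60, 64))
print(x.shape)   # (227, 227, 3)
```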

Table 3.1 Network structure table

Layer name     Kernel size, output depth
Cov1           7 × 7, 64
Max Pooling    3 × 3
Cov2_x         [(1 × 1, 64); (3 × 3, 64); (1 × 1, 256)] × 3
Cov3_x         [(1 × 1, 128); (3 × 3, 128); (1 × 1, 512)] × 4
Cov4_x         [(1 × 1, 256); (3 × 3, 256); (1 × 1, 1024)] × 6
Cov5_x         [(1 × 1, 512); (3 × 3, 512); (1 × 1, 2048)] × 3
FC             4-D, softmax

The architecture of the residual CNN block is shown in Fig. 3.11. Table 3.1 shows the details of our network structure: a total of 50 layers, comprising 49 convolution layers and one fully connected layer. Cov1 is a convolutional layer with a 7 × 7 convolution kernel, a depth of 64, and a stride of 2. Cov2_x denotes a group of three convolutional layers, repeated three times. Similarly, Cov3_x, Cov4_x, and Cov5_x represent further sets of convolutional layers. In the last layer, we used a fully connected layer to classify the extracted high-level features. The final output ranges from 0 to 3 and represents the classification result for the four types of liver tumors.

The loss function we used is the mean squared error. Let $N$ be the number of samples and $x_i^{NC}$, $x_i^{ART}$, and $x_i^{PV}$ be the three channels of the $i$-th ($i = 1, 2, \ldots, N$) sample (the ROI). We use $W$ to represent the weights of the entire network. $p(j \mid x_i^{NC}, x_i^{ART}, x_i^{PV}; W)$ represents the probability that the $i$-th ROI belongs to class $j$. Because we classify each tumor into one of four classes (cyst, FNH, HCC, and HEM), the output of the network is a 4-dimensional vector whose $j$-th element is $p(j \mid x_i^{NC}, x_i^{ART}, x_i^{PV}; W)$ ($j = 1, 2, 3, 4$). Let $t_i = [t_i(1), t_i(2), t_i(3), t_i(4)]$ ($i = 1, 2, \ldots, N$) be the teaching signal (the label vector of the $i$-th training sample). If the $i$-th sample belongs to class $j$, only the $j$-th element $t_i(j)$ of $t_i$ is 1, and the other elements are 0. The loss function is as follows:

L = \frac{1}{2N} \sum_{i=1}^{N} \sum_{j=1}^{4} \left[ p(j \mid x_i^{NC}, x_i^{ART}, x_i^{PV}; W) - t_i(j) \right]^2   (3.1)

In our network model, we first trained the model with the ImageNet dataset, which contains over 1 million natural images, and saved the weights of the convolution layers (the residual CNN block). Because the pre-trained network classifies 1000 ImageNet categories, we changed the output of the fully connected layer to four categories. Then, we used our medical data to retrain the model. In our fine-tuning model, the parameters of all layers were updated. Our pre-training and retraining processes are shown in Fig. 3.12.
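Putting the pieces together, a minimal sketch of the retraining setup described above (PyTorch/torchvision assumed; the data pipeline is omitted and the batch below is dummy data): the pre-trained ResNet-50 gets a 4-way output, and the loss of Eq. (3.1) is the mean squared error between the softmax output and the one-hot teaching signal (up to a constant factor).

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)          # 49 convolution layers pre-trained on ImageNet
model.fc = nn.Linear(model.fc.in_features, 4)     # cyst, FNH, HCC, HEM

criterion = nn.MSELoss()                          # Eq. (3.1), up to a constant factor
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# One illustrative training step on a dummy batch of 3-channel (NC, ART, PV) ROIs.
x = torch.randn(8, 3, 227, 227)
labels = torch.randint(0, 4, (8,))
t = nn.functional.one_hot(labels, num_classes=4).float()   # teaching signal t_i

optimizer.zero_grad()
p = torch.softmax(model(x), dim=1)                # p(j | x_i^NC, x_i^ART, x_i^PV; W)
loss = criterion(p, t)
loss.backward()
optimizer.step()
```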


Fig. 3.11 Architecture of residual CNN Block


Fig. 3.12 Pre-training and retraining process

Table 3.2 The distribution of the dataset

Type        Cyst            FNH             HCC             HEM
            Set1    Set2    Set1    Set2    Set1    Set2    Set1    Set2
Training    98      96      56      58      82      78      84      76
Testing     21      23      15      13      21      25      11      19
Total       119             71              103             95

3.4.3 Experimental Results

In our experiments, 388 multi-phase CT images were collected from 2015 to 2017 at Sir Run Run Shaw Hospital, Zhejiang University, China. Each multi-phase CT image contains three phases (NC, ART, and PV). The resolution of the CT images is 512 × 512 pixels, and the thickness of each slice is 7 mm. The experimental data were annotated and categorized by experienced radiologists. The 388 liver CT images used in our experiments include four types of liver tumors: cysts, FNH, HCC, and HEM. We randomly divided the data into two sets, as shown in Table 3.2, and each set was divided into two parts: training (approximately 80%) and testing (approximately 20%). We used these two data sets to perform comparative experiments to verify the effectiveness of the proposed method. We first performed experiments with fine-tuning, with transfer learning and with learning from scratch, using AlexNet [13] and ResNet [31]. The comparison results are shown in Table 3.3. In this group of comparative experiments, the accuracy of transfer learning is only 60.04%, which is not suitable for the classification of liver tumors. The accuracies of learning from scratch using AlexNet and ResNet are 78.23% and 83.67%, respectively. We can see that fine-tuning significantly improved the classification accuracy of AlexNet from 78.23% to 82.94% and that of our model (ResNet) from 83.67% to 91.22%.


Table 3.3 Comparison of fine-tuning, learning from scratch and transfer learning

Method                             Cyst            FNH             HCC             HEM             Total accuracy (multi-phase)
AlexNet (learning from scratch)    83.96 ± 3.0     92.82 ± 0.5     77.33 ± 10.67   58.13 ± 5.5     78.23 ± 1.76
AlexNet (with fine-tuning)         92.86 ± 7.1     92.81 ± 0.5     78.57 ± 11.9    64.59 ± 17.2    82.94 ± 2.0
ResNet50 (learning from scratch)   93.27 ± 1.9     82.82 ± 9.5     84.47 ± 3.5     75.57 ± 7.1     83.67 ± 1.32
ResNet50 (transfer learning)       37.99 ± 14.19   47.69 ± 32.31   82.86 ± 2.86    62.2 ± 16.75    60.04 ± 1.22
ResNet50 (with fine-tuning)        95.44 ± 0.2     88.98 ± 4.3     91.24 ± 0.7     85.64 ± 3.8     91.22 ± 0.03

it’s learning from scratch or fine-tuning, the accuracy of using ResNet is higher than using AlexNet.The results show that Deeper network (ResNet) achieves better results and fine-tuning can effectively improve the accuracy of a CNN model for the classification of liver tumors. Due to the deepening of the network, the image features extracted by the convolutional neural network model are more detailed and advanced. In our experiments, not only the image classification on ImageNet, the classification accuracy with ResNet for the medical image dataset is much higher than the accuracy with AlexNet. Since liver tumors have different performances in different phases, we compared the three phases as separate training data to train our network, the results are shown in Table 3.4. The experiment results showed that the accuracy of training using ART (83.27%) was higher than that of NC (73.86%) and PV (80.44%). In order to effectively use the information of the three phases, we combine the three phases as our training data, and the detailed image pre-processing process is referred to 1.4.2. The classification accuracy of the combined three phases is higher than that of the single phase, indicating that the combination of the three phases can effectively improve the classification accuracy of liver tumors. We also compared our proposed method (ResNet with fine-tuning) [31] to stateof-the-art methods [23, 24, 32, 33] in Table 3.5. The accuracy (91.22%) of our model with fine-tuning is higher than the accuracies of the state-of-the-art methods.

Table 3.4 Comparison of single phase and multi-phase

Method                                          Cyst            FNH             HCC             HEM             Total accuracy
ResNet50 (with fine-tuning, single phase NC)    88.72 ± 1.76    67.44 ± 5.9     65.33 ± 1.33    70.58 ± 2.16    73.86 ± 2.61
ResNet50 (with fine-tuning, single phase ART)   90.68 ± 4.97    77.44 ± 15.9    82.1 ± 5.9      79.66 ± 11.24   83.27 ± 2.02
ResNet50 (with fine-tuning, single phase PV)    90.89 ± 0.41    67.44 ± 5.9     76.1 ± 0.1      83.02 ± 1.2     80.44 ± 0.44
ResNet50 (with fine-tuning, multi-phase)        95.44 ± 0.2     88.98 ± 4.3     91.24 ± 0.7     85.64 ± 3.8     91.22 ± 0.03

Table 3.5 Comparison of our proposed method with the state-of-the-art deep learning methods

Method                                Cyst           FNH             HCC             HEM             Total accuracy
Frid-Adar et al. [32]                 100.0 ± 0.0    78.20 ± 0.5     84.37 ± 16.6    40.67 ± 16.2    76.16 ± 0.6
Yasaka et al. [33]                    97.92 ± 2.9    82.26 ± 25.1    86.82 ± 2.32    85.16 ± 0.7     87.26 ± 7.7
ResGLNet [24]                         97.92 ± 2.9    81.99 ± 5.9     85.11 ± 15.6    85.42 ± 2.9     88.05 ± 4.8
ResGLBDLSTM [23]                      100.0 ± 0.0    86.74 ± 4.1     88.82 ± 10.3    87.75 ± 5.5     90.93 ± 0.7
Our model (ResNet with fine-tuning)   95.44 ± 0.2    88.98 ± 4.3     91.24 ± 0.7     85.64 ± 3.8     91.22 ± 0.03

3.5 Conclusion

Recently, multiple studies on deep learning have performed well in image classification; however, these studies require large quantities of training data. For the task of liver tumor image classification, it is not feasible to obtain large amounts of valid data. Therefore, we adopted a fine-tuning method and achieved high-accuracy liver tumor classification, solving the problems arising from the lack of sufficient training


data. In addition, we demonstrated that fine-tuning can significantly improve the classification accuracy for liver lesions and that our model with fine-tuning outperforms state-of-the-art methods. In the future, we intend to develop a novel network to achieve more accurate classifications of liver lesions. Acknowledgements We would like to thank Sir Run Run Shaw Hospital for providing medical data and helpful advice on this research. This work is supported in part by the Grant-in Aid for Scientific Research from the Japanese Ministry for Education, Science, Culture and Sports (MEXT) under the Grant Nos. 18H03267, 18K18078, in part by Zhejiang Lab Program under the Grant No. 2018DG0ZX01, in part by the Key Science and Technology Innovation Support Program of Hangzhou under the Grant No. 20172011A038.

References

1. Huang, Y., et al.: Feature coding in image classification: a comprehensive study. IEEE Trans. Pattern Anal. Mach. Intell. 36(3), 493–506 (2014)
2. Vailaya, A., et al.: Image classification for content-based indexing. IEEE Trans. Image Process. 10(1), 117–130 (2001)
3. Collins, T.R., et al.: A system for video surveillance and monitoring. VSAM final report, pp. 1–68 (2000)
4. Kosala, R., Hendrik, B.: Web mining research: a survey. ACM SIGKDD Explor. Newsl. 2(1), 1–15 (2000)
5. Pavlovic, I.V., Rajeev, S., et al.: Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Trans. Pattern Anal. Mach. Intell. 7, 677–695 (1997)
6. Jain, A.K., Arun, R., Salil, P.: An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 14(1), 4–20 (2004)
7. Cheng, G., Guo, L., Zhao, T., et al.: Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. Int. J. Remote Sens. 34(1), 45–59 (2013)
8. Csurka, G., et al.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, vol. 1, no. 1–22 (2004)
9. Chang, C., Lin, C.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
10. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
11. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
12. Deng, J., et al.: Imagenet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, 2009, CVPR 2009. IEEE (2009)
13. Alex, K., Sutskever, I., Hinton, E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)
14. Perronnin, F., Jorge, S., Thomas, M.: Improving the fisher kernel for large-scale image classification. In: European Conference on Computer Vision. Springer, Berlin, Heidelberg (2010)
15. Zeiler, D.M., Rob, F.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Springer, Cham (2014)
16. Sermanet, P., et al.: Overfeat: integrated recognition, localization and detection using convolutional networks (2013). arXiv:1312.6229
17. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
18. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)


19. Esteva, A., et al.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115 (2017)
20. Simonyan, K., Andrew, Z.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
21. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010)
22. Bi, L., Kim, J., Kumar, A., et al.: Automatic liver lesion detection using cascaded deep residual networks (2017). arXiv:1704.02703
23. Liang, D., et al.: Combining convolutional and recurrent neural networks for classification of focal liver lesions in multi-phase CT images. In: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2018) (2018)
24. Liang, D., et al.: Residual convolutional neural networks with global and local pathways for classification of focal liver lesions. In: Pacific Rim International Conference on Artificial Intelligence. Springer, Cham (2018)
25. Peng, L., et al.: Classification and quantification of emphysema using a multi-scale residual network. IEEE J. Biomed. Health Inform. (2019) (in press)
26. Tajbakhsh, N., et al.: Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Trans. Med. Imaging 35(5), 1299–1312 (2016)
27. Wang, G., Li, W., Zuluaga, M.A., et al.: Interactive medical image segmentation using deep learning with image-specific fine-tuning. IEEE Trans. Med. Imaging (2018)
28. Xu, Y., et al.: Texture-specific bag of visual words model and spatial cone matching based method for the retrieval of focal liver lesions using multiphase contrast-enhanced CT images. Int. J. Comput. Assis. Radiol. Surg. 13, 151–164 (2018)
29. Wang, J., et al.: Tensor-based sparse representations of multi-phase medical images for classification of focal liver lesions. Pattern Recognit. Lett. (2018)
30. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
31. Wang, W., et al.: Classification of focal liver lesions using deep learning with fine-tuning. In: Proceedings of Digital Medicine and Image Processing (DMIP 2018), pp. 56–60 (2018)
32. Frid-Adar, M., et al.: Modeling the intra-class variability for liver lesion detection using a multi-class patch-based CNN. In: International Workshop on Patch-Based Techniques in Medical Imaging. Springer, Cham (2017)
33. Yasaka, K., et al.: Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: a preliminary study. Radiology 286(3), 170706 (2017)

Chapter 4

Medical Image Enhancement Using Deep Learning

Yinhao Li, Yutaro Iwamoto and Yen-Wei Chen

Abstract This chapter introduces medical image enhancement technology using 2-dimensional and 3-dimensional deep learning. It starts from the basic building blocks, the convolution layer, the deconvolution layer, loss functions and evaluation functions, so that beginners can follow easily. Then, typical state-of-the-art super-resolution methods using 2D or 3D convolutional neural networks are introduced. From the experimental results of the networks introduced in this chapter, readers can not only compare the network structures but also gain a general understanding of network performance.

Y. Li · Y. Iwamoto · Y.-W. Chen (B)
Ritsumeikan University, Kusatsu, Japan
e-mail: [email protected]
Y. Li e-mail: [email protected]
Y. Iwamoto e-mail: [email protected]

© Springer Nature Switzerland AG 2020
Y.-W. Chen and L. C. Jain, Deep Learning in Healthcare, Intelligent Systems Reference Library 171, https://doi.org/10.1007/978-3-030-32606-7_4

4.1 Introduction

In recent decades, images with higher quality have been increasingly desired and required, because a high-quality image provides more details and thus more accurate and effective information for diagnosis by physicians and for computer-aided diagnosis. Image enhancement is the process of adjusting digital images (e.g., super-resolution, noise reduction, deblurring, contrast improvement) so that the results are more suitable for display or for further image analysis such as classification, detection and segmentation [1]. For instance, identifying key features in a low-quality image becomes much easier after removing noise, sharpening, or increasing the density of pixels in the image.


A promising and classical family of approaches that uses signal processing techniques to obtain an HR image from one or more observed low-resolution (LR) images is called super-resolution (SR) image reconstruction. SR processing is an inverse problem that combines denoising, deblurring, and scaling-up tasks, aiming to recover a high-quality signal from degraded versions. It has been one of the most active research areas due to its wide range of applications [2, 3]. SR methods can be classified into two sorts: (i) classical multi-frame super-resolution [4–8], and (ii) single image super-resolution (SISR) [9–22]. In classical multi-frame SR, a set of LR images of the same scene is needed, and the displacement information between the multiple LR images is used. First, registration between the misaligned LR images is performed. Next, based on the pixel values of the multiple LR images and the acquired registration information, the HR image is restored. If enough LR images are available, the set of equations becomes determined and recovering the high-resolution image is practicable. However, this approach is limited to a slight improvement in resolution (by factors smaller than 2) [23–25]. Conventional multi-frame SR is also useful in medical imaging such as computed tomography (CT) and magnetic resonance imaging (MRI), since the acquisition of multiple images is possible, but the achievable resolution is limited. Considering that medical images have a high acquisition cost, SISR methods, which learn mapping functions from external low- and high-resolution exemplar pairs, are more popular for medical image super-resolution. Furthermore, learning-based methods using deep learning (which often use large quantities of pairs of LR and HR training images from public natural image databases) have recently been proven to be more precise and rapid than classical multi-frame SR. The fundamentals of SR will be illustrated in Sect. 4.3. In this chapter, medical image enhancement methods based on super-resolution and deep learning will be introduced, ranging from the basic structure of the convolutional neural network (CNN) to typical state-of-the-art CNNs for super-resolution.

4.2 Network Architecture

The convolutional neural network (CNN) is one of the successful ideas for learning a neural network with many layers. By creating in advance a connection structure corresponding to the task, the degrees of freedom of the connection weights are reduced and learning becomes easier. In addition, thanks to the appearance of modern powerful GPUs and the easy access to an abundance of data (like ImageNet [26]), convergence has become much faster [27].


Fig. 4.1 Schematic diagram of 2D convolution operation

4.2.1 Convolution Layer

2D Convolution. In the case of a CNN, the convolution is performed on the input data using a filter or kernel to produce a feature map; this is one of the main building blocks of a CNN. The operation of 2D convolution is shown in Fig. 4.1: the filter (3 × 3 yellow square) slides over the input (blue square) and the sum of the convolution goes into the feature map (red square). The area covered by the filter is also called the receptive field. In a CNN for SR, numerous convolutions are performed on the input, each operation using a different filter and thus producing a different feature map. All of these feature maps are then concatenated as the output of the convolution layer.

3D Convolution. Different from 2D convolution, 3D convolution (illustrated in Fig. 4.2) applies a 3-dimensional filter to the data, and the filter moves in three directions (x, y, z) to calculate low-level feature representations. The output is a 3-dimensional volume, such as a cube. Compared to conventional 2D convolution, 3D convolution can combine more pixels, and the continuity of pixel information between slices can be effectively maintained. In recent years, CNNs based on 3D convolution have become more and more popular in medical image enhancement, lesion classification and lesion detection for CT or MR images.
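A minimal PyTorch sketch (illustrative shapes, not code from this chapter) of the difference: Conv2d slides over (height, width) only, while Conv3d also slides along the slice axis of a CT/MR volume.

```python
import torch
import torch.nn as nn

volume = torch.randn(1, 1, 16, 64, 64)    # (batch, channels, depth/slices, height, width)

conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv3d(volume)                  # the kernel moves along x, y and z
print(features.shape)                      # torch.Size([1, 8, 16, 64, 64])

# The 2D counterpart would process each slice independently:
conv2d = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
print(conv2d(volume[:, :, 0]).shape)       # torch.Size([1, 8, 64, 64])
```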


Fig. 4.2 Simple schematic diagram of 3D convolution operation

Fig. 4.3 Illustration of convolution and deconvolution operations

4.2.2 Deconvolution Layer

Deconvolution is a commonly used method for finding a set of kernels and feature maps from which the image can be rebuilt. It is a very useful operation for magnifying images in super-resolution processing, because every pixel value in the HR space must be predicted and computed correctly. As shown in Fig. 4.3, deconvolution works like the inverse processing of convolution.
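In modern frameworks this deconvolution (transposed convolution) is a standard layer. A minimal PyTorch sketch for 2x upscaling of a feature map (layer widths are illustrative):

```python
import torch
import torch.nn as nn

lr_features = torch.randn(1, 64, 32, 32)           # low-resolution feature maps

# Transposed convolution ("deconvolution") with stride 2 doubles the spatial size.
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=64,
                            kernel_size=4, stride=2, padding=1)
hr_features = deconv(lr_features)
print(hr_features.shape)                            # torch.Size([1, 64, 64, 64])
```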

4.2.3 Loss Layer

The loss layer of a neural network compares the output of the network with the ground truth. For instance, in an image processing problem, the difference between the processed images and the reference patches is computed in this layer via the loss function [28]. A loss function is a method of evaluating how well an algorithm models a dataset. For an error function $\varepsilon$, the loss for a patch $P$ can be written as

L(P) = \frac{1}{N} \sum_{p \in P} \varepsilon(p) ,   (4.1)

where $N$ is the number of pixels in the patch $P$.


The mean squared error (MSE), the $\ell_2$ norm, is very popular among optimization problems due to its convenient properties. Given $Y$ as a vector of $n$ predictions generated from a sample of $n$ data points and $X$ as the corresponding vector of the ground-truth data, the within-sample MSE of the predictor is computed as

MSE = \frac{1}{n} \sum_{i=1}^{n} (X_i - Y_i)^2 .   (4.2)

Recently the $\ell_2$ norm has become one of the most widely used loss functions in SR reconstruction. However, the $\ell_2$ norm regrettably does not capture the intricate characteristics of the human visual system, so the $\ell_2$ norm and the peak signal-to-noise ratio (introduced in Sect. 4.2.4) do not correlate well with human perception of image quality [29]. Thus, using the $\ell_1$ norm (mean absolute error, MAE) instead of $\ell_2$ is commonly adopted in the SR reconstruction problem to reduce the artifacts produced by the $\ell_2$ loss function. The MAE ($\ell_1$ norm) is simply:

MAE = \frac{1}{n} \sum_{i=1}^{n} |X_i - Y_i| .   (4.3)

4.2.4 Evaluation Functions

Peak signal-to-noise ratio, abbreviated as PSNR, is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation [30]. It is the most commonly used evaluation function in image reconstruction problems. PSNR is defined via the MSE. Given an $m \times n$ monochrome high-resolution image $Y$ (ground truth) and its noisy approximation $X$, the MSE is defined as:

MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [X(i, j) - Y(i, j)]^2 .   (4.4)

The PSNR (in dB) is defined as:

PSNR = 10 \cdot \log_{10} \left( \frac{MAX^2}{MSE} \right) .   (4.5)

Here, MAX is the maximum possible pixel value of the image. When the pixels are represented using 8 bits per sample, MAX is 255. For color images with three RGB values per pixel, the definition of PSNR is the same except the MSE is the sum over all squared value differences divided by image size and by three [31, 32].
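Equations (4.4) and (4.5) translate directly into code. A minimal NumPy sketch for 8-bit images (illustrative only):

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB between an approximation x and ground truth y."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

y = np.random.randint(0, 256, (64, 64), dtype=np.uint8)     # "ground truth"
x = np.clip(y + np.random.normal(0, 5, y.shape), 0, 255)    # noisy approximation
print(round(psnr(x, y), 2), "dB")
```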


Typical values for the PSNR of lossy images are between 30 and 50 dB when the bit depth is 8 bits, and higher is better. Typical PSNR values for 16-bit data are between 60 and 80 dB [33, 34]. The larger the PSNR value, the higher the similarity of the two images.

Structural similarity (SSIM) is an index for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. The basic concept of structural similarity is that natural images are highly structured [35]; that is, there are strong correlations between adjacent pixels in natural images, and these correlations carry structural information about the objects in the scene. The human visual system is adept at extracting such structural information when viewing images. Therefore, the measurement of structural distortion is an important part of designing image quality metrics. SSIM is such a method, designed for measuring the similarity between two images. Given two images $x$ and $y$, the structural similarity between the two is defined as:

SSIM(x, y) = [l(x, y)]^{\alpha} [c(x, y)]^{\beta} [s(x, y)]^{\gamma} ,
l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} , \quad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} , \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3} ,   (4.6)

where $l(x, y)$, $c(x, y)$, and $s(x, y)$ compare the luminance, contrast, and structure between $x$ and $y$, respectively. $\alpha$, $\beta$, and $\gamma$ (all larger than zero) are parameters that adjust the relative importance of $l(x, y)$, $c(x, y)$, and $s(x, y)$. $\mu_x$ and $\mu_y$, $\sigma_x$ and $\sigma_y$ are the means and standard deviations of $x$ and $y$, and $\sigma_{xy}$ is the covariance of $x$ and $y$. $C_1$, $C_2$ and $C_3$ are constants for maintaining stability. The larger the structural similarity index, the higher the similarity of the two images.
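As an illustration of Eq. (4.6), the sketch below computes a single global SSIM value with alpha = beta = gamma = 1 and C3 = C2/2 (a common simplification); the standard SSIM index instead averages this quantity over local sliding windows, for which library implementations (e.g., scikit-image) are normally used.

```python
import numpy as np

def global_ssim(x, y, max_val=255.0):
    """Simplified, whole-image SSIM (standard practice uses local sliding windows)."""
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # With alpha = beta = gamma = 1 and C3 = C2/2, Eq. (4.6) collapses to:
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

y = np.random.randint(0, 256, (64, 64))
print(round(global_ssim(y, y), 3))   # 1.0 for identical images
```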

4.3 Medical Image Enhancement by 2D Super-Resolution

The resolution downgrading process from an HR image $X$ to an LR image $Y$ can be represented as:

Y = f(X) ,   (4.7)

where $f$ is the function causing the loss of resolution. The SISR process is to find an inverse mapping function $g(\cdot) \approx f^{-1}(\cdot)$ to recover the HR image $X$ from an LR image $Y$:

X = g(Y) = f^{-1}(Y) + R ,   (4.8)

where $f^{-1}$ and $R$ are the inverse of $f$ and the reconstruction residual, respectively. In a CNN SISR approach, three different steps are optimized together: feature extraction, non-linear mapping, and reconstruction.


Fig. 4.4 Basic structure of convolution neural network for image super-resolution

During training, the difference between the reconstructed images and the ground-truth images is used not only to adjust the reconstruction layer so that better images are restored from the learned manifold, but also to guide the extraction of accurate image features. In this part, typical 2D CNNs proposed in recent years for super-resolution will be introduced.

4.3.1 Super-Resolution Convolutional Neural Network (SRCNN)

This structure, proposed by Dong et al. [36, 37] in 2014, is the first convolutional neural network for super-resolution reconstruction. An overview of the Super-Resolution Convolutional Neural Network (SRCNN) is depicted in Fig. 4.4, which shows the basic structure of a CNN for super-resolution. In [38], Dong et al. proposed an improved version for faster SR processing of RGB images. As pre-processing, the single LR image is first upscaled to the desired size using bicubic interpolation. The goal is to recover from the interpolated image Y a target image F(Y) that is as similar as possible to the ground-truth HR image X. Although Y has the same size as X, it is still a low-resolution image. Learning the mapping function F with a convolutional neural network consists of the following three steps: (i) patch extraction and representation, (ii) non-linear mapping, and (iii) reconstruction. Next, the details of each operation are described.

4.3.1.1

Patch Extraction and Representation

This operation extracts patches from the low-resolution image Y and represents each patch as a high dimensional vector. These vectors comprise a set of feature maps, and the number equals to the dimensionality of the vectors.


A popular strategy in image restoration is to densely extract patches and then represent them by a set of pretrained bases, such as [39]. That is equivalent to convolving the image by a set of filters, each of which is a basis. Dong et al. involve the optimization of these bases into the optimization of the network [37]. The first layer can be expressed as an operation F1 : F1 (Y) = max(0, W1 ∗ Y + B1 ) ,

(4.9)

where W1 and B1 represent the filters and biases respectively. * denotes the convolution operation. Here, W1 corresponds to n1 filters of support c × f1 × f1 , where c is the number of channels in the input image, f1 is the spatial size of a filter. The output is composed of n1 feature maps. b1 is an n1 -dimensional vector, of which each element is associated with a filter. Then rectified linear units (max (0, x)) [40] are commonly adopted on the filter responses.

4.3.1.2

Non-linear Mapping

The first layer extracts an n1 -dimensional feature for each patch. In the second operation, each of these n1-dimensional vectors is mapped into an n2 -dimensional vector. This is equivalent to applying n2 filters with a trivial spatial support 1 × 1. This interpretation is only valid for 1 × 1 filters. The operation of the second layer is: F2 (Y) = max(0, W2 ∗ F1 (Y) + B2 ) ,

(4.10)

where W2 contains n2 filters of size n1 × f2 × f2 , and B2 is in n2 -dimension. Each of the output (n2 -dimensional vectors) is a representation of an HR patch which will be used for reconstruction. Recently, adding more convolutional layers for increasing the non-linearity has been demonstrated that can effectively improve the accuracy of results. But it tends to increase the complexity of the model, and thus demands more training time and memory of GPU.

4.3.1.3

Reconstruction

This operation merges the above HR patch-wise representations to generate the final HR, which is expected to be similar to the ground truth X. In traditional SR methods, for producing the final full image, the predicted HR patches are often overlapped and averaged. The averaging can be considered as a predefined filter on a set of feature maps, so a convolutional layer to produce the final high-resolution image was defined as: F(Y) = W3 ∗ F2 (Y) + B3 ,

(4.11)

4 Medical Image Enhancement Using Deep Learning

61

where W3 corresponds to c filters of a size n2 × f3 × f3 , and B3 is a c-dimensional vector. When W3 is designed as a set of linear filters, it works like traditional averaging processing in reconstruction part. Although the above three operations (patch extraction or representation, nonlinear mapping and reconstruction) are motivated by different intuitions, they all lead to the same form as a convolutional layer. Thus, Dong et al. put all three operations together and form a convolutional neural network called SRCNN as shown in Fig. 4.4. All the filtering weights and biases are to be optimized in this model during training by machine.

4.3.2 Very Deep Super-Resolution Network (VDSR) The SRCNN introduced in part Sect. 4.3.1 successfully applied a deep learning technique into the SR problem via deep learning. Nevertheless, it still has limitations in three aspects: • It relies on the context of small image regions. In other word, the respective field is too small. • Small learning rate should be set in advance, in that training converges too slowly. • The network only works for a single scale. For resolving the above three issues, Kim et al. proposed a new method by CNN called Very Deep Super-Resolution network (VDSR) [41]. The network structure, which uses a very deep CNN inspired by Simonyan and Zisserman [42], is outlined in Fig. 4.5. D layers are set, and in which layers except the first and the last are of the same type: 64 filters of the size 3 × 3 × 64. Each filter operates on 3 × 3 spatial regions across 64 channels, and thus 64 feature maps will be reconstructed by 64 feature maps from the previous layer. The first layer operates on the input image as patch extraction illustrated in Sect. 4.3.1.1. The last layer consists of a single filter of size 3 × 3 × 64, is used for image reconstruction, which is similar to the part introduced in Sect. 4.3.1.3. This network also takes an interpolated LR image with desired size as input and predicts image details like SRCNN. However, VDSR is much deeper than SRCNN.

Fig. 4.5 Structure of VDSR

62

Y. Li et al.

One problem using a very deep network to predict outputs is that the size of the feature map gets reduced smaller and smaller while convolutional operations are applied. For instance, given an input of size (n + 1) × (n + 1) is applied to a network with receptive field size m × n, the output will be 1 × 1. In order to resolve this issue, padding zeros before convolutions has been demonstrated to be surprisingly effective to keep the sizes of all feature maps (including the output image) the same. Due to that, pixels near the image boundary can also be correctly predicted by this method, which outperforms the conventional methods. Main improvement points can be summarized as follows: • VDSR uses a larger receptive field in size 41 × 41 compared to previous research and takes a larger image context into account since information contained in a small patch is not sufficient for detail recovery for a large scale factor. • Because LR image and HR image almost share the same information (low frequency components), to learn and model the residual image (difference between HR and LR images) is advantageous. Besides, initial learning rate can be set 104 times higher than SRCNN by residual-learning and gradient clipping, in that training progress becomes relatively accurate and fast.

4.3.3 Efficient Sub-pixel Convolutional Neural Network (ESPCN) While super-resolve a LR image into HR space, to increase the resolution of the LR image to match that of the HR image at some point is necessary and significant. A popular approach is adding a preprocessing that increase the resolution before input them into the network [37, 43, 44]. However, this approach has some drawbacks. Firstly, increasing the resolution of the LR images before the image enhancement step increases the computational complexity. Secondly, interpolation methods for upscaling images such as bicubic interpolation, do not bring additional information to solve the ill-posed reconstruction problem. Thus, Shi et al. proposed a novel network called Efficient Sub-Pixel Convolutional Neural Network (ESPCN), which increases the resolution from LR to HR only at the last layer of whole network (which is often called subpixel convolution layer or pixel shuffle layer) and reconstruct HR data from LR feature maps. As shown in Fig. 4.6, to avoid upscaling LR image before feeding it into the network, a conventional CNN like SRCNN with feature map extraction and nonlinear mapping layers is applied directly to the LR image, and then a subpixel convolution layer that upscales the LR feature maps to produce SR image replace the original convolution layer (for reconstruction). This effective way proposed in [45] can be indicated as: I S R = f L (I S R ) = PS (W L ∗ f L−1 (I L R ) + b L ) ,

(4.12)


Fig. 4.6 Structure of ESPCN. Three convolution layers are usually set for feature map extraction and non-linear mapping. The sub-pixel convolution layer aggregates the feature maps from LR space and reconstructs the SR image as the final result

where PS is a periodic shuffling operator that rearranges the elements of an $H \times W \times (C \cdot r^2)$ tensor into a tensor of shape $rH \times rW \times C$. The effects of this operation are illustrated in Fig. 4.6. Mathematically, this operation can be described as:

$PS(T)_{x,y,c} = T_{\lfloor x/r \rfloor,\; \lfloor y/r \rfloor,\; C \cdot r \cdot \mathrm{mod}(y,r) + C \cdot \mathrm{mod}(x,r) + c}$.

(4.13)

Thus, the convolution operator $W_L$ has shape $n_{L-1} \times r^2C \times k_L \times k_L$. In this work, upscaling is handled in the last layer by pixel shuffling. Different from previous networks, each LR image is fed directly to the network as input, and feature extraction is performed through non-linear convolutions in LR space. Owing to the reduced input resolution, filters of smaller size become practicable for integrating the same information while maintaining a given contextual area. Moreover, the reduction of resolution and filter size drastically lowers the computational and memory complexity, allowing super-resolution of high-definition videos in real time.
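The periodic shuffling operator itself is easy to implement; the NumPy sketch below illustrates the rearrangement of Eq. (4.13) for a single image, up to the channel-ordering convention (frameworks differ on this point, and TensorFlow's tf.nn.depth_to_space or PyTorch's PixelShuffle provide built-in equivalents).

```python
import numpy as np

def pixel_shuffle(t, r):
    """Rearrange an (H, W, C*r^2) tensor into an (rH, rW, C) tensor."""
    h, w, cr2 = t.shape
    c = cr2 // (r * r)
    t = t.reshape(h, w, c, r, r)      # split the channel axis into (C, r, r)
    t = t.transpose(0, 4, 1, 3, 2)    # interleave: (H, r, W, r, C)
    return t.reshape(h * r, w * r, c)

lr_features = np.random.rand(16, 16, 4)        # C = 1, r = 2, so C*r^2 = 4 channels
print(pixel_shuffle(lr_features, r=2).shape)   # (32, 32, 1)
```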

4.3.4 Dense Skip Connections Based Convolutional Neural Network for Super-Resolution (SRDenseNet)

Deeper networks have been demonstrated to achieve good performance in SR, because a larger receptive field gathers more contextual information from the LR images, which helps predict information in HR space. Nevertheless, it is challenging to effectively train a very deep CNN due to the vanishing gradient problem. Using skip connections, which create short paths from top layers to bottom layers, is a good choice. In most previous works, such as SRCNN and VDSR, only high-level features at the top layers are used for reconstructing HR images. However, features at low levels can potentially provide additional information to reconstruct the high-frequency details


in HR images, and image SR may benefit from the collective knowledge of features at different levels. Consequently, Tong et al. proposed a novel SR method termed the Dense Skip Connections based Convolutional Neural Network for Super-Resolution (SRDenseNet), in which densely connected convolutional networks are employed. The particularities and advantages of SRDenseNet can be summarized as follows:
• Dense connections effectively improve the flow of information through the network, alleviating the gradient vanishing problem.
• To avoid re-learning redundant features, the reuse of feature maps from preceding layers is allowed.
• Dense skip connections are utilized to combine the low-level and high-level features in order to provide richer information for the SR reconstruction.
• Deconvolution layers are integrated to recover the image details and to speed up the reconstruction process.
Figure 4.7 shows the structure of SRDenseNet. Inspired by the DenseNet structure first proposed in [46], after applying a convolution layer to the input LR images for learning low-level features, several DenseNet blocks are adopted for learning the high-level features. Different from the ResNets proposed in [47], the feature maps in DenseNet are concatenated rather than directly summed. The ith layer receives the feature maps of all preceding layers as input:

$X_i = \max(0, w_i * [X_1, X_2, \ldots, X_{i-1}] + b_i)$,

(4.14)

where $[X_1, X_2, \ldots, X_{i-1}]$ represents the concatenation of the feature maps generated in the preceding convolution layers $1, 2, \ldots, i-1$. This kind of dense connection structure strengthens the flow of information through deep networks and alleviates the vanishing-gradient problem. The structure of each DenseNet block is shown in Fig. 4.8. Specifically, there are 8 convolution layers in one DenseNet block in this work. If each convolution layer produces k feature maps as output, the total number of feature maps generated by one DenseNet block is k × 8, where k is referred to as the growth rate. The growth rate k regulates how much new information each layer contributes to the final reconstruction. The growth rate k is experimentally set to 16

Fig. 4.7 The structure of DenseNet for super-resolution. For reconstructing HR images, features in all levels are combined via skip connections


Fig. 4.8 The structure of one dense block. This standard example consists of 8 convolution layers with a growth rate of 16, and the output has 128 feature maps

to prevent the network from growing too wide. Thus, a total of 128 feature maps can be created from one DenseNet block. The deconvolution layer can learn diverse upscaling kernels that work jointly to predict the HR images, and it can be considered the inverse operation of a convolution layer. Using deconvolution layers for upscaling has two advantages:
• The SR reconstruction process can be accelerated. The whole computational process is performed in LR space before the deconvolution step, so the computational cost drops significantly.
• A large amount of contextual information from the LR images is extracted and learned to infer the high-frequency details.
Two successive deconvolution layers with 3 × 3 kernels and 256 feature maps are trained for upscaling. In short, all feature maps in the network are concatenated, yielding large quantities of feature maps for the subsequent deconvolution layers.
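As a sketch, the dense block of Fig. 4.8 can be written in Keras as follows (an illustrative re-implementation, not the authors' code); with 8 layers and a growth rate of 16 the block emits the 128 feature maps described above.

```python
from tensorflow.keras import layers, models

def dense_block(x, num_layers=8, growth_rate=16):
    features = [x]
    for _ in range(num_layers):
        # Each 3x3 convolution sees the concatenation of the block input and
        # all previously produced feature maps, as in Eq. (4.14).
        inp = features[0] if len(features) == 1 else layers.Concatenate()(features)
        features.append(layers.Conv2D(growth_rate, 3, padding='same',
                                      activation='relu')(inp))
    # Block output: the 8 x 16 = 128 newly generated feature maps.
    return layers.Concatenate()(features[1:])

inp = layers.Input(shape=(None, None, 16))
block = models.Model(inp, dense_block(inp))   # output has 128 channels
```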

4.3.5 Residual Dense Network for Image Super-Resolution (RDN)

As the network depth grows, the features in each convolutional layer become hierarchical, with different receptive fields. However, previous works on SR neglect to fully use the information of each convolutional layer, since the local convolutional layers do not have direct access to the subsequent layers. Thus, Zhang et al. proposed the Residual Dense Network (RDN) shown in Fig. 4.9, which consists of four parts: a shallow feature extraction net, residual dense blocks (RDBs), dense feature fusion, and an up-sampling net [48]. Supposing there are D residual dense blocks, the output $F_d$ of the dth RDB can be obtained by

$F_d = H_{RDB,d}(F_{d-1}) = H_{RDB,d}(H_{RDB,d-1}(\cdots(H_{RDB,1}(F_0))\cdots))$,

(4.15)


Fig. 4.9 The architecture of residual dense net (RDN) for image super-resolution

Fig. 4.10 Residual dense block (RDB) architecture

where $H_{RDB,d}$ denotes the operations of the dth RDB. $H_{RDB,d}$ can be a composite function of operations, such as convolution and ReLU. $F_d$ can be viewed as local features, since it is produced by the dth RDB, which utilizes all convolutional layers within the block. As shown in Fig. 4.10, the proposed RDB contains densely connected layers, local feature fusion, and local residual learning, leading to a contiguous memory mechanism. The extracted local and global features of LR space are then mapped to HR space via a pixel shuffle layer. In summary, this work has three aspects of novelty:
• A new framework, RDN, was proposed for high-quality image SR, in which all the hierarchical features from the original LR image are fully used.
• A residual dense block can not only read state from the preceding RDB via a contiguous memory mechanism, but also fully utilize all the layers within it via local dense connections. The accumulated features are then preserved by local feature fusion.
• Shallow features and deep features are combined by global residual learning, resulting in global dense features from the original LR image.
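A rough Keras sketch of one RDB is given below to illustrate the three mechanisms (dense connections, local feature fusion, and local residual learning); the layer count and growth rate here are illustrative assumptions rather than the settings used in [48].

```python
from tensorflow.keras import layers

def residual_dense_block(x, num_convs=6, growth_rate=32):
    channels = x.shape[-1]
    feats = [x]
    # Densely connected 3x3 convolutions inside the block.
    for _ in range(num_convs):
        inp = feats[0] if len(feats) == 1 else layers.Concatenate()(feats)
        feats.append(layers.Conv2D(growth_rate, 3, padding='same',
                                   activation='relu')(inp))
    # Local feature fusion: a 1x1 convolution compresses the concatenation.
    fused = layers.Conv2D(channels, 1, padding='same')(layers.Concatenate()(feats))
    # Local residual learning: add the block input back.
    return layers.Add()([x, fused])
```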

4.3.6 Experimental Results of 2D Image Super-Resolution

In this part, a comparison of the 2D image SR reconstruction results is presented. The parameters of each network are set according to the corresponding original


Table 4.1 The results of PSNR (dB) and SSIM for scale factor 2 of five networks and bicubic interpolation

Method        PSNR (dB)   SSIM
Bicubic       38.04       0.9581
SRCNN         39.69       0.9664
VDSR          40.55       0.9742
ESPCN         39.18       0.9428
SRDenseNet    32.74       0.8201
RDN           40.78       0.9744

papers, including the patch size, batch size, optimizer, learning rate, and activation function. Following Zhang's paper [48], the DIV2K dataset [49] with 800 images is used for training the above five networks. Since each original image is extremely large, over 150 thousand patches can be extracted, so data augmentation is not necessary. For testing, 80 volumes from the IXI dataset [50] are evaluated with the five deep learning methods and conventional bicubic interpolation. The PSNR and SSIM of the results are shown in Table 4.1. Sample results along the Z direction are shown in Fig. 4.11 for qualitative evaluation. The results show that deeper and more complex networks tend to deliver better results when training samples are sufficient.
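For reference, PSNR and SSIM as reported in Table 4.1 can be computed with scikit-image; the snippet below is a minimal example assuming 2D images with intensities in [0, 255].

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(hr, sr):
    # PSNR in dB and SSIM for one ground-truth / reconstruction pair.
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, data_range=255)
    return psnr, ssim

hr = np.random.randint(0, 256, (128, 128)).astype(np.float64)
sr = hr + np.random.normal(0, 2.0, hr.shape)   # toy "reconstruction"
print(evaluate_pair(hr, sr))
```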

4.4 Medical Image Enhancement by 3D Super-Resolution

In this part, typical state-of-the-art 3D CNNs proposed in recent years for MRI super-resolution are introduced.

4.4.1 3D Convolutional Neural Network for Super-Resolution (3D-SRCNN)

Although SRCNN (introduced in Sect. 4.3.1) was originally designed for 2D image processing, many medical images are 3D volumes, and 2D SR networks like SRCNN work slice by slice [51–53] without taking advantage of continuous structures in 3D. A 3D model is preferable, as it directly extracts 3D image features, considering objects across multiple slices. Consequently, Pham et al. proposed a 3D-SRCNN for the SR of MRI. The 3D-SRCNN consists of 3 layers: $n_1$ filters of voxel size $f_1 \times f_1 \times f_1$, $n_2$ filters of voxel size $f_2 \times f_2 \times f_2$, and 1 filter of voxel size $f_3 \times f_3 \times f_3$. Chao et al. [37] showed that the performance may be improved by using a larger filter size in the sec-



Fig. 4.11 Comparison of various kinds of methods and the ground truth (GND): a bicubic interpolation; b SRCNN; c VDSR; d ESPCN; e SRDenseNet; f RDN; g GND


ond layer, but the complexity and memory cost increase and the inference speed tends to decrease. In order to avoid increasing the complexity of the network while ensuring the quality of the results, the following parameters of 3D-SRCNN are set empirically: $f_1 = 9$, $f_2 = 1$, $f_3 = 5$, $n_1 = 64$ and $n_2 = 32$. For 3D-SRCNN, it has been observed that training with large 3D patches is not as stable as with small ones and requires more computation time. Thus, a 3D patch size of 25 × 25 × 25 is small enough to allow the training phase to converge while containing enough information for reconstruction [54].
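A minimal Keras sketch of such a three-layer 3D network is shown below, using the filter sizes and counts just listed; the 'same' padding and the Adam/MSE training setup are assumptions made for illustration and may differ from the original implementation [54].

```python
from tensorflow.keras import layers, models

def build_3d_srcnn(patch=25):
    # Input: an interpolated LR patch of 25 x 25 x 25 voxels, single channel.
    inp = layers.Input(shape=(patch, patch, patch, 1))
    x = layers.Conv3D(64, 9, padding='same', activation='relu')(inp)  # n1=64, f1=9
    x = layers.Conv3D(32, 1, padding='same', activation='relu')(x)    # n2=32, f2=1
    out = layers.Conv3D(1, 5, padding='same')(x)                      # f3=5
    return models.Model(inp, out)

model = build_3d_srcnn()
model.compile(optimizer='adam', loss='mse')
```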

4.4.2 3D Deeply Connected Super-Resolution Network (3D-DCSRN)

Recently, deeper networks have been demonstrated to often bring better results in 2D CNNs for SR. However, stacking an extremely deep network, or multiple convolutional neural networks, in a 3D version may result in a huge number of parameters and thus faces challenges in memory allocation. Consequently, recent research on SR using CNNs mainly focuses on structural improvements in terms of efficiency. As one of the representative works, Chen et al. proposed the 3D Densely Connected Super-Resolution Network (DCSRN) [55], which was inspired by Densely Connected Convolutional Networks [46]. It demonstrated again that more sophisticated network structures with skip connections and layer reuse not only benefit performance and speed but also reduce training time. There are three major benefits of using DCSRN:
• Each path in the proposed network is much shorter, so back-propagation is more efficient and training progresses faster.
• The model is light-weight and efficient due to weight sharing.
• The number of parameters is greatly reduced and features are reused heavily, so overfitting hardly occurs.
The network structure of DCSRN is shown in Fig. 4.12. The most distinctive feature of this network is that patches extracted from the whole 3D image are densely connected and then fed into the next block or layer. First, a convolutional layer with a kernel size of 3 and 2k filters is applied to the input image, followed by a densely-connected block with 4 units. Each unit has a batch normalization layer and an exponential linear unit (ELU) activation. Then, a convolution layer with k filters compresses the feature maps to an appropriate number. Finally, a convolution layer is used as the reconstruction layer to produce the final SR output. According to the original authors, the densely-connected block with four 3 × 3 × 3 convolutional layers, 48 filters as the first-layer output, and a growth rate (k) of 24 generates the best results.


Fig. 4.12 Framework of the 3D Densely Connected Super-Resolution Networks (DCSRN)

4.4.3 Super-Resolution Using a Generative Adversarial Network and 3D Multi-level Densely Connected Network (mDCSRN-GAN)

Most previous deep-learning approaches have not fully solved the medical image SR problem in the following aspects. First, many medical images are 3D volumes, but previous CNNs work slice by slice, discarding information from continuous structures in the third dimension. Second, 3D models have far more parameters than 2D models, which raises challenges in memory consumption and computational expense, so 3D CNNs become less practical. Finally, the most widely used optimization objective for CNNs is a pixel/voxel-wise error, such as the MSE between the model estimate and the reference HR image. But, as mentioned in [56], MSE and its derivative PSNR do not directly represent the visual quality of restored images, and using MSE tends to cause overall blurring and low perceptual quality. In this part, the 3D Multi-level Densely Connected Super-Resolution Network (mDCSRN) [57] proposed by Chen et al. to address the above problems, shown in Fig. 4.13, is briefly introduced. The mDCSRN is extremely light-weight because it utilizes a densely connected network. When trained with a Generative Adversarial Network (GAN) [58], it produces sharper and more realistic-looking images, and it provided state-of-the-art performance as of 2018 when optimized by the intensity difference alone. As shown in Fig. 4.13b, each DenseBlock takes the output of all previous DenseBlocks and is directly connected to the reconstruction layer. Such skip connections have been proven to be more efficient and less prone to overfitting, owing to the direct access to all former layers. Different from the original DenseNet, the pooling layers in mDCSRN have been removed so that it can make full use of the information at full resolution. Furthermore, 1 × 1 × 1 convolutional layers are set as compressors before all the following DenseBlocks. The reported results show that this information compression effectively forces the model to learn universal features without overfitting.


Fig. 4.13 Architecture of a a DenseBlock with 3 × 3 × 3 convolutions and b, c the mDCSRN-GAN generator and discriminator

In addition, the compressor layers adjust the network to the same width before each DenseBlock. In this network, the loss function is the sum of two parts, the intensity loss $loss_{int}$ and the GAN's discriminator loss $loss_{GAN}$:

$loss = loss_{int} + \lambda \, loss_{GAN}$,

(4.16)


where λ is a hyperparameter that is experimentally set to 0.001. The absolute difference (ℓ1 norm) between the output SR and ground-truth HR images is defined as the intensity loss:

$loss_{int} = loss_{\ell_1}/LHW = \frac{1}{LHW}\sum_{z=1}^{L}\sum_{y=1}^{H}\sum_{x=1}^{W}\left|I^{HR}_{x,y,z} - I^{SR}_{x,y,z}\right|$,

(4.17)

where $I^{SR}_{x,y,z}$ and $I^{HR}_{x,y,z}$ are the SR output of the deep learning model and the ground-truth HR image patch, respectively. The GAN's discriminator loss is used as an additional loss for the SR network:

$loss_{GAN} = loss_{WGAN,D} = -D_{WGAN,\theta}(I^{SR})$,

(4.18)

where $D_{WGAN,\theta}$ is the discriminator's output from the gradient-penalty variant of the Wasserstein GAN [59] for SR images.
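A sketch of the combined generator objective, assuming the critic scores for a batch of SR volumes have already been computed, is given below.

```python
import tensorflow as tf

LAMBDA = 0.001  # weight of the adversarial term, as reported above

def generator_loss(sr, hr, critic_scores):
    # Eq. (4.17): mean absolute (L1) intensity difference over the volume.
    loss_int = tf.reduce_mean(tf.abs(hr - sr))
    # Eq. (4.18): negated Wasserstein critic score of the SR volumes.
    loss_gan = -tf.reduce_mean(critic_scores)
    # Eq. (4.16): combined objective.
    return loss_int + LAMBDA * loss_gan
```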

4.4.4 Experimental Results of 3D Image Super-Resolution

In this part, a comparison of the 3D MR image SR reconstruction results is presented. The parameters of each network are again set according to the corresponding original papers. 500 images from the IXI dataset are used for training and 80 images for testing. The PSNR and SSIM of the results are shown in Table 4.2. Because a workstation is needed to train the whole mDCSRN-GAN model, here we compare only the generator of mDCSRN-GAN with the other 3D models. Sample results in three directions are shown in Fig. 4.14 for qualitative evaluation. Since SR with deep 3D CNNs requires a large quantity of training samples, mDCSRN does not outperform DCSRN without its discriminator, even though it is deeper and more complex.

Table 4.2 The results of PSNR (dB) and SSIM for scale factor 2 of three 3D networks and bicubic interpolation

Method        PSNR (dB)   SSIM
Bicubic       31.91       0.9817
3D SRCNN      34.16       0.9897
3D DCSRN      35.46       0.9924
3D mDCSRN     35.41       0.9922



Fig. 4.14 Comparison of various kinds of methods and the ground truth (GND) in 3 directions: a tricubic interpolation; b 3D-SRCNN; c 3D-DCSRN; d mDCSRN; e GND


4.5 Conclusion

In this chapter, we presented the basic convolutional neural network for SR and several typical state-of-the-art methods proposed in recent years. In addition, since the quality requirements for medical images keep rising, the results of conventional 2D CNNs are often not good enough, and 3D CNNs have demonstrated their superiority in 3D medical image enhancement. With the further development of deep learning, image processing technology will become faster and more complete, and better ways of using neural networks for image enhancement remain worth researching and exploring.

Acknowledgements This work is supported in part by the Grant-in-Aid for Scientific Research from the Japanese Ministry for Education, Science, Culture and Sports (MEXT) under Grant Nos. 18K18078 and 18H03267, and in part by the Zhejiang Lab Program under Grant No. 2018DG0ZX01.

References 1. Elad, M., Arie, F.: Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images. IEEE Trans. Image Process. 6(12), 1646–1658 (1997) 2. Park, S., Park, M., Kang, M.: Super-resolution image reconstruction: a technical overview. IEEE Signal Process. Mag. 20(3), 21–36 (2003) 3. Protter, M., et al.: Generalizing the nonlocal-means to super-resolution reconstruction. IEEE Trans. Image Process. 18(1), 36–51 (2009) 4. Cui, Z., Chang, H., Shan, S., Zhong, B., Chen, X.: Deep network cascade for image superresolution. In: European Conference on Computer Vision, pp. 49–64 (2014) 5. Freedman, G., Fattal, R.: Image and video upscaling from local self-examples. ACM Trans. Graph. 30(11), 12 (2011) 6. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 349–356 (2019) 7. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed selfexemplars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206 (2015) 8. Yang, J., Lin, Z., Cohen, S.: Fast image super-resolution based on in-place example regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1059–1066 (2013) 9. Bahrami, K., et al.: Convolutional neural network for reconstruction of 7T-like images from 3T MRI using appearance and anatomical features. Deep Learning and Data Labeling for Medical Applications, pp. 39–47. Springer, Cham (2016) 10. Bevilacqua, M., Roumy, A., Guillemot, C., Morel, M.L.A.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In: British Machine Vision Conference, pp. 1–10 (2012) 11. Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2004) 12. Dai, D., Timofte, R., Van Gool, L.: Jointly optimized regressors for image super-resolution. Eurographics 7, 8 (2015) 13. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. Int. J. Comput. Vis. 40(11), 25–47 (2000) 14. Jia, K., Wang, X., Tang, X.: Image transformation based on learning dictionaries across image spaces. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 367–380 (2013)


15. Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Trans. Pattern Anal. Mach. Intell. 32(6), 1127–1133 (2010) 16. Schulter, S., Leistner, C., Bischof, H.: Fast and accurate image upscaling with super-resolution forests. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3799 (2015) 17. Timofte, R., De Smet, V., Van Gool, L.: Anchored neighborhood regression for fast examplebased super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1920–1927 (2013) 18. Timofte, R., De Smet, V., Van Gool, L.: A+: adjusted anchored neighborhood regression for fast super-resolution. In: Asian Conference on Computer Vision, pp. 111–126 (2014) 19. Yang, J., Lin, Z., Cohen, S.: Fast image super-resolution based on in-place example regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1059-1066 (2013) 20. Yang, J., Wang, Z., Lin, Z., Cohen, S., Huang, T.: Coupled dictionary training for image superresolution. IEEE Trans. Image Process. 21(11), 3467–3478 (2012) 21. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Trans. Image Process. 19(11), 2861–2873 (2010) 22. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: International Conference on Curves and Surfaces, pp. 711–730 (2012) 23. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: 12th International Conference on Computer Vision (2009) 24. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal. Mach. Intell. 9, 1167–1183 (2002) 25. Lin, Z., Shum, H.: Fundamental limits of reconstruction-based superresolution algorithms under local translation. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 83–97 (2004) 26. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) 27. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 28. Zhao, H., Gallo, O., Frosio, I., Kautz, J.: Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3(1), 47–57 (2017) 29. Zhang, L., Zhang, L., Mou, X., Zhang, D.: A comprehensive evaluation of full reference image quality assessment algorithms. In: IEEE International Conference on Image Processing, pp. 1477–1480 (2012) 30. https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio 31. Oriani, E.: qpsnr: a quick PSNR/SSIM analyzer for Linux. Accessed 6 April 2011 32. Pnmpsnr User Manual. Accessed 6 April 2011 33. Welstead, S.: Fractal and Wavelet Image Compression Techniques, pp. 155–156. SPIE Optical Engineering Press (1999) 34. Raouf, H., Dietmar, S., Barni, M. (ed.): Fractal Image Compression. Document and image compression, vol. 968, pp. 168–169. CRC Press, Boca Raton. ISBN 9780849335563. Accessed 5 April 2011 35. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process 13(4), 600–612 (2004) 36. Chao, D., et al.: Learning a deep convolutional network for image super-resolution. In: European Conference on Computer Vision. Springer, Cham (2014) 37. 
Chao, D., et al.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016) 38. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: European Conference on Computer Vision (2016) 39. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)


40. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning, pp. 807–814 (2010) 41. Kim, J., Lee, J., Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654 (2016) 42. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015) 43. Wang, Z., Liu, D., Yang, J., Han, W., Huang, T.: Deeply improved sparse coding for image super-resolution 2(3), 4 (2015). arXiv:1507.08905 44. Chen, Y., Pock, T.: Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1256–1272 (2017) 45. Shi, W., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883 (2016) 46. Huang, G., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 47. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 48. Tong, T., et al.: Image super-resolution using dense skip connections. In: Proceedings of the IEEE International Conference on Computer Vision (2017) 49. Eirikur, A., Radum, T.: NTIRE 2017 challenge on single image super-resolution: dataset and study. In: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017) 50. https://brain-development.org/ixi-dataset/ 51. Greenspan, H., et al.: MRI inter-slice reconstruction using super-resolution. Magn. Reson. Imaging 20(5), 437–446 (2002) 52. Greenspan, H.: Super-resolution in medical imaging. Comput. J. 52(1), 43–63 (2008) 53. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017) 54. Pham, C., et al.: Brain MRI super-resolution using deep 3D convolutional networks. In: IEEE 14th International Symposium on Biomedical Imaging (2017) 55. Chen, Y., et al.: Brain MRI super resolution using 3D deep densely connected neural networks. In: IEEE 15th International Symposium on Biomedical Imaging (2018) 56. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 57. Chen, Y., et al.: Efficient and accurate MRI super-resolution using a generative adversarial network and 3d multi-level densely connected network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham (2018) 58. Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial nets. Advances in Neural Information Processing Systems, pp. 2672–2680 (2014) 59. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein gan (2017). arXiv:1701.07875

Part II

Advanced Deep Learning in Healthcare

Chapter 5

Improving the Performance of Deep CNNs in Medical Image Segmentation with Limited Resources Saeed Mohagheghi, Amir Hossein Foruzan and Yen-Wei Chen

Abstract Convolutional neural networks (CNNs) have obtained enormous success in image segmentation, which is substantial in many clinical treatments. Even though CNNs have achieved state-of-the-art performance, most research on semantic segmentation using deep learning methods is in the field of computer vision, so research on medical images is much less mature than that on natural images, especially in the field of 3D image segmentation. Our experiments on CNN segmentation models demonstrated that, with modifications and tuning of the network architecture and parameters, the modified models show better performance on the selected task, especially with a limited training dataset and hardware. We have selected 3D liver segmentation as our goal and present a pathway to select a state-of-the-art CNN model and improve it for our specific task and data. Our modifications include the architecture, the optimization algorithm, the activation functions and the number of convolution filters. With the designed network, we used relatively less training data than other segmentation methods. The direct output of our network, with no further post-processing, resulted in a Dice score of ~99 on training and ~95 on validation images in 3D liver segmentation, which is comparable to state-of-the-art models that used more training images and post-processing. The proposed approach can be easily adapted to other medical image segmentation tasks. Keywords Deep learning · Convolutional neural network · Model improvement · Medical image segmentation · 3D liver segmentation

S. Mohagheghi · A. H. Foruzan (B) Department of Biomedical Engineering, Engineering Faculty, Shahed University, Tehran, Iran e-mail: [email protected] S. Mohagheghi e-mail: [email protected] Y.-W. Chen College of Information Science and Engineering, Ritsumeikan University, 525–8577 Kusatsu-shi, Japan e-mail: [email protected] © Springer Nature Switzerland AG 2020 Y.-W. Chen and L. C. Jain, Deep Learning in Healthcare, Intelligent Systems Reference Library 171, https://doi.org/10.1007/978-3-030-32606-7_5


5.1 Introduction

Image segmentation is a substantial step in treatment planning procedures such as computer-aided surgery and radiotherapy. However, there are several challenges when dealing with medical images, including the low contrast between an organ and its surrounding tissues, noise and artifacts, large variations in the shape and appearance of the objects, and insufficient annotated training data. To overcome these limitations, researchers have incorporated domain knowledge, included constraints, employed advanced preprocessing techniques, and developed unsupervised techniques. Due to the current progress of software and hardware technologies, many researchers now tend to use deep neural networks to overcome the above limitations. Deep learning is, in fact, a new approach to conventional neural networks and is considered a machine learning method in the field of artificial intelligence [1]. The main steps of a typical machine learning technique consist of feature representation, learning, and testing stages. The features of the input data are first extracted in the feature representation step, and then, in the learning phase, they are used to train the desired algorithm. Deep learning methods learn feature hierarchies and complex functions from data without depending on human-crafted features [2]. In deep learning methods, feature representation and algorithm learning are performed together, and the learning process is end-to-end from the input data to the result.

5.2 Deep Convolutional Neural Networks

Based on the types of layers and connections in a network, various models have been introduced in the context of deep learning, including auto-encoder (AE) networks [3], deep belief networks (DBNs) [4], convolutional neural networks (CNNs) [5], recurrent neural networks (RNNs) [6] and generative adversarial networks (GANs) [7]. Researchers use these models in various applications such as computer vision, image and video processing, speech recognition, natural language processing, and text analysis. Convolutional neural networks are considered a revolution in the field of image processing and computer vision. CNN models have ranked first in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) since 2012 [5, 8–12]. The majority of state-of-the-art methods for object detection and semantic segmentation use some form of CNN. CNNs are specially adapted to multi-dimensional data such as 2D/3D images and video frames. CNNs can capture a hierarchy of features and are invariant to translation, rotation and small deformations of the object as well. The basic structure of CNNs was designed for classification, where, for example, the input of the model is an image and the output is the class of the detected object in the image. Afterward, the application of these models was extended to object detection (where the output is a bounding box around the detected


Fig. 5.1 A simple CNN structure for classification. Conv: convolutional layer. Pooling: downsampling layer. FC: fully-connected layer

object) and semantic segmentation (where each pixel in the image is assigned a class label in the output, yielding a precise boundary around the detected objects). The essential part of a CNN is the convolutional (Conv) layer (Fig. 5.1). Conv layers apply a window of weights to the input image, and the output is the result of sliding this window over the entire area of the image. The weight window (convolution filter or kernel) holds the learnable parameters of the Conv layer, and the output of a Conv layer is a feature map. With several filters in a convolutional layer, the output is the concatenation of several feature maps, which becomes the input for the next layer. A common approach is to apply a non-linear function (called the activation function) to the Conv layer's output. Popular activation functions include the sigmoid and functions specially designed for deep CNNs, like the Rectified Linear Unit (ReLu) [13]. It is also common to use a down-sampling layer (e.g., max pooling) after a certain number of Conv layers (Fig. 5.1). The advantages of down-sampling layers are (i) reducing the size of the data, so that the number of Conv filters can be increased, and (ii) increasing the field of view of the Conv filters in the next layers, which leads to features that are more global. The last part of a typical classification CNN is one or more fully-connected (dense) layers ending in the same number of output neurons as the number of data classes, followed by a softmax (for multi-class classification) or a sigmoid (for binary classification) function. There are other types of layers which we can use in a deep CNN; Dropout [14] and Batch Normalization (BN) [15] are the most popular ones. Dropout layers prevent the network from overfitting the training data by dropping out neurons according to a probability. However, using dropout after Conv layers does not improve the results, so we often use it between fully-connected layers. BN layers normalize every batch of the data and can be used with either fully-connected or Conv layers. When we use BN layers in our model, we can benefit from higher learning rates and reduce the strong dependence on weight initialization.
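As a concrete illustration of Fig. 5.1, the following Keras snippet builds a toy classification CNN with two Conv/pooling stages, a fully-connected layer with dropout, and a softmax output; the input shape and filter counts are arbitrary choices made for the sketch.

```python
from tensorflow.keras import layers, models

def simple_classification_cnn(input_shape=(64, 64, 1), num_classes=2):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding='same', activation='relu'),  # Conv + ReLu
        layers.MaxPooling2D(2),                                   # down-sampling
        layers.Conv2D(32, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),                      # FC layer
        layers.Dropout(0.5),           # dropout between fully-connected layers
        layers.Dense(num_classes, activation='softmax'),
    ])

model = simple_classification_cnn()
```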


5.3 Medical Image Segmentation Using CNNs

The architecture of segmentation CNNs is different from that of classification CNNs. A segmentation model predicts a class for each point (pixel or voxel) of the input image, so the output is a mask or label image, while a classification CNN assigns one class to the whole input. In a segmentation task, we usually have paired images {x, y}, where x is the observed intensity image given as input and y is the desired output, i.e., the ground-truth label map with labels from L = {1, 2, . . . , C} representing different tissues. A segmentation model should estimate y having observed x. CNNs perform this task by learning a discriminative function that models the underlying conditional probability distribution P(y|x, θ), where θ denotes the network parameters. Currently, the most popular CNN models used in medical image segmentation take the form of Fully Convolutional Networks (FCNs) [16]. FCNs are a class of CNNs designed for segmentation tasks that include only locally connected layers (i.e., convolution, pooling, and up-sampling). There is no fully-connected layer in this kind of architecture, so the model can be used with different image sizes in both the training and test steps. This architecture also reduces the number of parameters and the computation time. A segmentation FCN is composed of two parts: (i) feature extraction (or analysis), similar to a classification CNN like VGG-Net [9], and (ii) reconstruction (or synthesis), as in the case of U-Net [17] and 3D U-Net [18]. Figure 5.2 shows a schematic of a segmentation network. In the analysis path, we usually increase the number of filters and feature maps while decreasing the size of the data. In the synthesis path, we reduce the number of feature maps and grow the data back to the original input size using up-sampling. We can also replace the pooling and up-sampling layers with Conv layers. In this kind of architecture, the down-sampling is done via strides ≥ 2 in the Conv layers, and the up-sampling is performed by transposed convolutional (also called UpConv) layers with strides ≥ 2.

Fig. 5.2 Typical structure of a segmentation FCN. All layers are locally connected, and no fully-connected layer is used


5.3.1 CNN Training

During the training process, the CNN model learns to estimate the class densities P(y|x, θ) by assigning to each pixel or voxel $x_i$ the probability of belonging to each of the C classes, yielding C sets of class feature maps $f_c$. Then, we obtain the class labels by applying a softmax function to the extracted class feature maps, $\hat{y}_i = \frac{e^{-f(c,i)}}{\sum_{j=1}^{C} e^{-f(j,i)}}$. The network learns the mapping between intensities and labels by optimizing the average of a loss function, $\frac{1}{N}\sum_i L(\hat{y}_i, y_i)$ (e.g., cross-entropy or Dice loss), between the predicted label map $\hat{y}_i$ and the ground-truth mask image $y_i$. The optimization process is done using back-propagation [19] and an optimization algorithm to minimize this loss. A sample diagram of a training procedure is illustrated in Fig. 5.3.

Fig. 5.3 Training diagram of a segmentation CNN. img: input image. SEG: segmentation network. pred: predicted label image. GT: ground truth mask. DCE: Dice coefficient loss

Optimization algorithms are used to update the internal parameters of the network (weights and biases) to minimize the loss function. There are two types of algorithms with respect to the learning rate: algorithms with a constant learning rate, like stochastic gradient descent (SGD), and algorithms with an adaptive learning rate, like RMSProp [20] and Adam [21]. Currently, the most popular optimization algorithms in active use are SGD, RMSProp, AdaDelta [22], Adam and their variants [1]. An overview of optimization algorithms can be found in [23].

5.3.2 Challenges of CNNs in Medical Image Segmentation

In most areas of deep learning, many questions remain unanswered, and there is a need to design new methods and theoretical foundations. The current situation creates an incentive for us to develop new approaches to network architecture design and training strategies. For automatic segmentation methods, there are specific challenges in working with medical images. These challenges include noise and artifacts, large variations in the shape and appearance of the objects, low contrast between the object and adjacent tissues, and the presence of different pathologies (such as tumors and cysts). We also encounter other problems, like the limited available labeled data for training


and the large size of the images (especially for volumetric data), which leads to hardware memory limitations during the training process. In this chapter, we aim to design a proper network architecture and training strategy based on current state-of-the-art CNN models to optimize the selected model for 3D liver segmentation in CT scan images. We first introduce the 3D U-Net model, which is a popular model in 3D medical image segmentation. Then, we propose techniques and modifications that optimize the model for the 3D liver segmentation task with limited hardware resources and labeled images. Our proposed strategy can be adapted to other models and other tasks. The contributions of our method are as follows: (i) We use a small number of images to train a high-performance segmentation model. (ii) Our improved model has fewer parameters than the basic model and can be trained on medium hardware. (iii) We improved both the convergence speed and the performance of the model on our data. (iv) We trained our model end-to-end with no special pre-processing or post-processing.

5.4 Materials and Methods

5.4.1 Experimental Setup

We used two datasets for our task, each containing 20 abdominal CT images. The first set was obtained from the MICCAI 2007 Grand Challenge workshop [24], and the second set (3D-IRCADb) belongs to the IRCAD Laparoscopic Training Center [25]. We crop each image using the bounding box extracted from the corresponding liver mask image. All cropped images have the same size of 128 × 128 × 128 voxels and contain the entire liver. We also map the intensities of all images to the [0, 255] range. We used the TensorFlow [26] software library to implement all models and training sessions on a personal computer with a GeForce® GTX™ 1070Ti graphics card with 8 GB dedicated memory, a 3.3 GHz dual-core Intel® Pentium® G4400 CPU and 16 GB DDR4 RAM. Almost all computations were performed on the Graphics Processing Unit (GPU).
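The cropping and intensity mapping can be done with a few lines of NumPy; the sketch below assumes a binary liver mask is available and leaves the resampling to 128 × 128 × 128 voxels to a separate step (e.g., scipy.ndimage.zoom).

```python
import numpy as np

def crop_and_rescale(volume, liver_mask):
    # Bounding box of the liver derived from the mask.
    coords = np.argwhere(liver_mask > 0)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    roi = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]].astype(np.float32)
    # Map intensities to the [0, 255] range.
    roi -= roi.min()
    roi *= 255.0 / max(float(roi.max()), 1e-6)
    return roi
```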

5.4.2 Basic Model

We design an initial model that is similar to the basic 3D U-Net, except that we reduce the number of filters to fit our hardware specifications. Then, we train this model (which we call the Basic model) on our data. 3D U-Net [18] is an FCN model and an extension of the standard U-Net [17] with all 2D operations replaced by their 3D counterparts. The analysis path of the model has four resolution steps; each step contains two Conv blocks, each including a batch normalization (BN) and a rectified


Fig. 5.4 The architecture of the basic model (3D U-Net). 3D boxes represent feature maps. Green arrows represent skip connections. (Figure adapted from [18])

linear unit (ReLu), and ends with a max-pooling layer. The synthesis path also has four resolution steps; each step starts with an up-sampling layer, followed by two sets of Conv blocks. In the last layer, a 1 × 1 × 1 Conv layer reduces the number of output channels to the number of labels [18]. The architecture of the basic model is illustrated in Fig. 5.4. This model benefits from skip connections (also known as shortcut or residual connections) as well. These connections bridge layers of similar resolution in the analysis and synthesis paths. Skip connections have proved to be useful in U-Net [17], 3D U-Net [18] and V-Net [27], and they are also used in residual networks [11, 28]. Drozdzal et al. [29] evaluated the benefits of skip connections in creating deep architectures for medical image segmentation. A common approach to evaluating deep models is to split the available dataset into three groups: training, validation, and test. The training set contains most of the data and is used to run the learning algorithm. The validation set consists of a small portion of the available data (also known as the development or dev set), which we use to monitor the performance of the network, tune parameters, and make other modifications [30]. We use the test set to benchmark the proposed model. In this work, we shuffle the two input datasets and select 35 images as the training set, four images as the validation set, and one image for testing. We keep the training and validation sets fixed for all sessions to have a fair comparison between the variations of the models. We used only the negative Dice Similarity Coefficient (DSC) as the loss function and the evaluation metric. The Dice coefficient for two binary images y and ŷ, which are the ground-truth and the predicted mask images respectively, is defined as $DSC = \frac{2|y \cdot \hat{y}|}{|y| + |\hat{y}|}$. The measure is always between 0 and 1, but the results of the Dice coefficient


Fig. 5.5 The loss plot of the basic model on our data (−1 is the optimum value). The divergence of the validation from training loss shows the probability of over-fitting

are usually reported as a percentage (DSC × 100). Since the optimization function tries to minimize the loss, we use the negative Dice coefficient so that the best value is −1. As mentioned in [30], having a single-number evaluation metric speeds up the whole process and gives a clear preference ranking among all models. Figure 5.5 shows the loss plot of the basic model after running on our data.
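A soft version of this negative Dice loss can be implemented in TensorFlow as follows (a sketch; the small epsilon is an assumption added here to avoid division by zero for empty masks).

```python
import tensorflow as tf

def negative_dice(y_true, y_pred, eps=1e-6):
    # Soft negative Dice coefficient; the best possible value is -1.
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return -(2.0 * intersection + eps) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)
```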

5.5 Model Optimization for 3D Liver Segmentation

Monitoring the changes of the training and validation losses during training is one of the essential tools for inspecting the performance of a network, and it helps us to diagnose problems and decide on the next steps for performance improvement. If the training error is high, we can say that our model is under-fitted and is not appropriate for our task. Thus, we should increase the size or complexity of the model or change its architecture [30]. As we can see in Fig. 5.5, the main problem of our model is a high validation error. When the validation loss does not converge and does not follow the training loss, we can say that the model is over-fitted and has poor generalization. In this case, we could reduce the number of parameters of the model or use more training data. Other options to lower the validation loss are using regularization techniques to constrain the optimization process, and changing the model parameters and hyper-parameters such as the batch size, the learning rate of the optimization algorithm and the layers of the model.


Using more training data is not our solution because we try to optimize the selected model for the available data. Another reason for not using more data (even with data augmentation methods) is that we plan to evaluate several models using a smaller number of training images. We can take several actions to prevent the model from overfitting and improve its performance for our task. We discuss these steps in the following.

5.5.1 Model's Architecture

We start the model modifications by replacing the max-pooling layers with a stride of two in the preceding Conv layers. We also remove the up-sampling layers and replace the first Conv layer after each of them with a transposed Conv layer with a stride of two. Since pooling and up-sampling layers are fixed operations, these changes have no effect on the number of parameters, but they reduce the number of computations and also make the size-changing operations trainable (especially the up-sampling process). A detailed explanation of Conv and transposed Conv layers and their effect on the data can be found in [31]. In the basic model, trying to increase the batch size causes a memory error on the GPU, and changing the learning rate makes the model fail to converge. However, with the new architecture, we can use a batch size of three and other learning rates without convergence or memory problems. The results are slightly better than the basic model.
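In Keras terms, the two substitutions look roughly as follows (the filter counts and the activation are placeholders for the sketch).

```python
from tensorflow.keras import layers

def down_block(x, filters):
    # A strided 3x3x3 convolution replaces Conv + max-pooling.
    return layers.Conv3D(filters, 3, strides=2, padding='same',
                         activation='relu')(x)

def up_block(x, filters):
    # A strided transposed convolution replaces up-sampling + Conv,
    # making the resizing operation trainable.
    return layers.Conv3DTranspose(filters, 3, strides=2, padding='same',
                                  activation='relu')(x)
```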

5.5.2 Optimization Algorithm

Various optimization algorithms have been proposed in recent years; the 3D U-Net model uses the SGD optimization algorithm. However, recent studies have revealed that algorithms with adaptive learning rates help the model converge faster and also avoid getting stuck in saddle points [1, 23]. We use Adam [21] for our model because it is computationally efficient, has little memory requirement, and has shown its effectiveness in our experiments with CNN models. Adam is one of the most commonly used optimization algorithms and is, in fact, a combination of the RMSProp [20] and momentum techniques. Regarding its two adjustable parameters β1 and β2, we use the recommended default values of 0.9 and 0.999, respectively.
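In TensorFlow/Keras this amounts to a one-line change; the initial learning rate of 0.0005 shown here anticipates the value found best in Sect. 5.5.5.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4,
                                     beta_1=0.9, beta_2=0.999)
```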


5.5.3 Activation Function

Another parameter to refine is the activation function, which has a significant effect on both the speed of convergence and the accuracy of the model. ReLu and its variants have shown great advantages over conventional activation functions (e.g., sigmoid) in CNNs [5], but another activation function, named the Exponential Linear Unit (ELU), has been introduced recently [32], which helps the model converge faster and leads to more accurate results [33]. ELU is identical to ReLu for non-negative inputs but acts differently on negative inputs. The plots of the ReLu and ELU functions are shown in Fig. 5.6. We evaluated both ReLu and ELU activation functions in our modified model, and the results are illustrated in Fig. 5.7.
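For reference, the two activation functions compared here are simply:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Identity for non-negative inputs, a smooth saturating curve
    # for negative inputs (cf. Fig. 5.6).
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
```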

Fig. 5.6 ReLu versus ELU activation functions

Fig. 5.7 Comparison of the training and validation loss plots of the modified model with ReLu and ELU activation function in the Conv layers (−1 is the optimum value)


By looking at the loss plots in Fig. 5.7, we can say that the training loss converges faster when we use ELU, and the validation loss continues to decrease and follow the training loss. Thus, the overfitting problem is resolved, and the generalization of the model will improve if we continue the training process.

5.5.4 Complexity of the Model

Another approach to preventing the model from overfitting is to reduce the model's complexity. Regardless of the size of the data, the complexity of a deep CNN depends on the number of layers and the size and number of convolution filters. We divided the number of filters of all Conv layers (except the last one) by 2, 4 and 8 and trained the model in each configuration. As we can see in Fig. 5.8, the speed of convergence is reduced in the models with fewer Conv filters. However, the validation loss plot in Fig. 5.8 shows that when we divide the number of filters by 4 or 8, the validation loss keeps decreasing (the model does not overfit). Moreover, if we continued training for more batches, or even with a higher learning rate, we would obtain better results. The results of training with different numbers of convolution filters are listed in Table 5.1.

Fig. 5.8 Loss plots of the modified model with a different number of filters. Each plot is the result of a model with the number of convolution filters in each layer divided by division factor d


Table 5.1 Results of the training in three different configurations. The number of convolution filters is divided by the division factor (d) in each row

Division factor   # of parameters (million)   Training time (min)   Min. training error   Min. validation error
d = 2             ~8                          54                    −0.994                −0.949
d = 4             ~5                          33                    −0.991                −0.945
d = 8             ~4                          25                    −0.984                −0.941

5.5.5 Parameter Tuning and Final Results

There are no fixed rules for setting the number of parameters, the number of epochs and the learning rate; frequently, we find the proper values by trial and error. We evaluated the model with different numbers of filters, learning rates and batch sizes to find the best configuration. For the number of epochs, we continue training until the validation loss plot stays flat for a long time. Variations in the learning rate are handled by the optimization algorithms with an adaptive learning rate, but the value of the initial learning rate affects the overall speed of convergence. We tried different learning rates from 0.00001 to 0.001. Another parameter that has an impact on the generalization ability of the model is the batch size. Since we are dealing with large volumetric images, we are limited to small batch sizes and could only try batch sizes of 1–7. After evaluating the model with different configurations, the best results were achieved with a division factor of 4 (d = 4), a batch size of 3 and a learning rate of 0.0005. Figure 5.9 shows the loss plots of the Basic model and the final improved model for comparison, and Table 5.2 contains the specifications of the basic and the modified models. Figures 5.10 and 5.11 show the segmentation results on the test and validation images for both the basic and improved models. As is evident in Figs. 5.9, 5.10 and 5.11 and Table 5.2, with our modifications and improvements we obtain a lower validation error and more accurate segmentation results on an unseen image, which means that the generalization of the model is increased. We also reduced the number of computations and the training time of the improved model.

5.6 Discussion and Conclusion

We presented a pathway to select a state-of-the-art CNN model and improve it for a specific task and data. We selected the 3D U-Net model and refined it for the task of 3D liver segmentation in CT scan images. We modified the architecture of the model, the optimization algorithm, the activation functions, and the complexity of the network. We also tuned other parameters like the learning rate and the batch size. We showed that deep 3D CNNs can achieve good results without requiring a large number of training images. We used only 35 training images for our proposed model


Fig. 5.9 Validation loss plots of the basic and the modified models

Table 5.2 Comparison of the Basic and Modified models. Both models were trained for 200 epochs

Model      # of parameters (million)   Training time (min)   Max training DSC (%)   Max validation DSC (%)
Basic      ~8                          127                   99.2                   87.9
Modified   ~5                          50                    99.4                   95.2

and obtained results comparable to those of state-of-the-art models that use more training images and heavy post-processing on their results [34]. With more training data or simple post-processing on the output of our model, we could outperform the state-of-the-art results. The proposed approach can also be used when enough training data and hardware resources are available: we can select a small part of a big dataset, follow the improvement process, and arrive at a few model candidates; then we can train and evaluate the final candidates with all images and the desired number of epochs. It is essential to check the dataset images one by one to ensure that there are no corrupted or problematic images. A simple pre-processing step, like intensity normalization or smoothing, can have a significant impact on the results. This pre-processing should be applied to the images in both the training and test phases. It is also important to perform multiple runs for each model and average the losses to obtain more reliable results. There are also other methods to improve deep neural networks, like the knowledge distillation method [35] and other methods of integrating prior knowledge into deep neural networks. These methods usually require a large training dataset and powerful


Fig. 5.10 Typical segmentation of a test image. img: input image; GT: ground-truth mask image; Basic: output of the basic model; Improved: output of the modified model

Fig. 5.11 3D reconstruction of the segmentation of a validation image. GT: ground-truth mask image; Basic: output of the basic model; Improved: output of the modified model

hardware. Therefore, we did not include these methods in this work because of our assumption of having a limited dataset and hardware; evaluating them requires a separate study.


References 1. Goodfellow, I., et al.: Deep Learning, vol. 1. MIT press, Cambridge (2016) 2. Bengio, Y.: Learning deep architectures for AI. Found. Trends® Mach. Learn. 2(1), 1–127 (2009) 3. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006) 4. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006) 5. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012) 6. Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308. 0850 (2013) 7. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014) 8. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Springer (2014) 9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 10. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 11. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 12. Szegedy, C., et al.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI (2017) 13. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010) 14. Srivastava, N., et al.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014) 15. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015) 16. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 17. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2015) 18. Çiçek, Ö., et al.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2016) 19. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986) 20. Hinton, G.: Neural networks for machine learning. Coursera, [video lectures] (2012) 21. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412. 6980 (2014) 22. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012) 23. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv: 1609.04747, 2016 24. Van Ginneken, B., Heimann, T., Styner, M.: 3D segmentation in the clinic: a grand challenge. In: 3D Segmentation in the Clinic: a Grand Challenge, pp. 7–15 (2007) 25. Soler, L., et al.: 3D Image reconstruction for comparison of algorithm database: a patient specific anatomical and medical image database (2010) 26. 
Abadi, M., et al.: Tensorflow: a system for large-scale machine learning. In: OSDI (2016)

94

S. Mohagheghi et al.

27. Milletari, F., Navab, N., Ahmadi, S.-A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 3D Vision (3DV), 2016 Fourth International Conference on. IEEE (2016) 28. Chen, H., et al.: VoxResNet: deep voxelwise residual networks for volumetric brain segmentation. arXiv preprint arXiv:1608.05895 (2016) 29. Drozdzal, M., et al.: The importance of skip connections in biomedical image segmentation. In: International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Springer (2016) 30. Ng, A.: Machine Learning Yearning (2017) 31. Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 (2016) 32. Clevert, D.-A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015) 33. Pedamonti, D.: Comparison of non-linear activation functions for deep neural networks on MNIST classification task. arXiv preprint arXiv:1804.02763 (2018) 34. Hu, P., et al.: Automatic 3D liver segmentation based on deep learning and globally optimized surface evolution. Phys. Med. Biol. 61(24), 8676 (2016) 35. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

Chapter 6

Deep Active Self-paced Learning for Biomedical Image Analysis Wenzhe Wang, Ruiwei Feng, Xuechen Liu, Yifei Lu, Yanjie Wang, Ruoqian Guo, Zhiwen Lin, Tingting Chen, Danny Z. Chen and Jian Wu

Abstract Automatic and accurate analysis in biomedical images (e.g., image classification, lesion detection and segmentation) plays an important role in computer-aided diagnosis of common human diseases. However, this task is challenging due to the need for sufficient training data with high-quality annotation, which is both time-consuming and costly to obtain. In this chapter, we propose a novel Deep Active Self-paced Learning (DASL) strategy to reduce annotation effort and also make use of unannotated samples, based on a combination of Active Learning (AL) and Self-Paced Learning (SPL) strategies. To evaluate the performance of the DASL strategy, we apply it to two typical problems in biomedical image analysis, pulmonary nodule segmentation in 3D CT images and diabetic retinopathy (DR) identification in digital retinal fundus images. In each scenario, we propose a novel deep learning model and train it with the DASL strategy. Experimental results show that the proposed models trained with our DASL strategy perform much better than those trained without DASL using the same amount of annotated samples.

W. Wang · R. Feng · X. Liu · Y. Lu · Y. Wang · R. Guo · Z. Lin · T. Chen · J. Wu (B) College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China e-mail: [email protected] W. Wang e-mail: [email protected] W. Wang · R. Feng · X. Liu · Y. Lu · Y. Wang · R. Guo · Z. Lin · T. Chen · D. Z. Chen · J. Wu Real Doctor AI Research Centre, Zhejiang University, Hangzhou 310027, China D. Z. Chen Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA © Springer Nature Switzerland AG 2020 Y.-W. Chen and L. C. Jain, Deep Learning in Healthcare, Intelligent Systems Reference Library 171, https://doi.org/10.1007/978-3-030-32606-7_6


6.1 Introduction Automatic and accurate analysis in biomedical images (e.g., image classification, lesion detection and segmentation) plays an important role in computer-aided diagnosis of common human diseases. Recent advances in deep learning, especially in Convolutional Neural Networks (CNNs), provide powerful tools for solving a variety of biomedical imaging problems, such as lesion detection and disease diagnosis. However, sufficient annotations are commonly needed to train a deep network that achieves good performance, which can incur a great deal of annotation effort and cost. For applications where the appearance of each tiny lesion is important for decision making, it is especially difficult for human experts to annotate each of these lesions for deep network training. To better investigate and tackle this issue, we take 3D pulmonary nodule segmentation and diagnosis of diabetic retinopathy (DR) as examples.

Pulmonary Nodule Segmentation Lung cancer is one of the most life-threatening malignancies. A pulmonary nodule is a small growth in the lung, which has the risk of being a site of cancerous tissue. The boundaries of pulmonary nodules have been regarded as a vital criterion for lung cancer diagnosis [1], and Computed Tomography (CT) is one of the most common methods for examining the presence and boundary features of pulmonary nodules, as shown in Fig. 6.1. Thus, the automated segmentation of pulmonary nodules in CT volumes promotes early diagnosis of lung cancer by reducing the need for expensive human expertise. Recent studies on pulmonary nodule segmentation [3, 4] attempt to use weakly labeled data. However, restricted by rough annotations in biomedical images, they do not perform very well, since they produce rough boundary segmentation of pulmonary nodules and incur considerable false positives. On the other hand, a deep active learning framework [5] was proposed to annotate samples during network training. Although able to make good use of fully-annotated samples, this approach did not utilize abundant unannotated samples in model training.

Diabetic Retinopathy Identification Diabetic Retinopathy (DR) is one of the most severe complications of diabetes, which can cause vision loss or even blindness. DR can be identified by ophthalmologists based on the type and count of lesions observable in a retinal fundus image. Usually, the severity of DR is rated on a scale of 0 to 4: normal, mild, moderate, severe, and proliferative. As shown in Fig. 6.2b, grades 1–3

Fig. 6.1 Examples of CT lung image slices in the LIDC-IDRI dataset [2]. Pulmonary nodules are marked with red boxes


Fig. 6.2 a Missing annotated lesions in images. Yellow dotted boxes are ophthalmologists’ notes and blue arrows indicate missing annotation. b DR grades can be identified by the types and count of lesions (yellow: MA, blue: HE, green: EXU, and red: RNV). The lesions for Grade 4 are different from those of other grades

are classified as non-proliferative DR (NPDR), and can be identified by the amount of lesions including microaneurysm (MA), hemorrhages (HE), and exudate (EXU). Grade 4 is proliferative DR (PDR), whose lesions (such as retinal neovascularization (RNV)) are different from those of other grades. Identifying DR from retinal fundus images is time-consuming and labor-intensive; thus, it is important to develop an automatic method to assist DR diagnosis for better efficiency and reduced expert labor.

To fully utilize the features of lesions in retinal fundus images for identifying DR, one class of methods first detects lesions for further classification. Dai et al. [6] tried to detect lesions using clinical reports. van Grinsven et al. [7] sped up model training by selective data sampling for HE detection. Seoud et al. [8] used hand-crafted features to detect retinal lesions and identify DR grade. Yang et al. [9] proposed a two-stage framework for both lesion detection and DR grading using annotation of locations including MA, HE, and EXU. However, there are still difficulties to handle: (i) A common problem is that usually not all lesions are annotated. In retinal fundus images, the amount of MA and HE is often relatively large, and experts may miss some lesions (e.g., see Fig. 6.2a), which can be treated as negative samples (i.e., background) and thus are "noise" to the model. (ii) Not all kinds of lesions are beneficial to distinguishing all DR grades. For example, DR grade 4 (PDR) can be identified using RNV lesions, but has no direct relationship with MA and HE lesions (see Fig. 6.2b).

In this chapter, we introduce a novel Deep Active Self-paced Learning (DASL) strategy, based on bootstrapping, to reduce annotation effort [5, 10] in the previously presented applications, i.e., volumetric instance-level pulmonary nodule segmentation from CT scans and DR identification from retinal fundus images. To alleviate the lack of fully-annotated samples and make use of unannotated samples, our proposed DASL strategy combines Active Learning (AL) [11] with Self-Paced Learning (SPL) [12] strategies. Figure 6.3 outlines the main steps of our strategy. Starting with annotated samples, we train our CNNs and use them to predict unannotated samples. After ranking the confidence and uncertainty of each test sample, we utilize high-confidence and high-uncertainty samples in self-paced and active annotation learning [13], respectively, and add them to the training set to fine-tune the CNNs. The testing and


Fig. 6.3 Our proposed Deep Active Self-paced Learning (DASL) strategy

fine-tuning of CNNs repeat until the Active Learning process is terminated. Experimental results on the LIDC-IDRI lung CT dataset [2] and on our private retinal fundus dataset show that our DASL strategy is effective for annotation effort reduction.

Preliminary versions of this work have been presented at the 2018 MICCAI Conference [14, 15]. In this chapter, we extend our method in [14, 15] in the following ways: (1) evaluating and further analyzing its performance for DR identification, (2) presenting additional discussions of the experimental results that were not included in the conference versions [14, 15].

6.2 The Deep Active Self-paced Learning Strategy The Deep Active Self-paced Learning (DASL) strategy is a combination of Active Learning (AL) [11] and Self-Paced Learning (SPL) [12] that alleviates the lack of fully-annotated samples and makes use of unannotated samples.

Active Learning Strategy Active Learning attempts to overcome the annotation bottleneck by querying the most confusing unannotated instances for further annotation [11]. We utilize a straightforward strategy to select confusing samples during model training, different from [5], which applied a set of fully convolutional networks (FCN) for sample selection. The calculation of this sample uncertainty is defined as:

$$ U_d = 1 - \max(P_d, 1 - P_d), \quad (6.1) $$

where U_d denotes the uncertainty of the dth sample and P_d denotes the posterior probability of the dth sample. Note that the initial training set is often too small to cover the entire population distribution. Thus, there are usually a lot of samples that a deep learning model is not (yet) trained with. It is not advisable to extensively annotate samples of similar patterns in one iteration, so the calculation of sample uncertainty needs to take this similarity into account. As in [5], we use cosine similarity to estimate the similarity between volumes. Therefore, the uncertainty of the dth volume is defined as:

$$ U_d = \left(1 - \max(P_d, 1 - P_d)\right) \times \left( \frac{\sum_{j=1}^{D} \mathrm{sim}(P_d, P_j) - 1}{D - 1} \right)^{\beta}, \quad (6.2) $$

where D denotes the number of unannotated samples, sim(·) denotes cosine similarity, P_j denotes the posterior probability of the jth sample, and β is a hyper-parameter that controls the relative importance of the similarity term. Note that when β = 0, this definition degenerates to the least-confident uncertainty as defined in Eq. (6.1). We set β = 1 in our experiments. In each iteration, after acquiring the uncertainty of each unannotated sample, we select the top N samples for annotation and add them to the training set for further fine-tuning.

Self-paced Learning Strategy Self-Paced Learning (SPL) was inspired by the learning process of humans/animals that gradually incorporates easy-to-hard samples into training [12]. It utilizes unannotated samples by considering both prior knowledge and the learning acquired during training [13]. Formally, let L(w; x_i, p_i) denote the loss function of the CNN, where w denotes the model parameters, and x_i and p_i denote the input and output of the model, respectively. SPL aims to optimize the following function:

$$ \min_{w,\, v \in [0,1]^n} E(w, v; \lambda, \Psi) = C \sum_{i=1}^{n} v_i L(w; x_i, p_i) + f(v; \lambda), \quad \text{s.t.}\ v \in \Psi \quad (6.3) $$

where v = [v_1, v_2, ..., v_n]^T denotes the weight variables reflecting the samples' confidence, f(v; λ) is a self-paced regularization term that controls the learning strategy, λ is a parameter for controlling the learning pace, Ψ is a feasible region that encodes the information of a predetermined curriculum, and C is a standard regularization parameter for the trade-off between the loss function and the margin. We set C = 1 in our experiments. A self-paced function should satisfy the following three conditions [16]: (1) f(v; λ) is convex with respect to v ∈ [0, 1]^n. (2) The optimal weight of each sample v_i^* should be monotonically decreasing with respect to its corresponding loss l_i. (3) ‖v‖_1 = Σ_{i=1}^{n} v_i should be monotonically increasing with respect to λ.

To linearly discriminate the samples with their losses, the regularization function of our learning scheme is defined as follows [16]:

$$ f(v; \lambda) = \lambda \left( \frac{1}{2} \|v\|_2^2 - \sum_{i=1}^{n} v_i \right), \quad (6.4) $$

With Ψ = [0, 1]^n, the partial gradient of Eq. (6.3) using our learning scheme is equal to

$$ \frac{\partial E}{\partial v_i} = C l_i + v_i \lambda - \lambda = 0, \quad (6.5) $$


where E denotes the objective in Eq. (6.3) with a fixed w, and l_i denotes the loss of the ith sample. The optimal solution for E is given by Eq. (6.7) below. Note that since the labels of unannotated samples are unknown, it is challenging to calculate their losses. We allocate each "pseudo-label" by Eq. (6.6):

$$ y_i^* = \underset{y_i \in \{0, 1\}}{\arg\min}\ l_i, \quad (6.6) $$

$$ v_i^* = \begin{cases} 1 - \dfrac{C l_i}{\lambda}, & C l_i < \lambda \\ 0, & \text{otherwise,} \end{cases} \quad (6.7) $$

For pace parameter updating, we set the initial pace as λ0 . For the tth iteration, we compute the pace parameter λt as follows:

$$ \lambda_t = \begin{cases} \lambda_0, & t = 0 \\ \lambda_{t-1} + \alpha \times \eta_t, & 1 \le t < \tau \\ \lambda_{t-1}, & t \ge \tau, \end{cases} \quad (6.8) $$

where α is a hyper-parameter that controls the pace increasing rate, η_t is the average accuracy in the current iteration, and τ is a hyper-parameter for controlling the pace update. Note that, based on the third condition defined above, ‖v‖_1 = Σ_{i=1}^{n} v_i should be monotonically increasing with respect to λ. Since v ∈ [0, 1]^n, the updating of the parameter λ should be stopped after a few iterations. Thus, we introduce the hyper-parameter τ to control the pace updating. To verify the relationship between AL and SPL in DASL, we use a sequence of "SPL-AL-SPL" to fine-tune the models in Sect. 6.3 and Sect. 6.4.
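To make the selection mechanics concrete, the following is a minimal NumPy sketch of the quantities used in one DASL iteration: the similarity-weighted uncertainty of Eq. (6.2) for active selection, the pseudo-labels and self-paced weights of Eqs. (6.6)–(6.7), and the pace update of Eq. (6.8). The function names, the binary (two-column) layout of the loss matrix, and the value of τ are our own illustration; β = 1 and C = 1 follow the text, while λ_0 = 0.005, α = 0.002, and N = 20 follow the nodule experiments reported in Sect. 6.3.

```python
import numpy as np

def al_uncertainty(p_pos, outputs, beta=1.0):
    """Eq. (6.2): least-confident uncertainty weighted by the average cosine
    similarity of a sample's output to the other unannotated samples.
    p_pos:   (D,)   positive-class posterior P_d of each unannotated sample
    outputs: (D, m) network outputs (e.g. flattened probability maps) used
             for the similarity term."""
    D = p_pos.shape[0]
    least_conf = 1.0 - np.maximum(p_pos, 1.0 - p_pos)
    normed = outputs / (np.linalg.norm(outputs, axis=1, keepdims=True) + 1e-12)
    sim = normed @ normed.T                               # pairwise cosine similarities
    sim_term = (sim.sum(axis=1) - 1.0) / max(D - 1, 1)    # drop each self-similarity of 1
    return least_conf * sim_term ** beta

def spl_pseudo_labels_and_weights(losses, lam, C=1.0):
    """Eqs. (6.6)-(6.7): give each unannotated sample the pseudo-label with the
    smaller loss, then weight it by 1 - C*l/lambda when C*l < lambda, else 0.
    losses: (D, 2) loss of each sample under candidate labels 0 and 1."""
    pseudo = losses.argmin(axis=1)                        # Eq. (6.6)
    l = losses.min(axis=1)
    v = np.where(C * l < lam, 1.0 - C * l / lam, 0.0)     # Eq. (6.7)
    return pseudo, v

def update_pace(lam_prev, t, eta_t, lam0=0.005, alpha=0.002, tau=10):
    """Eq. (6.8): start at lambda_0, grow by alpha * eta_t (eta_t = average
    accuracy of the current iteration) while t < tau, then freeze the pace."""
    if t == 0:
        return lam0
    return lam_prev + alpha * eta_t if t < tau else lam_prev

def select_for_annotation(p_pos, outputs, n_query=20):
    """Active Learning step: indices of the top-N most uncertain samples."""
    return np.argsort(-al_uncertainty(p_pos, outputs))[:n_query]
```

In each iteration, the samples returned by select_for_annotation would be sent for expert annotation, while unannotated samples with v > 0 would enter fine-tuning with their pseudo-labels and weights, matching the "SPL-AL-SPL" schedule described above.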

6.3 DASL for Pulmonary Nodule Segmentation Due to the sparse distribution of pulmonary nodules in CT volumes [2], employing 3D fully convolutional networks (e.g., [17, 18]) to semantically segment them may suffer from the class imbalance issue. Building on 3D image segmentation work [18, 19] and Mask R-CNN [20], we propose a 3D region-based network, named Nodule R-CNN, that provides an effective way for pulmonary nodule segmentation. When trained with DASL, Nodule R-CNN achieves promising results with only a small amount of training data. To the best of our knowledge, this is the first work on pulmonary nodule instance segmentation in 3D images, and the first work to train 3D CNNs using both AL and SPL.

Fig. 6.4 The detailed architecture of our Nodule R-CNN (BN: batch normalization; Conv(3, 2): 3×3×3 convolutional layer with stride 2; Conv(1): 1×1×1 convolutional layer; Deconv(2, 2): 2×2×2 deconvolutional layer with stride 2)

6.3.1 Nodule R-CNN Building on recent advances of convolutional neural networks such as Region Proposal Networks (RPN) [21], Feature Pyramid Networks (FPN) [22], Mask R-CNN [20], and DenseNet [23], we develop a novel deep region-based network for pulmonary nodule instance segmentation in 3D CT images. Figure 6.4 illustrates the detailed architecture of our proposed Nodule R-CNN. Like Mask R-CNN [20], our network has a convolutional backbone architecture for feature extraction, a detection branch that outputs class labels and bounding-box offsets, and a mask branch that outputs object masks. In our backbone network, we extract diverse features at different levels by exploring an FPN-like architecture, which is a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Three 3D DenseBlocks [19] with a growth rate of 12 are used to ease network training by preserving maximum information flow between layers and to avoid learning redundant feature maps by encouraging feature reuse. Deconvolution is adopted to ensure that the size of the feature map is consistent with the size of the input volume. Our model employs an RPN-like architecture to output classification results and bounding-box regression results. The architecture provides three anchors for each detected location. We use a patch-based training and testing strategy instead of using RoIAlign [20] to extract feature maps from RoIs due to limited GPU memory. In the mask branch, we utilize RoIPool to extract a small feature map from each RoI, and a Fully Convolutional Network (FCN) to generate the final label map of pulmonary nodule segmentation. In the final label map, the value of each voxel a represents the probability of a being a voxel of a pulmonary nodule. We define a multi-task loss on each sampled RoI as L = L cls + L box + L mask , where the classification loss L cls and the bounding-box loss L box are defined as in [24]. We define our segmentation loss L mask as Dice loss (since the outputs of the


models trained with Dice loss are almost binary, and appear visually cleaner [18, 25]). Specifically, the Dice loss is defined as:

$$ L_{Dice} = - \frac{2 \sum_i p_i y_i}{\sum_i p_i + \sum_i y_i}, \quad (6.9) $$

where pi ∈ [0, 1] is the ith output of the last layer in the mask branch passed through a sigmoid non-linearity and yi ∈ {0, 1} is the corresponding label.
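As a concrete reference for L_mask, a minimal PyTorch sketch of the Dice loss in Eq. (6.9) follows; the small smoothing constant eps is our own addition for numerical stability and is not part of the definition above.

```python
import torch

def dice_loss(mask_logits, mask_targets, eps=1e-6):
    """Eq. (6.9): L_Dice = -2 * sum_i(p_i * y_i) / (sum_i p_i + sum_i y_i).
    mask_logits:  raw outputs of the mask branch (any shape)
    mask_targets: binary ground-truth mask of the same shape"""
    p = torch.sigmoid(mask_logits)      # p_i in [0, 1]
    y = mask_targets.float()            # y_i in {0, 1}
    return -(2.0 * (p * y).sum()) / (p.sum() + y.sum() + eps)

# The per-RoI training objective is then L = L_cls + L_box + L_mask,
# with L_mask computed by dice_loss on the mask branch output.
```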

6.3.2 Experiments of Pulmonary Nodule Segmentation We evaluate our proposed Nodule R-CNN with DASL using the LIDC-IDRI dataset [2]. Our experimental results are given in Table 6.1. The LIDC-IDRI dataset contains 1010 CT scans (see [2] for more details of this dataset). In our experiments, all nodules are used except those with a diameter < 3 mm, and each scan is resized to 512×512×512 voxels by linear interpolation. The inputs for our model are 3D patches of size of 128×128×128 voxels, which are cropped from CT volumes. 70% of the input patches contain at least one nodule. For this part of the inputs, segmentation masks are cropped to 32×32×32 voxels with nodules centering in them. We obtain the rest of the inputs by randomly cropping scans that very likely contain no nodule. The output size of the detection branch is 32×32×32×3×5, where the second last dimension represents the 3 anchors and the last dimension corresponds to the classification results and bounding-box regression results. In our experiments, 10% of the whole dataset are randomly selected as the validation set. We use a small subset of scans to train the initial Nodule R-CNN and the remaining samples are gradually added to the training set during the DASL process. First, we evaluate our Nodule R-CNN for pulmonary nodule instance segmentation. As shown in Table 6.1, we achieve a Dice of 0.64 and a Dice over truly detected nodules (TP Dice) of 0.95, both of which are the best results among state-of-the-art methods.

Table 6.1 Results on the LIDC-IDRI dataset for pulmonary nodule segmentation

Method                                         Dice, Mean (±SD)   TP Dice, Mean (±SD)
Method in [4]                                  0.55 (±0.33)       0.74 (±0.14)
Nodule R-CNN with DASL (50 initial samples)    0.56 (±0.45)       0.87 (±0.09)
Nodule R-CNN with DASL (100 initial samples)   0.59 (±0.45)       0.90 (±0.05)
Nodule R-CNN with DASL (150 initial samples)   0.62 (±0.43)       0.92 (±0.03)
Nodule R-CNN (full training samples)           0.64 (±0.44)       0.95 (±0.12)


Fig. 6.5 Comparison using different amounts of initial annotated inputs for DASL: the solid lines are for the SPL process, the dotted lines are for the AL process, and the dashed line is for the current state-of-the-art result using full training samples

We then evaluate the combination of Nodule R-CNN and the DASL strategy. In our experiments, α is set to 0.002 and λ_0 is set to 0.005, due to the high confidence of positive prediction. To verify the impact of different amounts of initial annotated samples, we conduct three experiments with 50, 100 and 150 initial annotated samples, respectively. Figure 6.5 summarizes the results. We find that, in DASL, when using fewer initial annotated samples to train Nodule R-CNN, SPL tends to incorporate more unannotated samples. This makes sense since the model trained with fewer samples does not learn enough patterns and is likely to allocate high confidence to more unseen samples. One can see from Fig. 6.5 that although the amount of samples selected by AL is quite small (N = 20 in our experiments), AL does help achieve a higher Dice. Experimental results are shown in Table 6.1. We find that more initial annotated samples bring better results, and the experiment with 150 initial annotated samples gives the best results among our experiments on DASL, which is comparable to the performance of Nodule R-CNN trained with all samples (Fig. 6.6).

6.4 DASL for Diabetic Retinopathy Identification For identifying DR, we develop a new framework using retinal fundus images based on annotations that include DR grades and bounding boxes of MA and HE lesions (possibly with a few missing annotated lesions). We first extract lesion information into a lesion map by a detection model, and then fuse it with the original image for DR identification. To deal with noisy negative samples induced by missing annotated lesions, our detection model uses center loss [26] to cluster the features of similar samples around a feature center called Lesion Center, and a new sampling method, called Center-Sample, to find noisy negative samples by measuring their features’ similarity to the Lesion Center. In the classification stage, we integrate feature maps of the original images and lesion maps using an Attention Fusion Network (AFN), and evaluate the AFN trained with DASL strategy. As shown in Fig. 6.7, AFN can learn the weights between the original images and lesion maps when identifying different


Fig. 6.6 Some 2D visual examples of pulmonary nodule instance segmentation results on CT lung volumes: (a) ground truth; (b) results of Nodule R-CNN (full annotated samples). The pixels in white belong to segmented pulmonary nodules, and the pixels in black are background. As can be seen, our Nodule R-CNN model can attain accurate and clear results

Fig. 6.7 The center-sample detector (left) predicts the probabilities of the lesions using the anti-noise center-sample module. Then AFN (right) uses the original image and detection model output as input to identify DR (f_les and f_ori are feature maps, W_les and W_ori are attention weights)

DR grades and can reduce the interference of unnecessary lesion information on classification. DASL helps AFN achieve promising results with few training data.

6.4.1 Center-Sample Detector The Center-Sample detector aims to detect n types of lesions (here, n = 2, for MA and HE) in a fundus image. Figure 6.7 gives an overview of the Center-Sample detector, which is composed of three main parts: shared feature extractor, classification/bounding box detecting header, and Noisy Sample Mining module.


The first two parts form the main network for lesion detection to predict the lesion probability map. Their main structures are adapted from SSD [27]. The backbone until conv4_3 is used as the feature extractor, and the detection headers are the same as in SSD. The third part includes two components: Sample Clustering for clustering similar samples and Noisy Sample Mining for determining the noisy samples and reducing their sampling weight.

Sample Clustering We adapt center loss from classification tasks to detection tasks to cluster similar samples. By adding 1×1 convolution layers after the shared feature extractor, this component transforms the feature map from the shared feature extractor, which is a tensor of size h × w × c, to a feature map u of size h × w × d (d ≪ c). Each position u_ij in u is a d-D feature vector, mapped from a corresponding position patch f_ij in the original image to a high-dimensional feature space S, where f_ij denotes the receptive field of u_ij. We assign each u_ij a classification label, so there are n + 1 label classes in total, including the background (no lesion in the corresponding location). Then, we average the deep features u_ij of each class to obtain n + 1 feature centers (the centers of positive labels are called lesion centers), and make the u_ij cluster around their corresponding center in the space S using the center loss [26] (in Fig. 6.7, the triangles denote the centers):

$$ L_C = \frac{1}{2} \sum_{i=1}^{w} \sum_{j=1}^{h} \| u_{ij} - c_{y_{ij}} \|_2^2, \quad (6.10) $$

where y_ij ∈ [0, n] is the corresponding label of u_ij in location (i, j), and c_{y_ij} ∈ R^d is the center of the y_ij-th class. During the detection training phase, we minimize L_C and simultaneously update the feature centers using the Stochastic Gradient Descent (SGD) algorithm in each iteration, to make the u_ij cluster to the center c_{y_ij}; this converges well after several iterations.

Noisy Sample Mining In the Noisy Sample Mining module, we reduce the impact of noisy negative samples by down-weighting them. First, for each u_ij labeled as a negative sample, we select the minimum L2 distance between u_ij and all lesion centers, denoted by min-dist_ij, and sort all elements in min-dist in increasing order. Then, the sampling probability P(u_ij) is assigned as:

$$ P(u_{ij}) = \begin{cases} 0, & 0 < r_{ij} < t_l \\ \left( \dfrac{r_{ij} - t_l}{t_u - t_l} \right)^{\gamma}, & t_l \le r_{ij} < t_u \\ 1, & r_{ij} \ge t_u \end{cases} \quad (6.11) $$

where ri j is the rank of u i j in min-dist. Note that u i j is close to lesion centers if ri j is small. The lower bound tl and upper bound tu of sampling ranking and γ are three hyper-parameters. We treat the summation of LC and detection loss in [27] as multi-task loss for robustness. Different from center loss in [26], a large number of deep features ensures the stability of small batch size in our method.
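To illustrate how Eqs. (6.10) and (6.11) fit together, here is a small PyTorch-style sketch of the center loss over the d-dimensional feature map u and of the rank-based sampling probability that down-weights suspected noisy negatives. The tensor shapes, helper names, and the value of γ are assumptions for illustration; the chapter does not specify them.

```python
import torch

def center_loss(u, labels, centers):
    """Eq. (6.10): pull each d-D feature u_ij towards the center of its class.
    u:       (h, w, d) deep feature map
    labels:  (h, w)    integer class label y_ij of each position (0 = background)
    centers: (n+1, d)  feature centers, updated alongside SGD during training"""
    diff = u - centers[labels]                        # broadcasts to (h, w, d)
    return 0.5 * diff.pow(2).sum(dim=-1).sum()

def sampling_probability(u, labels, centers, t_l, t_u, gamma=2.0):
    """Eq. (6.11): rank every negative position by its minimum distance to the
    lesion centers (min-dist); the closest ranks are likely missing annotations
    and get probability 0, followed by a smooth ramp up to 1."""
    neg_feats = u[labels == 0]                        # (N_neg, d)
    lesion_centers = centers[1:]                      # exclude the background center
    min_dist = torch.cdist(neg_feats, lesion_centers).min(dim=1).values
    rank = min_dist.argsort().argsort() + 1           # rank 1 = closest to a lesion center
    # the clamp reproduces the three cases: 0 below t_l, the ramp on [t_l, t_u), 1 above
    return ((rank.float() - t_l) / (t_u - t_l)).clamp(0.0, 1.0) ** gamma
```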


During the training phase, we train the model with cropped patches of the original images that include lesions. During the inference phase, a whole image is fed to the trained model, and the output is a tensor M of size h × w × n, where every n-D vector M_ij in M denotes the maximum probability among all anchor boxes in this position for each lesion. We take this tensor, called the Lesion Map, as the input of the Attention Fusion Network.

Experiments and Results A private dataset was provided by a local hospital, which contains 13 k abnormal (more severe than grade 0) fundus images of size about 2000 × 2000. Lesion bounding boxes were annotated by ophthalmologists, including 25 k MA and 34 k HE lesions, with about 26% missing annotated lesions. The common object detection metric mAP is used as the evaluation metric since it reflects the precision and recall of each lesion. In our experiments for this stage, we select MA and HE lesions as the detection targets since other types of lesions are clear even in compressed images (512 × 512). During training, we train the model with cropped patches (300 × 300) which include annotated lesions from the original images. Random flips are applied as data augmentation. We use SGD (momentum = 0.9, weight decay = 10^{-5}) as the optimizer, and the batch size is 16. The learning rate is initialized to 10^{-3} and divided by 10 after 50 k iterations. When training the Center-Sample detector, we first use center loss and detection loss as a multi-task loss for pre-training. Then the Center-Sample mechanism is included after 10 k training steps. t_l and t_u are set to the 1st and 5th percentiles among all deep features in one batch.

We evaluate the effects of the Center-Sample components by adding them to the detection model one by one. Table 6.2 shows that the base detection network (BaseNet), which is similar to SSD, gives mAP = 41.7%. After using center loss as one part of the multi-task loss, it rises to 42.2%. The Center-Sample strategy further adds 1.4% to it, with the final mAP = 43.6%. The results show the robustness of our proposed method against the missing annotation issue. Figure 6.8 visualizes some regions where deep features are close to lesion centers.

6.4.2 Attention Fusion Network As stated in Sect. 6.1, some lesion information can be noise to identifying certain DR grades. To solve this issue, we propose an information fusion method based on the

Table 6.2 Results of the center-sample components

BaseNet   Center Loss   Center-Sample   mAP (%)
√                                       41.7
√         √                             42.2
√         √             √               43.6


Fig. 6.8 Missing annotated samples determined by the center-sample module

attention mechanism [28], called Attention Fusion Network (AFN), and then evaluate the DASL strategy on it. AFN can produce the weights based on the original images and lesion maps to reduce the impact of unneeded lesion information for identifying different DR grades. It contains two feature extractors and an attention network (see Fig. 6.7). Two separate feature extractors first extract feature maps f_ori and f_les of the scaled original images and lesion maps, respectively. Then, f_ori and f_les are concatenated on the channel dimension as the input of the attention network. The attention network consists of a 3 × 3 Conv, a ReLU, a dropout, a 1 × 1 Conv, and a Sigmoid layer. It produces two weight maps W_ori and W_les, which have the same shape as the feature maps f_ori and f_les, respectively. Then, we compute the weighted sum f(i, j, c) of the two feature maps as follows:

$$ f(i, j, c) = W_{ori}(i, j, c) \circ f_{ori}(i, j, c) + W_{les}(i, j, c) \circ f_{les}(i, j, c), \quad (6.12) $$

where ◦ denotes element-wise product. The weights W_ori and W_les are computed as W(i, j, c) = 1 / (1 + e^{-h(i, j, c)}), where h(i, j, c) is the last-layer output before the Sigmoid produced by the attention network. W(i, j, c) reflects the importance of the feature at position (i, j) and channel c. The final output is produced by performing a softmax operation on f(i, j, c) to get the probabilities of all grades.

Experiments and Results The private dataset used (which is different from the one for evaluating Center-Sample above) contains 40 k fundus images, with 31 k/3 k/4 k/1.1 k/1 k images for DR grades 0 to 4, respectively, rated by ophthalmologists. We use two ResNet-18 [29] as the feature extractors for both inputs. The preprocessing includes cropping the images and resizing them to 224 × 224. Random rotations/crops/flips are used as data augmentation. First, we train AFN with the SGD optimizer on the private dataset as shown in Table 6.3. Baseline only employs scaled original images as input to ResNet-18 for training. We re-implement the feature fusion method in [9], called Two-stage. Another fusion method that concatenates lesion maps and scaled images on the channel dimension (called Concated) is compared, since both these inputs contribute equally to identifying DR with this method. All models are trained for 300 k iterations with the initial learning rate = 10^{-5}, divided by 10 at iterations 120 k and 200 k. Weight decay and momentum are set to 0.1 and 0.9. Our approach outperforms the other methods considerably.
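Returning to the fusion step itself, Eq. (6.12) together with the attention network described above (3 × 3 Conv, ReLU, dropout, 1 × 1 Conv, Sigmoid) can be sketched in PyTorch roughly as follows; the intermediate channel count, the dropout rate, and the module name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Produces W_ori and W_les from the concatenated feature maps and returns
    the weighted sum of Eq. (6.12); channels = channel count of f_ori / f_les."""
    def __init__(self, channels, dropout=0.5):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout2d(dropout),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),          # W(i, j, c) = 1 / (1 + exp(-h(i, j, c)))
        )

    def forward(self, f_ori, f_les):
        weights = self.attention(torch.cat([f_ori, f_les], dim=1))
        w_ori, w_les = weights.chunk(2, dim=1)   # same shapes as f_ori and f_les
        return w_ori * f_ori + w_les * f_les     # Eq. (6.12), element-wise products
```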


Table 6.3 Results on private dataset

Algorithms   Kappa   Accuracy
Baseline     0.786   0.843
Two-stage    0.804   0.849
Concated     0.823   0.854
AFN          0.875   0.873

Fig. 6.9 Results of using different amounts of initial annotated inputs for AFN with DASL. The solid lines are for the SPL process, the dotted lines are for the AL process, and the dashed line is for the current state-of-the-art result using full training samples

Then, we evaluate the combination of AFN and the DASL strategy. In this stage, α and λ_0 are both set to 0.01. To compare the effect of different amounts of initial annotated samples, we conduct three experiments with 2, 3.5 and 5 k initial annotated samples, which account for only 6.25, 7.81 and 15.6% of the training set, respectively. One can see from Fig. 6.9 that although the amount of samples selected by AL is quite small (N = 400 in our experiments), AL does help achieve better performance. As shown in Fig. 6.9, instead of using all annotated samples, DASL helps the model use only a small part of the annotated samples to reach a kappa score of 0.784.

6.5 Conclusions In this chapter, we first presented a novel Deep Active Self-paced Learning strategy for biomedical image analysis. Then we proposed a region-based framework for instance-level pulmonary nodule segmentation, and a classification framework for


diabetic retinopathy identification. Finally, we evaluated DASL on the two tasks, respectively. Experimental results on publicly available datasets show that our proposed frameworks achieve state-of-the-art performance in each task, and that DASL works well for annotation effort reduction. Our DASL strategy is general and can be easily extended to other biomedical image analysis applications with limited training data.

Acknowledgements The research of Jian Wu was partially supported by the Ministry of Education of China under grant No. 2017PT18, the Zhejiang University Education Foundation under grants No. K18-511120-004, No. K17-511120-017, and No. K17-518051-021, the Major Scientific Project of Zhejiang Lab under grant No. 2018DG0ZX01, and the National Natural Science Foundation of China under grant No. 61672453. The research of Danny Z. Chen was supported in part by NSF Grant CCF-1617735.

References 1. Gonçalves, L., Novo, J., Campilho, A.: Hessian based approaches for 3D lung nodule segmentation. Expert. Syst. Appl. 61, 1–15 (2016) 2. Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011) 3. Messay, T., Hardie, R.C., Tuinstra, T.R.: Segmentation of pulmonary nodules in computed tomography using a regression neural network approach and its application to the lung image database consortium and image database resource initiative dataset. Med. Image Anal. 22(1), 48–62 (2015) 4. Feng, X., Yang, J., Laine, A.F., Angelini, E.D.: Discriminative localization in CNNs for weaklysupervised segmentation of pulmonary nodules. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 568–576. Springer (2017) 5. Yang, L., Zhang, Y., Chen, J., Zhang, S., Chen, D.Z.: Suggestive annotation: a deep active learning framework for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 399–407. Springer (2017) 6. Dai, L., Sheng, B., Wu, Q., Li, H., Hou, X., Jia, W., Fang, R.: Retinal microaneurysm detection using clinical report guided multi-sieving CNN. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 525–532. Springer (2017) 7. van Grinsven, M.J.J.P., van Ginneken, B., Hoyng, C.B., Theelen, T., Sanchez, C.I.: Fast convolutional neural network training using selective data sampling: application to hemorrhage detection in color fundus images. IEEE Trans. Med. Imaging 35(5), 1273–1284 (2016) 8. Seoud, L., Hurtut, T., Chelbi, J., Cheriet, F., Langlois, J.M.P.: Red lesion detection using dynamic shape features for diabetic retinopathy screening. IEEE Trans. Med. Imaging 35(4), 1116–1126 (2015) 9. Yang, Y., Li, T., Li, W., Wu, H., Fan, W., Zhang, W.: Lesion detection and grading of diabetic retinopathy via two-stages deep convolutional neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 533–540. Springer (2017) 10. Li, X., Zhong, A., Lin, M., Guo, N., Sun, M., Sitek, A., Ye, J., Thrall, J., Li, Q.: Self-paced convolutional neural network for computer aided detection in medical imaging analysis. In: International Workshop on Machine Learning in Medical Imaging, pp. 212–219. Springer (2017) 11. Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin – Madison (2009)


12. Kumar, M.P., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: Advances in Neural Information Processing Systems, pp. 1189–1197 (2010) 13. Lin, L., Wang, K., Meng, D., Zuo, W., Zhang, L.: Active self-paced learning for cost-effective and progressive face identification. IEEE Trans. Pattern Anal. Mach. Intell. 40(1), 7–19 (2018) 14. Wang, W., Lu, Y., Wu, B., Chen, T., Chen, D.Z., Wu, J.: Deep active self-paced learning for accurate pulmonary nodule segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 723–731. Springer (2018) 15. Lin, Z., Guo, R., Wang, Y., Wu, B., Chen, T., Wang, W., Chen, D.Z., Wu, J.: A framework for identifying diabetic retinopathy based on anti-noise detection and attention-based fusion. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 74–82. Springer (2018) 16. Jiang, L., Meng, D., Zhao, Q., Shan, S., Hauptmann, A.G.: Self-paced curriculum learning. In: AAAI Conference on Artificial Intelligence, pp. 2694–2700 (2015) 17. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432. Springer (2016) 18. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 4th IEEE International Conference on 3D Vision, pp. 565–571. IEEE (2016) 19. Yu, L., Cheng, J.Z., Dou, Q., Yang, X., Chen, H., Qin, J., Heng, P.A.: Automatic 3D cardiovascular MR segmentation with densely-connected volumetric convnets. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 287–295. Springer (2017) 20. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: IEEE International Conference on Computer Vision, pp. 2980–2988. IEEE (2017) 21. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015) 22. Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 936–944. IEEE (2017) 23. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. IEEE (2017) 24. Girshick, R.: Fast R-CNN. In: IEEE International Conference on Computer Vision, pp. 1440– 1448. IEEE (2015) 25. Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C.: The importance of skip connections in biomedical image segmentation. In: Deep Learning and Data Labeling for Medical Applications, pp. 179–187. Springer (2016) 26. Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision, pp. 499–515. Springer (2016) 27. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European conference on computer vision, pp. 21–37. Springer (2016) 28. Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. In: Computer Vision and Pattern Recognition, pp. 3640–3649. IEEE (2016) 29. 
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778. IEEE (2016)

Chapter 7

Deep Learning in Textural Medical Image Analysis Aiga Suzuki, Hidenori Sakanashi, Shoji Kido and Hayaru Shouno

Abstract One of the characteristics of medical image analysis is that several medical images are not in the structure domain like natural images but in the texture domain. This chapter introduces a new transfer learning method, called “two-stage feature transfer,” to analyze textural medical images by deep convolutional neural networks. In the process of the two-stage feature transfer learning, the models are successively pre-trained with both natural image dataset and textural image dataset to get a better feature representation which cannot be derived from either of these datasets. Experimental results show that the two-stage feature transfer improves the generalization performance of the convolutional neural network on a textural lung CT pattern classification. To explain the mechanism of a transfer learning on convolutional neural networks, this chapter also shows analysis results of the obtained feature representations by an activation visualization method, and by measuring the frequency response of trained neural networks, in both qualitative and quantitative ways, respectively. These results demonstrate that such successive transfer learning enables networks to grasp both structural and textural visual features and be helpful to extracting good features from the textural medical images.

A. Suzuki (B) · H. Sakanashi National Institute of Advanced Industrial Science and Technology (AIST), 1–1–1 Umezono, Tsukuba, Ibaraki, Japan e-mail: [email protected] H. Sakanashi e-mail: [email protected] University of Tsukuba, 1–1–1 Tennodai, Tsukuba, Ibaraki, Japan S. Kido Osaka University, 2–2, Yamadaoka, Suita, Osaka, Japan e-mail: [email protected] H. Shouno University of Electro-Communications, 1–5–1 Chofugaoka, Chofu, Tokyo, Japan e-mail: [email protected] © Springer Nature Switzerland AG 2020 Y.-W. Chen and L. C. Jain, Deep Learning in Healthcare, Intelligent Systems Reference Library 171, https://doi.org/10.1007/978-3-030-32606-7_7


7.1 Introduction In accordance with the purpose and application, the domains of medical image analysis can be broadly divided into two types. One is structural imagery, in which the salient local region is identified from backgrounds, e.g., tumor detection and semantic organ segmentation problems. In other words, the structural domain should be seen from a "microscopic" viewpoint, which means focusing on regional characteristics. The other type is textural imagery, in which broad spatial aspects are identified, e.g., the structural atypia in pathological diagnosis and most diffuse disease findings in non-pathological diagnoses. Such textural domains should be seen from a "macroscopic" viewpoint, which means taking a bird's eye view to comprehend the whole scene, instead of focusing on a particular region. This chapter introduces the deep learning approach to analyze textural medical images.

Deep convolutional neural networks (DCNNs), which are one type of deep learning, have been a de-facto standard solution for computer vision problems since Krizhevsky et al. and their DCNN architecture called "AlexNet" greatly succeeded at ILSVRC, the major large-scale image recognition competition, in 2012 [1]. Most deep learning approaches, including DCNNs, generally require a massive amount of annotated examples to perform well due to their lack of trainability. However, as techniques have been developed to effectively train DCNNs such as transfer learning and semi-supervised training, deep learning approaches have succeeded even with limited training data. In fact, due to such effective DCNN architectures evolving and accumulating their practical knowledge, DCNNs have become effective at not only solving large-scale natural imagery recognition tasks but also conducting medical image analysis, which usually faces a data deficiency problem. Nonetheless, most successful applications of DCNNs in medical image analysis have been limited to tumor detection and semantic segmentation problems, i.e., "structural" imagery [2]. DCNNs seem to be applied less to textural imagery than structural imagery, possibly due to DCNNs being conceived mainly for object recognition. In short, conventional DCNNs are hard to apply directly to textural imagery because their recognition mechanisms are unsuitable for textural analysis.

DCNNs originate from a traditional artificial neural model of mammalian visual systems, called "Neocognitron", proposed by Fukushima in 1980 [3]. Neocognitron is based on the primary visual cortex model proposed by Hubel and Wiesel in 1959 [4]. They found that our brains, in the primary visual cortex (V1), process the visual stimulus with two types of cells: simple cells (S-cells) and complex cells (C-cells). S-cells extract features of local shapes, such as lines and curves, corresponding to the edges between objects and background. C-cells work as deformed S-cells, which means C-cells tolerate small local shifts of stimulus. Hubel and Wiesel revealed that C-cells' functionalities could be modeled by spatially combining the S-cells' local feature activations. Also, they correctly deduced that in the whole of our visual systems, local features are integrated into the higher cortex along with broadening the receptive field and made into more abstract and broad-scale representations in


such hierarchical information processing. Neocognitron is the artificial model of this neural processing of the visual cortex. Modern DCNNs fundamentally have the same mechanism, without its learning algorithm, as Neocognitron. Such S-cells local feature extraction and C-cells spatial deformations correspond to convolution and spatial-pooling operations in modern DCNN architecture, respectively. Such mechanisms of DCNNs are well-suited to structural imagery, i.e., object recognition tasks. Several studies have reported that DCNNs, which can perform object recognition well, extract the local edge feature in their lower-level representations and the more complicated shape structure that can characterize the objects in their higher-level representations. These results correspond to biological facts of the hierarchical representation of our visual systems, such as V1 extracting the local edge structure with a Gabor-filter-like operation. However, the textural recognition mechanism, which has a slightly complicated flow, does not correspond very well. Some contrivance is needed to fully utilize DCNNs’ representative ability in textural recognition tasks. In this chapter, we introduce a novel transfer learning approach called “two-stage feature transfer learning” to enable DCNNs to extract good feature representations for visual textures proposed in our paper [5]. We use DCNNs to classify diffuse lung diseases (DLDs) in high-resolution X-ray computed tomography (HRCT) imagery, a typical example of textural medical imagery. In two-stage feature transfer learning, a DCNN is successively trained with a massive natural image dataset and textural image dataset to obtain a good textural feature representation to see the scene from both macro and microscopic viewpoints. Moreover, we try to answer an important question: how/why does feature transfer learning work in DCNNs so well? The mechanism of feature transfer of DCNNs has not been discussed, despite feature transfer being commonly used as a dominant way to generalize when training data is insufficient. We analyzed the feature representation extracted in feature transferred DCNNs both qualitatively and numerically and demonstrated they became able to grasp the feature representation that occurs in domains used in the transfer learning processes. These results could hint at how we should apply the transfer learning to textural medical image analysis such as selection of the transfer domain used for pre-training.

7.2 Method 7.2.1 Deep Convolutional Neural Networks (DCNNs) DCNNs are multi-layered neural networks that consist of hierarchical transformations called “layers.” Most DCNNs can be separated into two parts. One is a feature extraction part, which extracts a feature representation from inputs. The other is a classification part, which actually solves the given problem using the feature representation extracted by the feature extraction part. The multi-layered perceptron with


fully-connected layers, which is a traditional classifier neural network, is usually adopted as a classification part. The essence of DCNNs is the great ability of the feature extraction part to obtain a good contracted representation of inputs well-suited for a given task. Moreover, back-propagation algorithms enable DCNNs to obtain such good representations, which used to be handcrafted by experts in the manner of traditional image recognition, with end-to-end optimization along with the classification part. The feature extraction part of DCNNs principally consists of "convolution" and "spatial pooling" layers. Convolution layers serve as local feature extractors, corresponding to S-cells in Neocognitron. This layer maps inputs into other activation maps that emphasize different local features with a convolution operation in the same manner as in image processing. Spatial pooling layers spatially compress the activation map to tolerate small local deformations and to reduce the dimensionality of the activation map. This function corresponds to that of C-cells and contributes to abstracting hierarchical features in our visual information processing. In most cases, DCNN layers are written as a deterministic correspondence. Feature extraction processes of sequential DCNNs that have no skip connections (such as those in ResNet [6]) can be written as a composed function of the layers:

$$ H = h_i^{L_i} \circ \cdots \circ h_1^{L_1}, \quad (7.1) $$

where h_i^{L_i} denotes the ith layer of type L_i, e.g., convolution or spatial pooling. This trivial form, which expresses the DCNN as a composition of deterministic correspondences, will be important when we analyze the feature representation of DCNNs in a later section.
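One practical consequence of Eq. (7.1) is that any intermediate representation can be read out simply by truncating the composition, which is how the transferred features are inspected later in this chapter. A trivial sketch (the helper names are our own):

```python
def compose(layers):
    """Eq. (7.1): H = h_L o ... o h_1, built from the ordered list [h_1, ..., h_L]."""
    def H(x):
        for h in layers:
            x = h(x)
        return x
    return H

def intermediate_representation(layers, k):
    """Output of the first k layers only, i.e. h_k o ... o h_1."""
    return compose(layers[:k])
```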

7.2.2 Transfer Learning in DCNNs Transfer learning, in a loose sense, is a machine learning technique that utilizes the knowledge gained from non-target tasks, called “source domain,” to improve the model performance on a target task [7]. In DCNNs, transfer learning means reutilizing trainable parameters of networks, which were obtained from the learning result of non-target tasks, as an initial state of the DCNN. There are two common kinds of transfer learning of DCNNs: fine-tuning and feature transfer. In fine-tuning transfer learning, in a narrow sense, pretrained weights of the feature extraction part are fixed as the initial state. The classification part is initialized so as to suit the target task and trained to solve the given task. In other words, the fine-tuning approach uses the DCNN as a feature extractor and retrains only the classifier under the assumption that the transferred feature representation is suitable to solve the target task. In the feature transfer approach, the pretrained state is used as just an initial state, which means that the whole of the DCNN, including the feature extraction part, is retrained for the target task. This approach is slightly difficult to apply because not only the


classification part but also the feature extraction part needs to be trained. However, it can outperform the fine-tuning approach in terms of generalization to obtain more feasible feature representations during the additional retraining with the target task. In most transfer learning applications, including medical image analysis, a massive natural image dataset such as ImageNet is usually adopted as their source domain. Indeed, as we see below, transfer learning from natural images normally achieves good results in most tasks. However, the optimality of natural images as a source domain is slightly questionable when the target task involves images different from natural images, such as textural images. Moreover, the selection of the source domain strongly affects the generalization performance, as we see below too, because an inadequate choice of source domain could make generalization performance worse than when pretraining is not conducted. Thus, we should carefully apply transfer learning to particular target domains.
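The practical difference between the two variants can be made concrete with a short PyTorch sketch: freezing the transferred feature extractor and retraining only the classifier corresponds to fine-tuning in the narrow sense above, while leaving every parameter trainable corresponds to feature transfer. The features/classifier attribute names follow torchvision's AlexNet, the weights argument follows recent torchvision versions, and the learning rate is a placeholder.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

def build_transferred_model(num_classes, mode="feature_transfer", lr=1e-3):
    net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)   # ImageNet source domain
    net.classifier[-1] = nn.Linear(4096, num_classes)              # new classification head

    if mode == "fine_tuning":
        # Fine-tuning: keep the pretrained feature extractor fixed and
        # train only the (re-initialized) classification part.
        for p in net.features.parameters():
            p.requires_grad = False
        params = net.classifier.parameters()
    else:
        # Feature transfer: the pretrained weights are only an initial state;
        # the whole network, feature extractor included, is retrained.
        params = net.parameters()

    return net, optim.SGD(params, lr=lr, momentum=0.9)
```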

7.2.3 Two-Stage Transfer Learning Let us consider that when a radiologist becomes able to interpret complex radiograms, they learn about how to look at universal things in order to understand general visual stimulation like most children in early childhood. If DCNNs can be modeled on our visual systems, they will know “how to look at” basic scenes before learning how to look at difficult ones in the same way doctors learn how to make simple diagnoses before they learn how to make difficult diagnoses. A two-stage feature transfer learning, which can make the DCNNs suitable for the textural medical image analysis, is based on such point of view. A two-stage feature transfer learning is an extension of conventional feature transfer. Figure 7.1 shows a schematic diagram of two-stage feature transfer learning. In feature transfer learning, first, the DCNNs is trained with a massive number of natural images to classify objects in the same manner as conventional feature transfer learning. In this stage, DCNNs grasp the basic way to see general scenes like human babies growing up in a natural environment filled with the various visual stimulus. After that, in the second stage, the learning result from natural images is transferred again as an initial state of training to classify a massive number of textural images to learn how to grasp textural perspectives directly.1 Finally, in the third stage, DCNNs learn how to classify the target domain on the basis of the transferred knowledge obtained from natural and textural source domains. Such two-step feature transfer with different source domains gives a better feature representation that can show textural perspectives, which do not appear in the natural imagery directly.

1 Interchanging the order of source domains for pretraining, i.e., using textural images before natural images, makes the generalization performance worse because of the catastrophic forgetting of knowledge [8]. The first pretraining should be done with a natural image dataset, also considering the computational cost of training with natural images and the availability of pretrained models.


Fig. 7.1 Schematic diagram of two-stage feature transfer. The DCNN is first trained with natural images to obtain a good feature representation as the initial state. Subsequently, it transfers to the more effective domain (i.e., texture dataset) to obtain the feature representation suited for texture-like patterns. Then, finally, it trains with the target domain
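Anticipating the lung HRCT application of Sect. 7.3, the schedule in Fig. 7.1 could be written, under several assumptions, roughly as below: the data loaders, epoch counts, and the use of torchvision's AlexNet are placeholders of ours, while the small learning rate and momentum follow the training details given in Sect. 7.3.2, and the class counts (61 CUReT textures, 7 DLD patterns) follow Sect. 7.3.1.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

def train(net, optimizer, loader, epochs):
    ...  # standard supervised training loop with momentum-SGD (omitted)

def two_stage_feature_transfer(texture_loader, target_loader):
    # Stage 1: start from a DCNN pretrained on a massive natural image
    # dataset (ILSVRC 2012), i.e. the conventional transfer initial state.
    net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)

    # Stage 2: transfer that state and retrain the whole network on the
    # texture dataset (CUReT, 61 classes) to acquire textural representations.
    net.classifier[-1] = nn.Linear(4096, 61)
    train(net, optim.SGD(net.parameters(), lr=5e-4, momentum=0.9),
          texture_loader, epochs=30)

    # Stage 3: transfer once more and train on the target DLD patches
    # (seven HRCT pattern classes).
    net.classifier[-1] = nn.Linear(4096, 7)
    train(net, optim.SGD(net.parameters(), lr=5e-4, momentum=0.9),
          target_loader, epochs=30)
    return net
```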

7.3 Application of Two-Stage Feature Transfer for Lung HRCT Analysis 7.3.1 Materials This section demonstrates the effectiveness of two-stage feature transfer learning by giving an application example for classifying diffuse lung diseases (DLDs). DLDs is a collective term for lung disorders that spread to a large area of the lung. Some DLDs categorized into idiopathic pulmonary fibrosis (IPFs) can easily become immedicable. Thus, to make a better prognosis, DLDs need to be detected while they are still small and mild by using high-resolution X-ray computed tomography (HRCT). The classification of HRCT patterns of DLDs to determine how the disease has advanced is formulated in the textural domain because DLD conditions are seen as textural patterns as shown as Fig. 7.2. In our work, these patterns are classified into seven classes: consolidations (CON), ground-glass opacities (GGO), honeycombing (HCM), reticular opacities (RET), emphysematous changes (EMP), nodular opacities (NOD), and healthy/normal (NOR). These categorizations were introduced by Uchiyama et al. [9].2 2 Most

2 Most studies on DLDs classify them into six classes with no discrimination between RET and GGO. This work explicitly discriminates them in a more precise and up-to-date manner.


Fig. 7.2 Typical HRCT images of diffuse lung diseases: consolidations (CON); ground-glass opacities (GGO); honeycombing (HCM); reticular opacities (RET); emphysematous changes (EMP); nodular opacities (NOD); and healthy (normal) opacities (NOR)

The DLD image dataset was acquired from Osaka University Hospital, Osaka, Japan. We collected 117 HRCT scans from 117 different participants. Each slice was converted into a gray-scale image with a resolution of 512 × 512 pixels and a slice thickness of 1.0 mm. Lung region slices were annotated with the seven types of patterns by experienced radiologists; the annotated region shapes and their labels were the results of diagnoses by three physicians. The annotated CT images were partitioned into region-of-interest (ROI) patches of 32 × 32 pixels, corresponding to about 4 cm2. This is a small ROI size for DCNN input, so we magnified the patches to 224 × 224 pixels using bicubic interpolation. From these operations, we collected 169 patches for CONs, 655 for GGOs, 355 for HCMs, 276 for RETs, 4702 for EMPs, 827 for NODs, and 5726 for NORs. We then divided these patches into training and evaluation sets for the DCNN; note that not all classes contain patches from the same patients. For training, we used 143 CONs, 609 GGOs, 282 HCMs, 210 RETs, 4406 EMPs, 762 NODs, and 5371 NORs. The remaining 26 CONs, 46 GGOs, 73 HCMs, 66 RETs, 296 EMPs, 65 NODs, and 355 NORs were used for evaluation.

In two-stage feature transfer learning, we need to choose both a natural and a textural dataset as source domains. We used the ILSVRC 2012 dataset, a subset of ImageNet, as the natural image dataset. Its advantage is that pretrained models are readily available, because this dataset is the de-facto baseline of natural image recognition. We used the Columbia-Utrecht Reflectance and Texture (CUReT) database as the texture dataset. Figure 7.3 shows examples of textural images in the CUReT database. This database contains macro photographs of 61 classes of real-world textures, each with approximately 200 samples taken under various illuminations and viewing angles. This dataset, which contains colored textures at various coarse-to-fine scales, is well suited to learning textural feature representations.

7.3.2 Experimental Details

We compared the generalization performance of the following transfer processes:

1. No transfer learning (learning from scratch with randomized initial weights)


Fig. 7.3 Examples of textural images comprising the CUReT database. Top row: entire images of “Roofing Shingle,” “Salt Crystals,” and “Artificial Grass” classes. Bottom: cropped and resized images of textural regions used as input for the DCNN

Fig. 7.4 Schematic diagram of our DCNN, the same as AlexNet [1]. The DCNN acquires feature representation by repeating convolution and spatial pooling

2. Conventional feature transfer with the CUReT dataset (transfer learning with textural images)
3. Conventional feature transfer with the ILSVRC 2012 dataset (transfer learning with natural images)
4. Two-stage feature transfer learning (transfer learning with both natural and textural images)

In our experiment, we adopted AlexNet [1], which is the earliest and most straightforward DCNN and is often used as a reference for DCNN performance, e.g., by Litjens et al. [2]. Figure 7.4 illustrates its architecture. AlexNet is a purely sequential DCNN, so its intermediate representations can easily be analyzed to reveal the mechanism of feature transfer learning. We trained our DCNN using a momentum stochastic gradient descent (momentum-SGD) algorithm with a momentum of 0.9 and a dropout rate of 0.5. When the network was trained from a randomized initial state, we set the learning rate of momentum-SGD to 0.05. When the network was instead trained from a pretrained state, i.e., with feature transfer, we set a smaller learning rate of 0.0005.


Table 7.1 Classification performance comparison for the test data (± denotes the standard deviation)

Transfer     (1) None           (2) Single-stage (conventional)   (3) Single-stage (conventional)   (4) Two-stage (ours)
Accuracy     0.9445 ± 0.0018    0.9605 ± 0.0036                   0.9333 ± 0.0040                   0.9677 ± 0.0027
Precision    0.9196 ± 0.0027    0.9065 ± 0.0044                   0.9469 ± 0.0057                   0.9555 ± 0.0037
Recall       0.9378 ± 0.0023    0.9214 ± 0.0044                   0.9496 ± 0.0044                   0.9539 ± 0.0028
F1-score     0.9255 ± 0.0025    0.9112 ± 0.0041                   0.9460 ± 0.0049                   0.9527 ± 0.0035

In all conditions (1)–(4), we trained the network until the training loss plateaued, so that the network parameters converged steadily. The generalization performance was evaluated with four metrics: accuracy, recall, precision, and F1-score. To minimize the effect of occasional, extraordinarily good values during training, the 75th percentile over the learning process was used as the representative value of each evaluation metric. Each condition was evaluated n = 10 times with different random seeds, and the results were averaged.
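As a small illustration of that aggregation (with random placeholder numbers in place of the actual learning curves), the representative value of a metric for each run is the 75th percentile of its per-epoch values, and the reported score is the mean over the ten seeds:

```python
import numpy as np

# metric_curves: hypothetical array of shape (n_runs, n_epochs) holding, e.g.,
# the test F1-score measured after each training epoch for each random seed.
metric_curves = np.random.rand(10, 200)  # placeholder data

per_run = np.percentile(metric_curves, 75, axis=1)  # 75th percentile per run
mean, std = per_run.mean(), per_run.std(ddof=1)
print(f"representative metric: {mean:.4f} +/- {std:.4f}")
```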

7.3.3 Experimental Results

Experimental results are shown in Table 7.1. Condition (1), i.e., learning from scratch, serves as the baseline DCNN performance. Condition (2), i.e., feature transfer from only textural images, shows worse performance than learning from scratch (1). This result suggests that CUReT by itself is not useful as the source domain for traditional single-stage feature transfer learning. It also suggests an interesting finding: an inappropriate feature transfer can make generalization performance worse. Condition (3), i.e., the most conventional feature transfer from only a massive number of natural images, shows better performance than (1). This result dovetails with reports that transfer learning from natural images is usually effective for most medical image analyses [2]. However, in this textural case, condition (4) shows statistically significant improvements over conditions (1)–(3) (p < 0.01, non-parametric Wilcoxon signed-rank test, n = 10). The two-stage feature transfer thus worked efficiently to improve DCNN performance in a textural recognition task.
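A hedged sketch of such a significance test with SciPy, using made-up per-seed F1-scores in place of the actual paired results of two conditions:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-seed F1-scores for single-stage (natural) and two-stage transfer.
f1_single = np.array([0.941, 0.945, 0.943, 0.949, 0.946, 0.944, 0.948, 0.947, 0.942, 0.950])
f1_two    = np.array([0.951, 0.954, 0.950, 0.955, 0.953, 0.952, 0.956, 0.949, 0.957, 0.958])

stat, p = wilcoxon(f1_single, f1_two)  # paired, non-parametric signed-rank test
print(f"Wilcoxon statistic = {stat}, p-value = {p:.4f}")
```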

7.4 How Does Transfer Learning Work?

We have seen that staged feature transfer learning, including two-stage feature transfer, improves the generalization performance of DCNNs. Before applying such a learning technique to DCNNs, we should consider why the technique


works well, because, especially in medical imaging, the whole method needs to be explainable before we can trust its inferences. The mechanism of feature transfer learning from natural images has been partially clarified through intensive studies in natural image recognition and its correspondence to biological facts about our visual recognition processes. However, for textural recognition, and particularly for the less intuitive two-stage feature transfer learning, a more detailed analysis is needed to reveal the mechanism of the improvement. In this section, we try to explain how two-stage feature transfer works using two different approaches.

7.4.1 Qualitative Analysis by Visualizing Feature Representations

First, from an intuitive perspective, let us visualize the differences in feature representation between the transfer learning processes and see how they correspond to the inputs of the DCNNs. Visualizing network activations is a common and flexible way to interpret how DCNNs work. Visualization methods can be broadly classified into two types: where-based and what-based. Where-based methods reveal the spatial regions on which DCNNs concentrate during recognition, e.g., occlusion saliency [10] and Grad-CAM [11]. In contrast, what-based methods reveal which components of the extracted feature representation are essential in the feature extraction process. Currently, where-based approaches, particularly those descended from Grad-CAM, are the mainstream visualization methods because of their clear visualization results. However, a where-based approach is unsuitable for revealing saliency in textural domains, whose characteristics have no locality. Thus, we adopted a what-based visualization method called "DeSaliNet" [12], which can expose essential components that diffuse over a wide area in the higher-order representation. The main idea of DeSaliNet is, in short, to propagate the extracted feature activation back into the input space with a well-designed pseudo-inverse of the trained feature extraction mapping. As in Eq. (7.1), the feature extraction process of a DCNN can be interpreted as a deterministic map formed by composing the correspondences of the individual layers. DeSaliNet defines a backward "visualization" path, from the extracted feature representation to the input space, by the inverse correspondence of Eq. (7.1):

h^{(i)\dagger} = h^{L_1\dagger}_1 \circ \cdots \circ h^{L_i\dagger}_i,   (7.2)

where the h^{L_i\dagger}_i are layerwise pseudo-inverse correspondences associated with the forward layers h^{L_i}_i. This concept is illustrated in Fig. 7.5. DeSaliNet reveals the salient components as saliency in the input-space visualization result. Figure 7.6 shows the visualization results of the extracted features, which are the input of the classification part of AlexNet, for models (1)–(4).


Fig. 7.5 A feature visualization flow using DeSaliNet. The feature map to visualize is calculated during the forward propagation stage (right). When visualizing neuronal activations, the feature map is switched to the backward visualization path (left), which consists of inverse maps of each forward layer, and is backpropagated into the input space as a saliency image

First, let us look at model (1), learned from scratch. Unlike the other models, its visualization results (heatmap overlays) clearly show no salient activations in any region of the input. The performance metrics of model (1) do not seem bad; however, the representational ability of the DCNN was not sufficiently utilized due to the lack of training examples of DLDs. Next, consider the conventional single-stage feature transfer models, (2) and (3). Model (2), transferred from textural imagery, shows diffuse activations corresponding to flat regions where textural features appear (e.g., the bottom right of the HCM example and the whole of the CON and RET examples). Model (3), transferred from natural imagery, in contrast, shows activations in regions where edge structures appear (e.g., the pits of the CON example or the cyst wall contours of the HCM example). Intuitively, feature transfer learning enables DCNNs to grasp the feature representations that appear in the source domains. Then what does model (4), which comes from two-stage feature transfer, see? Interestingly, the visualized feature representations of model (4) respond to both the edge and the textural structures that are shown separately by models (2) and (3). This result demonstrates that DCNNs can additively obtain


the feature representation from multiple source domains through such successive feature transfer. The significant improvement brought by two-stage feature transfer occurs because the DCNN obtains a better feature representation, capturing both textural and structural characteristics, for classifying DLD patterns.

7.4.2 Numerical Analysis: Frequency Response of the Feature Extraction Part

To make our explanation more precise, we also analyzed the feature extraction process numerically and quantitatively. Classical studies on textural computer vision, based on Fourier analysis, reported that the textural components of an image appear at low frequencies near DC, whereas structural features appear at low-mid to high frequencies [13]. In accordance with this common finding, we analyzed the frequency response of DCNNs by regarding them as a two-dimensional Fourier system.3 To define the frequency response of the feature extraction of a DCNN, we denote by 1(ω, θ) the reference frequency image that has only a spatial frequency component ω and a phase component θ. The reference frequency image 1(ω, θ) is given by the inverse Fourier transform as

1(\omega, \theta) = F^{-1}[\hat{1}(r, \phi)] \,/\, \left\| F^{-1}[\hat{1}(r, \phi)] \right\|_2^2,   (7.3)

where \hat{1}(r, \phi) is a polar representation in Fourier space defined by

\hat{1}(r, \phi) = \begin{cases} 1 & (r = \omega,\ \phi = \theta) \\ 0 & (\text{otherwise}), \end{cases}   (7.4)

\|\cdot\|_2 denotes the Frobenius norm of an image, and F[\cdot] denotes the two-dimensional Fourier transform. Figure 7.7 illustrates the process of generating reference frequency images. We then define the frequency response of the extracted features as the Frobenius-norm gain with respect to the reference frequency image. Let h be the feature extraction correspondence; with the notations of Eqs. (7.2)–(7.4), the frequency response is

G(\omega) = \sum_{\theta \in [0, \pi)} \left\| \left( h \circ h^{\dagger} \right) \left( 1(\omega, \theta) \right) \right\|_2^2.   (7.5)
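The NumPy sketch below follows Eqs. (7.3)–(7.5) and Fig. 7.7: unit mass is placed at the frequency point (ω, θ) and its symmetric counterpart in a centered Fourier plane, the inverse transform is taken, the result is normalized by its squared Frobenius norm, and the norm gain is accumulated over a discretized set of angles. The callable `extract_and_invert` is a hypothetical stand-in for the trained h ∘ h† mapping of a model.

```python
import numpy as np

def reference_frequency_image(size, omega, theta):
    """Inverse FFT of a Fourier-plane indicator at frequency omega and angle theta
    (plus its symmetric point, which keeps the image real), normalized by its
    squared Frobenius norm as in Eqs. (7.3)-(7.4)."""
    spec = np.zeros((size, size), dtype=complex)
    c = size // 2
    for sign in (1, -1):
        u = c + sign * int(round(omega * np.cos(theta)))
        v = c + sign * int(round(omega * np.sin(theta)))
        spec[v, u] = 1.0
    img = np.real(np.fft.ifft2(np.fft.ifftshift(spec)))
    return img / np.sum(img ** 2)

def frequency_response(extract_and_invert, size=224, omegas=range(1, 112), n_theta=16):
    """G(omega) = sum over discretized theta in [0, pi) of the squared Frobenius norm
    of (h o h†)(1(omega, theta)), cf. Eq. (7.5)."""
    G = []
    for omega in omegas:
        gain = 0.0
        for theta in np.linspace(0.0, np.pi, n_theta, endpoint=False):
            ref = reference_frequency_image(size, omega, theta)
            gain += np.sum(extract_and_invert(ref) ** 2)
        G.append(gain)
    return np.array(G)

# Example (hypothetical): the gain of an identity mapping is flat across frequencies.
print(frequency_response(lambda x: x, size=64, omegas=range(1, 8), n_theta=4))
```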

3 Strictly speaking, DCNNs work as a basis decomposition, which literally corresponds to the convolution "kernels," rather than as Fourier systems identified in frequency space. However, because images have an isomorphic representation in Fourier space, this analysis is still meaningful.


Fig. 7.6 Visualization results of extracted feature maps from DLD images. The leftmost figures show the DCNN inputs. Each row corresponds to an input DLD image of class HCM, CON, or RET, respectively. Each column corresponds to one of the DCNN learning processes described in Sect. 7.3.2. For each visualization result, the upper image shows the normalized reconstructed input; bright regions indicate that the corresponding components of the input strongly affect the feature maps. The lower images, surrounded by a dotted line, show the saliency heatmap overlaid on the input DLD images


Fig. 7.7 Mechanism for generating the reference frequency image. In 2-dimensional Fourier space, set the value 1 to two opposite points on the circle of radius ω, which are (ω, θ) and (ω, −θ) in polar coordinates. The reference frequency image is given as its inverse Fourier transform. In this example, ω = 5 and θ = π/4

Fig. 7.8 The frequency response of each feature transfer model. Red-dotted, blue-dotted, and green-solid lines represent feature transfer models (2), (3), and (4), respectively, as described in Sect. 7.3.2. The two-stage feature transfer model (4) has the peaks that appear in both models (2) and (3)

Equation (7.5) represents how salient the fixed-norm input 1(ω, θ) is in the feature extraction layer; thus, this metric is suitable for evaluating the frequency response. Figure 7.8 shows the frequency responses of feature transfer models (2)–(4). A salient frequency that strongly affects the feature representation produces a response peak in the plot. To emphasize the peak structures, each response was normalized to the interval [0, 1] and smoothed with a second-order Savitzky-Golay filter. Model (2), transferred from textural images, has peak responses at low frequencies near DC (ω ≈ 0 [Hz]) and at mid frequencies (ω = 20–40 [Hz]), which are known to be essential for textural images [13]. Model (3), transferred from natural images, has strong peak responses at low frequencies near 10 [Hz] and at mid-high frequencies


(ω ≈ 70 [Hz]). As argued in the previous section, model (4), transferred from both textural and natural images, has a peak response from low frequencies near DC to mid frequencies like model (2); however, it also has a peak at mid-high frequencies near 70 [Hz], similar to model (3). This result accords with the visualization result that the two-stage feature transfer model can additively obtain both textural and structural feature representations from multiple source domains. Although a Fourier perspective is not a strictly rigorous way to analyze DCNNs, such an analysis of the frequency response of the feature extraction correspondence can be useful for characterizing a model. From the viewpoint of finding DLDs in HRCT opacities, both textural and edge structures are important criteria. The numerical analysis thus also revealed that two-stage feature transfer with natural and textural images enables both textural and edge structures to be grasped, which contributes to improving the performance of DCNNs.

7.5 Conclusion

This chapter explained that textural analysis with DCNNs, which are powerful image recognition models, is much harder than structural analysis because of the recognition mechanism of DCNNs. Thus, special tricks are needed to fully exploit the great ability of DCNNs in textural recognition tasks. We introduced a novel transfer learning approach called two-stage feature transfer learning, which pretrains entire DCNNs with massive numbers of natural images and textural images successively. In the experiment, we applied two-stage feature transfer to a DLD classification task on HRCT imagery, a typical textural medical image analysis task, and compared its generalization performance with the cases of no transfer learning and conventional single-stage feature transfer learning. In the experimental results, two-stage feature transfer learning significantly outperformed the other approaches. Moreover, by qualitatively and numerically analyzing the differences in the feature representations obtained by the DCNNs, we showed that two-stage feature transfer enables DCNNs to grasp both structural and textural perspectives and thus perform efficiently in DLD recognition. Two-stage feature transfer could be used to bring out the maximum ability of DCNNs in many applications of textural medical image analysis.

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
2. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)


3. Fukushima, K.: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36(4), 193–202 (1980)
4. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160(1), 106–154 (1962)
5. Suzuki, A., Sakanashi, H., Kido, S., Shouno, H.: Feature representation analysis of deep convolutional neural network using two-stage feature transfer - an application for diffuse lung disease classification. IPSJ Trans. Math. Model. Its Appl. 100–110 (2018)
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
7. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)
8. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017)
9. Uchiyama, Y., Katsuragawa, S., Abe, H., Shiraishi, J., Li, F., Li, Q., Zhang, C.T., Suzuki, K., et al.: Quantitative computerized analysis of diffuse lung disease in high-resolution computed tomography. Med. Phys. 30(9), 2440–2454 (2003)
10. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps (2013). arXiv:1312.6034
11. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: ICCV, pp. 618–626 (2017)
12. Mahendran, A., Vedaldi, A.: Salient deconvolutional networks. In: European Conference on Computer Vision, pp. 120–135. Springer (2016)
13. Julesz, B., Caelli, T.: On the limits of Fourier decompositions in visual texture perception. Perception 8(1), 69–73 (1979)

Chapter 8

Anatomical-Landmark-Based Deep Learning for Alzheimer's Disease Diagnosis with Structural Magnetic Resonance Imaging

Mingxia Liu, Chunfeng Lian and Dinggang Shen

Abstract Structural magnetic resonance imaging (sMRI) has been widely used in computer-aided diagnosis of brain diseases, such as Alzheimer's disease (AD) and its prodromal stage, i.e., mild cognitive impairment (MCI). Based on sMRI data, anatomical-landmark-based deep learning has been recently proposed for AD and MCI diagnosis. These methods usually first locate informative anatomical landmarks in brain sMR images, and then integrate both feature learning and classification training into a unified framework. This chapter presents the latest anatomical-landmark-based deep learning approaches for automatic diagnosis of AD and MCI. Specifically, an automatic landmark discovery method is first introduced to identify discriminative regions in brain sMR images. Then, a landmark-based deep learning framework is presented for AD/MCI classification, by jointly performing feature extraction and classifier training. Experimental results on three public databases demonstrate that the proposed framework boosts the disease diagnosis performance, compared with several state-of-the-art sMRI-based methods.

8.1 Introduction

Brain morphometric pattern analysis using structural magnetic resonance imaging (sMRI) data is effective in distinguishing anatomical differences between patients with Alzheimer's disease (AD) and normal controls (NCs). It is also effective in evaluating the progression of mild cognitive impairment (MCI), a prodromal stage of AD. In the literature, there are extensive sMRI-based approaches proposed to


Fig. 8.1 Illustration of structural MRI biomarkers for AD/MCI diagnosis shown in a local-to-global manner, including (1) voxel-level measure, (2) patch-level measure, (3) ROI-level measure, and (4) whole-image-level measure. ROI: region of interest

help clinicians interpret structural changes of the brain. Many existing methods were developed for fundamental sMR image analysis (e.g., atlas propagation [1] and anatomical landmark detection [2]), while others focused on computer-aided diagnosis of AD and MCI [3–14].

To facilitate automatic brain disease diagnosis, previous studies have derived different kinds of biomarkers (measures/features) from sMRI, e.g., volume and shape measurements [4–6, 15], cortical thickness [7, 8, 10], and gray matter tissue density maps [3]. From local to global scales (see Fig. 8.1), these measures can be roughly categorized into four classes, i.e., (1) voxel-level, (2) patch-level, (3) region-of-interest (ROI) level, and (4) whole-image-level features [16]. Specifically, voxel-level features aim to identify brain structural changes by directly measuring the local tissue (e.g., gray matter, white matter, and cerebrospinal fluid) density of the brain via voxel-wise analysis. However, voxel-level measurements are typically of very high dimension (e.g., in the millions), leading to a high over-fitting risk for subsequent learning models [17]. Different from voxel-level measures, ROI-level features of sMRI attempt to model structural changes of the brain within pre-defined ROIs. However, the definition of ROIs usually requires a prior hypothesis on the abnormal regions from a structural/functional perspective, and hence expert knowledge in practice [18, 19]. Also, an abnormal region might be only a small part of a pre-defined ROI or might span multiple ROIs, thereby leading to the loss of discriminative information. Different from voxel- and patch-level methods, whole-image-level features evaluate brain abnormalities by regarding each sMRI as a whole [20], thus ignoring the local structural information of the image. It is worth noting that the appearance of brain sMR images is often globally similar and locally different, and previous studies have shown that the early stage of AD induces structural changes only in small local regions rather than in isolated voxels or the whole brain. Hence, sMRI features defined at the voxel level, ROI level, or whole-image level may not be informative for identifying early AD-related structural changes of the brain.

Recently, patch-level features (an intermediate scale between the voxel level and the ROI level) have been developed to represent sMR images, showing an advantage in distinguishing AD/MCI patients from NCs [14, 21, 22]. A common challenge for patch-level approaches is how to select discriminative patches from the tens of thousands of patches in each sMR image, as not all image patches are affected by dementia. Moreover, most of the existing patch-level


representations (e.g., intensity values and/or morphological features) are engineered and empirically pre-defined, and are thus typically independent of the subsequent classifier learning [14, 23, 24]. Due to possible heterogeneity between features and classifiers, pre-defined features may lead to sub-optimal learning performance for brain disease diagnosis. In addition, the global information of each sMR image cannot be fully captured by using only local image patches.

In summary, there are at least three key challenges in patch-based approaches: (1) how to select informative image patches efficiently, (2) how to model both the local patch-level and the global image-level information of each brain sMRI, and (3) how to integrate feature learning and classifier training into a unified framework. To address these three challenges, an anatomical-landmark-based deep learning framework has recently been proposed [2, 14, 25] for AD/MCI diagnosis, which is the focus of this chapter. This kind of method first identifies anatomical landmarks in brain sMRIs via statistical group comparison, where these landmarks are defined as discriminative locations between different groups (e.g., ADs vs. NCs). It then jointly performs feature extraction and classification via hierarchical deep neural networks, in which both the local and the global structural information of sMRIs are explicitly modeled for brain disease diagnosis.

The rest of this chapter is organized as follows. Section 8.2 introduces the materials and image pre-processing used in this chapter. Section 8.3 presents an anatomical landmark discovery method for brain sMR images. Section 8.4 presents a landmark-based deep network for the automatic diagnosis of AD and MCI. Section 8.5 introduces the experiments and corresponding analyses. Section 8.6 analyzes the influence of several essential parameters, the limitations of the current framework, and possible future research directions. Finally, the chapter is concluded in Sect. 8.7.

8.2 Materials and Image Pre-Processing

Three public datasets are used in the experiments: the Alzheimer's Disease Neuroimaging Initiative-1 (ADNI-1) [26], ADNI-2, and the MIRIAD (Minimal Interval Resonance Imaging in Alzheimer's Disease) [27] datasets. Subjects in the baseline ADNI-1 and ADNI-2 datasets have 1.5T and 3T T1-weighted structural MRI data, respectively. There is a total of 821 subjects, including 199 AD, 229 NC, 167 pMCI, and 226 sMCI subjects, in the baseline ADNI-1 dataset. The ADNI-2 dataset contains 636 subjects, i.e., 159 AD, 200 NC, 38 pMCI, and 239 sMCI subjects. The definitions of pMCI and sMCI in both ADNI-1 and ADNI-2 are based on whether MCI subjects converted to AD within 36 months after the baseline time. It is worth noting that many subjects in ADNI-1 also participated in ADNI-2; for independent testing, subjects that appear in both ADNI-1 and ADNI-2 are removed from ADNI-2. The baseline MIRIAD dataset includes 1.5T T1-weighted sMR images from 23 NC and 46 AD subjects [27]. The demographic information of the subjects used in the experiments is shown in Table 8.1.


Table 8.1 Demographic and clinical information of subjects in the three datasets. Values are reported as Mean ± Standard Deviation (Std); Edu: education years; MMSE: mini-mental state examination; CDR-SB: sum-of-boxes of clinical dementia rating

Datasets      Category  Male/female  Age            Edu            MMSE           CDR-SB
ADNI-1 [26]   AD        106/93       75.30 ± 7.50   14.72 ± 3.14   23.30 ± 1.99   4.34 ± 1.61
              pMCI      102/65       74.79 ± 6.79   15.69 ± 2.85   26.58 ± 1.71   1.85 ± 0.94
              sMCI      151/75       74.89 ± 7.63   15.56 ± 3.17   27.28 ± 1.77   1.42 ± 0.78
              NC        127/102      75.85 ± 5.03   16.05 ± 2.87   29.11 ± 1.00   0.03 ± 0.12
ADNI-2 [26]   AD        91/68        74.24 ± 7.99   15.86 ± 2.60   23.16 ± 2.21   4.43 ± 1.75
              pMCI      24/14        71.27 ± 7.28   16.24 ± 2.67   26.97 ± 1.66   2.24 ± 1.26
              sMCI      134/105      71.66 ± 7.56   16.20 ± 2.69   28.25 ± 1.62   1.20 ± 0.78
              NC        113/87       73.47 ± 6.25   16.51 ± 2.54   29.03 ± 1.27   0.05 ± 0.23
MIRIAD [27]   AD        19/27        69.95 ± 7.07   –              19.20 ± 4.01   –
              NC        12/11        70.36 ± 7.28   –              29.39 ± 0.84   –

All brain sMR images of the studied subjects are processed using a standard pipeline [25, 28]. Specifically, the MIPAV software (http://mipav.cit.nih.gov/index.php) is first used to perform anterior commissure (AC)–posterior commissure (PC) correction for each sMR image. Then, each image is resampled to a resolution of 256 × 256 × 256, followed by the N3 algorithm [29] to correct the intensity inhomogeneity of the images. Skull stripping and manual editing are further performed to ensure that both skull and dura are cleanly removed. Finally, the cerebellum is removed by warping a labeled template to each skull-stripped image.

8.3 Anatomical Landmark Discovery for Brain sMRIs

To extract informative patches for feature learning and classifier training, a data-driven landmark discovery algorithm [2] is proposed to locate discriminative image patches in brain sMRIs. The goal is to identify landmarks with statistically significant group differences between AD patients and NC subjects in the local structures of sMRIs. An illustration of the anatomical landmark discovery method is shown in


Fig. 8.2 Diagram of the anatomical landmark discovery method for brain sMRIs based on group comparison between the Alzheimer's disease (AD) patient group and the normal control (NC) group [2], including four sequential steps: (1) linear registration, (2) nonlinear registration, (3) feature extraction, and (4) group comparison between the AD and NC groups to generate a voxel-wise p-value map

Fig. 8.2. Specifically, a group comparison between AD and NC is first performed on the training images to discover anatomical landmarks that can differentiate AD from NC. Both linear and nonlinear registration are used to locate corresponding voxels across all training images. Then, a statistical method (i.e., Hotelling's T2 statistic [30]) is used to perform a voxel-wise group comparison. Finally, a p-value map obtained from the group comparison is used to identify AD-related landmarks.

8.3.1 Generation of Voxel-Wise Correspondence

Using the Colin27 template [31], a linear registration is first performed to remove the global translation, scale, and rotation differences of the sMR images, and to resample all images to the same spatial resolution (i.e., 1 × 1 × 1 mm3) as the template image. Since these linearly-aligned images are not voxel-wise comparable, nonlinear registration is further performed for spatial normalization [32]. After spatial normalization, the warped images lie in the same stereotactic space as the template image. The nonlinear registration step creates a deformation field for each subject, which estimates highly nonlinear deformations that are local to specific regions of the brain. Based on the deformation field of each sMR image obtained in nonlinear registration, one can build the correspondence between voxels in the template and each linearly-aligned image (e.g., see the pink lines and red circles in Fig. 8.2). For instance, for a voxel (x, y, z) in the template image, one can find its corresponding voxel (x + dx, y + dy, z + dz) in a specific linearly-aligned image, where (dx, dy, dz) is the displacement from the template image to the linearly-aligned image defined by the deformation field.
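A small NumPy sketch of this lookup, assuming (hypothetically) that the deformation field is stored as an array of shape (X, Y, Z, 3) holding the (dx, dy, dz) displacements from the template to a linearly-aligned image:

```python
import numpy as np

def map_template_voxel(voxel, deformation_field):
    """Map a template-space voxel (x, y, z) to its corresponding location in a
    linearly-aligned image using the subject's deformation field."""
    x, y, z = voxel
    dx, dy, dz = deformation_field[x, y, z]
    return (x + dx, y + dy, z + dz)

# Example with a dummy zero field (template size is a hypothetical placeholder).
field = np.zeros((181, 217, 181, 3))
print(map_template_voxel((90, 108, 90), field))  # -> (90.0, 108.0, 90.0)
```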


8.3.2 Voxel-Wise Comparison Between Different Groups

With the linearly-aligned images (with cross-subject voxel-wise correspondence), local morphological features can be extracted to identify local morphological patterns that show statistically significant between-group differences. Here, the linearly-aligned images (rather than the nonlinearly-aligned images) are used for feature extraction. The reason is that, after a linear registration that only normalizes the global shapes and scales of brain sMRIs, the linearly-aligned images still preserve the internal local differences and distinct local structures of the brain. In contrast, the warped images after nonlinear registration are very similar to each other, so the between-group morphological differences of interest would be less significant. To take advantage of the context information conveyed by neighboring voxels, a cubic patch (with a size of 15 × 15 × 15) centered at a specific voxel is extracted from each linearly-aligned image to compute statistics of morphological features. Specifically, oriented energies [33], which are invariant to local inhomogeneity, are extracted as morphological features for each patch. Also, a bag-of-words strategy [34] is used for vector quantization to obtain histogram features with a relatively low feature dimension. For each patch centered at a voxel in the template, one can extract two groups of morphological features from the training images in the AD and NC groups, respectively. The dimensionality of the morphological features is 50, which is defined by the number of clusters used in the bag-of-words method. Finally, Hotelling's T2 statistic [30] is used for the group comparison between the AD and NC groups. As a result, a p-value map corresponding to all voxels in the template space can be obtained.
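As a sketch of the voxel-wise test, the standard two-sample Hotelling's T2 statistic and its F-distribution p-value can be computed as follows; the 50-dimensional inputs stand for the bag-of-words histograms described above, and the random data are placeholders.

```python
import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2_pvalue(group_a, group_b):
    """Two-sample Hotelling's T^2 test.
    group_a: (n1, p) features at one voxel for one group (e.g., AD);
    group_b: (n2, p) features at the same voxel for the other group (e.g., NC)."""
    n1, p = group_a.shape
    n2 = group_b.shape[0]
    diff = group_a.mean(axis=0) - group_b.mean(axis=0)
    # Pooled covariance estimate
    s_pooled = ((n1 - 1) * np.cov(group_a, rowvar=False) +
                (n2 - 1) * np.cov(group_b, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s_pooled, diff)
    # Transform to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return f_dist.sf(f_stat, p, n1 + n2 - p - 1)

# Placeholder example with group sizes similar to the AD/NC groups of ADNI-1.
rng = np.random.default_rng(0)
print(hotelling_t2_pvalue(rng.normal(0.2, 1, (199, 50)), rng.normal(0.0, 1, (229, 50))))
```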

8.3.3 Definition of AD-Related Anatomical Landmarks

Based on the obtained p-value map, AD-related anatomical landmarks can be identified from all voxels in the template. Specifically, voxels in the template with p-values smaller than 0.01 are regarded as potential locations showing statistically significant between-group differences. To avoid large redundancy, only the local minima (whose p-values are also smaller than 0.01) in the p-value map are defined as AD-related landmarks in the template. A total of 1,741 anatomical landmarks (defined in the template) are identified from the AD and NC groups in the ADNI-1 dataset, as shown in Fig. 8.3a. In this figure, the anatomical landmarks are ranked according to their discriminative capability (i.e., p-value in the group comparison) in distinguishing AD patients from NCs; that is, a small p-value indicates a strong discriminative power, and vice versa. Using the deformation field estimated by nonlinear registration, the landmarks for each training image can be computed by mapping the landmarks from the template image to the corresponding linearly-aligned sMR image.


Fig. 8.3 Illustration of a all identified AD-related anatomical landmarks from AD and NC subjects in ADNI-1, and b the selected top 50 landmarks. Different colors denote p-values in the group comparison between the AD and NC groups in ADNI-1, where a small p-value indicates a strong discriminative power and vice versa

As shown in Fig. 8.3a, many landmarks are spatially close to each other, and thus image patches centered at these landmarks would overlap. In consideration of this information redundancy, besides the p-values of the landmarks, a spatial Euclidean distance threshold (i.e., 20) is further used as a criterion to control the distance between landmarks and reduce the overlap among image patches. This process yields a subset of the 1,741 identified landmarks, with the top 50 shown in Fig. 8.3b. From Fig. 8.3b, one can observe that many landmarks are located in the bilateral hippocampal, parahippocampal, and fusiform areas, which have been reported to be related to AD in previous studies [5, 35]. Here, MCI subjects (including pMCI and sMCI) share the same landmark pool as that identified from the AD and NC groups. The assumption is that, since MCI is the prodromal stage of AD, landmarks with group differences between AD and NC subjects are potential atrophy locations in the sMRIs of MCI subjects.
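A sketch of this selection rule under the stated thresholds (p < 0.01, local minima of the p-value map, and a pairwise distance of at least 20 voxels), using a greedy pass ordered by discriminative power; the exact procedure in [2] may differ in detail.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def select_landmarks(p_map, p_thresh=0.01, min_dist=20):
    """Pick local minima of the voxel-wise p-value map as candidate landmarks and
    greedily keep the most significant ones that are at least min_dist apart."""
    is_local_min = (p_map == minimum_filter(p_map, size=3)) & (p_map < p_thresh)
    coords = np.argwhere(is_local_min)
    coords = coords[np.argsort(p_map[is_local_min])]  # smallest p-value first
    kept = []
    for c in coords:
        if all(np.linalg.norm(c - k) >= min_dist for k in kept):
            kept.append(c)
    return np.array(kept)

# Placeholder p-value map, just to show the call.
p_map = np.random.default_rng(0).random((60, 72, 60))
print(select_landmarks(p_map).shape)
```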

8.3.4 Landmark Detection for Unseen Testing Subjects

To quickly detect landmarks in unseen testing images, a regression-forest-based method is developed based on the landmarks identified in the training images [2]. Specifically, in the training stage, a multi-variate regression forest is used to learn a nonlinear mapping between the patch centered at each voxel and its 3D displacement to a target landmark (see Fig. 8.4a) in the linearly-aligned image, where the mean variance of the targets is used as the splitting criterion. Centered at each voxel, an image patch is extracted to compute the same morphological features as those used for AD landmark identification. In the regression forest, patch-level features from the training linearly-aligned images are used as the input data (see the green squares in Fig. 8.4b), while the 3D displacement (see the yellow lines in Fig. 8.4b) from each patch to each target landmark (in the linearly-aligned image space) is treated as the output.


Fig. 8.4 Illustration of the regression forest based landmark detection method. a Definition of displacement from a voxel (green square) to a target landmark (red circle). b Regression voting based on multiple patches. c Voting map for the landmark location (yellow denotes a large probability and purple represents a small probability)

In the testing stage, the learned regression forest is used to estimate the 3D displacement from each voxel in the testing image to a potential landmark position, based on the local morphological features extracted from the patch centered at that voxel. Since a regression forest consists of several trees, the mean prediction over all trees is used as the final estimate for a specific landmark. Using the estimated 3D displacement, each voxel casts one vote for the potential landmark position. By aggregating the votes from all voxels (see Fig. 8.4b), a voting map is obtained (see Fig. 8.4c), from which the to-be-estimated landmark is defined as the location with the maximum number of votes. For a new testing image, the landmark locations can thus be computed quickly via the trained regression forest, without any time-consuming nonlinear registration.
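A simplified scikit-learn sketch of the training and voting scheme for a single landmark; all feature matrices, displacements, and sizes below are random placeholders standing in for the 50-dimensional patch descriptors and displacement targets described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Placeholder training data: 50-dim patch features and 3-D displacements to a landmark.
train_patch_features = rng.random((500, 50))
train_displacements = rng.normal(scale=10.0, size=(500, 3))

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(train_patch_features, train_displacements)

def detect_landmark(forest, test_voxels, test_patch_features, volume_shape):
    """Each voxel votes for the landmark position predicted by its displacement;
    the position with the most votes is taken as the detected landmark."""
    votes = np.zeros(volume_shape)
    displacements = forest.predict(test_patch_features)            # (M, 3)
    targets = np.round(test_voxels + displacements).astype(int)    # voted positions
    for t in targets:
        if np.all(t >= 0) and np.all(t < np.array(volume_shape)):
            votes[tuple(t)] += 1
    return np.unravel_index(np.argmax(votes), volume_shape)

# Placeholder testing data: voxel coordinates and their patch features.
test_voxels = rng.integers(20, 80, size=(200, 3))
test_patch_features = rng.random((200, 50))
print(detect_landmark(forest, test_voxels, test_patch_features, (100, 100, 100)))
```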

8.4 Landmark-Based Deep Network for Disease Diagnosis

A landmark-based deep multi-instance learning (LDMIL) framework is developed for brain disease diagnosis. Different from previous patch-based studies [21, 36], this method locates discriminative image patches via anatomical landmarks, without requiring any pre-defined engineered features for the image patches. This is particularly meaningful for medical imaging applications, where annotating discriminative regions in the brain and extracting discriminative features from sMRIs often require clinical expertise and incur high cost. Also, this method is capable of capturing both the local information of image patches and the global information of each sMRI, by hierarchically learning local-to-global representations for sMR images layer by layer via deep neural networks. Figure 8.5 shows an illustration of the LDMIL framework. Based on anatomical landmarks identified via group comparison between AD and NC subjects, multiple


Fig. 8.5 Illustration of the landmark-based deep multiple instance learning (LDMIL) framework using sMRI data, including (1) sMR image processing, (2) anatomical landmark detection, (3) landmark-based instance/patch extraction (with each bag corresponding to the combination of L local instances/patches extracted from each image), and (4) multi-instance convolutional neural network for disease classification (with bags as the input and subject-level labels as the output)

image patches (i.e., instances) are extracted from each brain sMR image. Then, a multi-instance convolutional neural network (MICNN) is developed to jointly learn patch-based features and an automatic disease classification model. The detailed processes of the anatomical-landmark-based deep network for brain disease diagnosis are described in the following.

8.4.1 Landmark-Based Patch Extraction

Based on the top L identified anatomical landmarks, multiple patches (i.e., instances) are extracted from each sMR image (i.e., bag), as shown in Fig. 8.6. Here, image patches of size 24 × 24 × 24 centered at each landmark location are extracted. Given L landmarks, L patches from an sMR image (as a bag) are generated to represent each sMRI/subject. That is, the combination of these L patches is regarded as one sample for the subsequent CNN model. Besides, multiple patches centered at each landmark location are further sampled with displacements within a 5 × 5 × 5 cube (with a step size of 1), aiming to suppress the impact of registration errors as well as to augment the training samples. Centered at each landmark location, 125 patches can thus be generated from each sMRI. As a result, a total of 125^L samples/bags can theoretically be extracted from each sMR image, with each bag containing L image patches.
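A sketch of the bag construction under these settings (half-size 12 for 24 × 24 × 24 patches and a ±2-voxel jitter for the 5 × 5 × 5 sampling cube); the volume and landmark coordinates are dummy placeholders.

```python
import numpy as np

def extract_bag(volume, landmarks, half=12, jitter=2, rng=None):
    """Build one bag: for each of the L landmarks, crop a (2*half)^3 patch centered
    at the landmark after a random shift inside a (2*jitter+1)^3 cube."""
    rng = rng or np.random.default_rng()
    bag = []
    for (x, y, z) in landmarks:
        dx, dy, dz = rng.integers(-jitter, jitter + 1, size=3)
        cx, cy, cz = x + dx, y + dy, z + dz
        patch = volume[cx - half:cx + half, cy - half:cy + half, cz - half:cz + half]
        bag.append(patch)
    return np.stack(bag)  # shape: (L, 24, 24, 24)

# Example with a dummy volume and two hypothetical landmark locations.
vol = np.zeros((181, 217, 181), dtype=np.float32)
bag = extract_bag(vol, [(90, 100, 90), (60, 120, 80)])
print(bag.shape)  # (2, 24, 24, 24)
```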


Fig. 8.6 A 2D illustration of the landmark-based sample generation strategy. Each sample of the network is a specific combination of L patches extracted from the L landmark locations. Since the patches centered at each landmark location are further randomly sampled with displacements within a 5 × 5 × 5 cube (with a step size of 1), there are theoretically a total of 125^L samples (i.e., all combinations of local image patches) for representing each sMR image

8.4.2 Multi-instance Convolutional Neural Network

Since not all image patches extracted from an sMR image are affected by dementia, the class labels of individual image patches can be ambiguous. To this end, a multi-instance CNN (MICNN) model is developed for AD-related brain disease diagnosis, with a schematic diagram shown in Fig. 8.7. Given an input sMR image, the input of MICNN is a bag containing the L patches (i.e., instances) extracted from the L landmark locations. To learn feature representations of the individual patches in the bag, multiple sub-CNN architectures are first developed in the MICNN. This architecture uses a bag of L instances as the input, corresponding to L landmark locations in the brain, and produces patch-level representations for each individual sMR image. More specifically, L parallel sub-CNN architectures are embedded, each with a series of 6 convolutional layers (i.e., Conv1, Conv2, Conv3, Conv4, Conv5, and Conv6) and 2 fully-connected (FC)

Fig. 8.7 Illustration of the landmark-based multi-instance convolutional neural network (MICNN), including L sub-CNN architectures corresponding to L landmarks. Given an input sMR image, the input data of the deep model are L local image patches extracted from L landmark locations


layers (i.e., FC7 and FC8). The rectified linear unit (ReLU) activation function is used in the convolutional layers, and Conv2, Conv4, and Conv6 are each followed by a max-pooling procedure for down-sampling. Because the structural changes caused by dementia can be subtle and distributed across multiple brain regions, one or a few patches alone cannot provide enough information to represent the global structural changes of the brain. This is different from conventional multiple-instance learning, where the image class can be derived from the estimated label of the most discriminative patch [37, 38]. Hence, besides the patch-level representations learned by the L sub-CNNs, bag-level (i.e., whole-image-level) representations are further learned for each sMR image. Specifically, the patch-level representations (i.e., the output feature maps of FC7) are first concatenated at the FC8 layer, followed by 3 fully-connected layers (i.e., FC9, FC10, and FC11). These additional fully-connected layers are expected to capture the complex relationships among the patches located by the landmarks and thus form a global, whole-image-level representation of the brain. Finally, the output of FC11 is fed to a soft-max output layer to predict the probability of the input sMRI belonging to a specific category (e.g., AD or NC).

Let the training set be X = \{X_n\}_{n=1}^N, which contains N bags with the corresponding labels y = \{y_n\}_{n=1}^N. The bag of the n-th training image X_n consists of L instances, defined as X_n = \{x_{n,1}, x_{n,2}, \ldots, x_{n,L}\}. As shown in Fig. 8.5, the bags corresponding to all training images are the basic training samples for the MICNN model, and the labels of these bags are the bag-level (i.e., subject-level) labels. The subject-level label information (i.e., y) is used in a back-propagation procedure for learning the most relevant features in the fully-connected layers and for updating the network weights in the convolutional layers. MICNN aims to learn a nonlinear mapping from X to y by minimizing the following loss function:

Loss(W) = \sum_{X_n \in X} -\log\left( P(y_n \mid X_n; W) \right),   (8.1)

where P(y_n | X_n; W) indicates the probability of the bag X_n being correctly classified as class y_n using the network coefficients W. In summary, the MICNN architecture is a patch-based classification model in which local-to-global feature representations are learned for each sMR image. That is, patch-level representations are first learned via the multiple sub-CNN architectures corresponding to the multiple landmarks, to capture the local structural information at different parts of the brain. The global information conveyed by the multiple landmarks is then modeled via the additional fully-connected layers, to represent the brain structure at the whole-image level. Hence, both local and global features of brain sMRIs are incorporated into the classifier learning process. The MICNN is optimized using the stochastic gradient descent (SGD) algorithm [39], with a momentum coefficient of 0.9 and a learning rate of 10^-2. The weight updates are performed with mini-batches of 30 samples. In addition, the network is implemented on a computer with a single GPU (i.e., NVIDIA GTX TITAN, 12 GB) using the TensorFlow platform [40].


Here, 10% of the subjects are randomly selected from ADNI-1 as the validation data, while the remaining subjects in ADNI-1 are used as the training data. Given a patch size of 24 × 24 × 24 and L = 40, the training time for MICNN is about 27 h, while the testing time for a new sMRI is less than 1 s.
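A compact PyTorch sketch of this architecture is given below. It is not the authors' TensorFlow implementation: the channel widths and fully-connected sizes are illustrative assumptions, and only the overall layout (L parallel sub-CNNs on 24 × 24 × 24 patches with pooling after every second convolution, concatenation, three fusion FC layers, and a subject-level cross-entropy loss as in Eq. (8.1)) follows the description above.

```python
import torch
import torch.nn as nn

class SubCNN(nn.Module):
    """Patch-level branch: 6 conv layers (pooling after every second one) + 2 FC layers."""
    def __init__(self, out_dim=64):
        super().__init__()
        chans = [1, 16, 16, 32, 32, 64, 64]   # illustrative widths
        layers = []
        for i in range(6):
            layers += [nn.Conv3d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU()]
            if i % 2 == 1:
                layers.append(nn.MaxPool3d(2))   # 24 -> 12 -> 6 -> 3
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 3 ** 3, 128),
                                nn.ReLU(), nn.Linear(128, out_dim), nn.ReLU())

    def forward(self, x):
        return self.fc(self.conv(x))

class MICNN(nn.Module):
    """Bag-level model: L patch branches, concatenated and fused by 3 FC layers."""
    def __init__(self, num_landmarks=40, num_classes=2):
        super().__init__()
        self.branches = nn.ModuleList([SubCNN() for _ in range(num_landmarks)])
        self.fusion = nn.Sequential(nn.Linear(64 * num_landmarks, 256), nn.ReLU(),
                                    nn.Linear(256, 64), nn.ReLU(),
                                    nn.Linear(64, num_classes))

    def forward(self, bag):                     # bag: (B, L, 1, 24, 24, 24)
        feats = [b(bag[:, i]) for i, b in enumerate(self.branches)]
        return self.fusion(torch.cat(feats, dim=1))

model = MICNN(num_landmarks=40)
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()  # negative log-likelihood of subject-level labels, cf. Eq. (8.1)

with torch.no_grad():            # shape check on a random placeholder bag
    print(model(torch.randn(1, 40, 1, 24, 24, 24)).shape)  # torch.Size([1, 2])
```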

8.5 Experiments

This section first introduces the methods used for comparison and the experimental settings, and then presents the AD/MCI classification results achieved by the different methods.

8.5.1 Methods for Comparison

The LDMIL approach is compared with three state-of-the-art methods for sMRI-based AD/MCI diagnosis: (1) an ROI-based method (ROI), (2) voxel-based morphometry (VBM) [3], and (3) conventional landmark-based morphometry (CLM) [2]. LDMIL is further compared with its single-instance variant, the landmark-based deep single-instance learning (LDSIL) method.

(1) ROI-based method (ROI): Similar to several previous works [41–43], ROI-specific features are extracted from the pre-processed sMR images. Specifically, the brain is first segmented into three tissue types, i.e., gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), using FAST [44] in the FSL software package (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki). The anatomical automatic labeling (AAL) atlas [45], with 90 pre-defined ROIs in the cerebrum, is then aligned to the native space of each subject using a deformable registration algorithm [46]. Finally, the volumes of GM tissue inside those 90 ROIs are extracted as the feature representation of each sMR image. Here, the GM volumes are normalized by the total intracranial volume, which is estimated as the sum of the GM, WM, and CSF volumes over all ROIs. Using these 90-dimensional ROI features, a linear support vector machine (SVM) is trained with the parameter C = 1 for classification.

(2) Voxel-based morphometry (VBM) method [3]: All sMR images are first spatially normalized to the same template image using a nonlinear image registration technique, followed by GM extraction from the normalized images. The local tissue (i.e., GM) density of the brain is directly measured in a voxel-wise manner, and a group comparison is performed using a t-test to reduce the dimensionality of the high-dimensional features. Similar to the ROI-based method, these voxel-based features are fed to a linear SVM for classification.

(3) Conventional landmark-based morphometry (CLM) method [2] with engineered feature representations: As a landmark-based method, CLM shares the same landmark pool as the LDMIL method. Different from LDMIL, CLM adopts engineered

8.5.1 Methods for Comparison The LDMIL approach is compared with three state-of-the-art methods for sMRIbased AD/MCI diagnosis, including (1) ROI-based method (ROI), (2) voxel-based morphometry (VBM) [3], and (3) conventional landmark-based morphometry (CLM) [2]. Also, LDMIL is further compared with its single-instance variant, called the landmark-based deep single-instance learning (LDSIL) method. (1) ROI based method (ROI): Similar to several previous works [41–43], ROIspecific features are extracted from the pre-processed sMR images. Specifically, the brain is first segmented into three different tissue types, i.e., gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), using FAST [44] in the FSL software package.2 The anatomical automatic labeling (AAL) atlas [45] with 90 pre-defined ROIs in the cerebrum is then aligned to the native space of each subject using a deformable registration algorithm [46]. Finally, volumes of GM tissue inside those 90 ROIs are extracted as feature representation for each sMR image. Here, the volumes of GM tissue are normalized by the total intracranial volume, which is estimated by the summation of GM, WM, and CSF volumes from all ROIs. Using these 90-dimensional ROI features, a linear support vector machine (SVM) is trained with the parameter C = 1 for classification. (2) Voxel-based morphometry (VBM) method [3]: All sMR images are first spatially normalized to the same template image using a nonlinear image registration technique, followed by GM extraction from the normalized images. The local tissue (i.e., GM) density of the brain is directly measured in a voxel-wise manner, and a group comparison is performed using t-test to reduce the dimensionality of the highdimensional features. Similar to the ROI-based method, those voxel-based features are fed to a linear SVM for classification. (3) Conventional landmark-based morphometry (CLM) method [2] with engineered feature representations: As a landmark-based method, CLM shares the same landmark pool with the LDMIL method. Different from LDMIL, CLM adopts engi2 http://fsl.fmrib.ox.ac.uk/fsl/fslwiki.


features for representing the patches around each landmark. Specifically, CLM first extracts morphological features from a local patch centered at each landmark and then concatenates the features from the multiple landmarks. Finally, the normalized features are fed into a linear SVM classifier.

(4) Landmark-based deep single-instance learning (LDSIL): The architecture of LDSIL is similar to a sub-CNN in LDMIL (see Fig. 8.7), containing 6 convolutional layers and 3 fully-connected layers. Specifically, LDSIL learns one CNN model for each specific landmark, with the patches extracted from that landmark as the input and the subject-level class label as the output. Given L landmarks, L CNN models are learned independently via LDSIL, generating L probability scores for a testing subject. To make the final classification decision, the estimated probability scores of the patches are simply fused using a majority voting strategy. Note that, different from LDMIL, LDSIL captures only the local structural information of brain sMR images.
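As a sketch of the final classification step shared by the ROI, VBM, and CLM baselines (a linear SVM on precomputed features), the 90-dimensional ROI case is shown below; the matrices are random placeholders standing in for the normalized GM volumes.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical feature matrices: rows are subjects, columns the 90 AAL GM volumes
# normalized by total intracranial volume; labels are 1 (AD) or 0 (NC).
X_train = np.random.rand(428, 90)
y_train = np.random.randint(0, 2, size=428)
X_test = np.random.rand(359, 90)

clf = LinearSVC(C=1.0)                  # linear SVM with C = 1, as in the ROI baseline
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)  # continuous scores, usable for ROC/AUC
pred = clf.predict(X_test)
```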

8.5.2 Experimental Settings

Two groups of experiments are performed to validate the effectiveness of the different methods: (1) AD diagnosis (i.e., AD vs. NC classification), and (2) MCI conversion prediction (i.e., pMCI vs. sMCI classification). To evaluate the robustness and generalization ability of a specific classification model, subjects from ADNI-1 are used as the training set, while subjects from ADNI-2 and MIRIAD are used as two independent testing sets. Seven metrics are employed for performance evaluation: the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), accuracy (ACC), sensitivity (SEN), specificity (SPE), F-score, and the Matthews correlation coefficient (MCC) [47]. For a fair comparison, in the LDMIL method and its variant (i.e., LDSIL), the image patch size is empirically set to 24 × 24 × 24, and the number of landmarks is L = 40. Similar to LDMIL, the network of LDSIL is optimized via the SGD algorithm [39], with a momentum coefficient of 0.9 and a learning rate of 10^-2. Also, the three landmark-based methods (i.e., CLM, LDSIL, and LDMIL) share the same landmark pool, while LDSIL and LDMIL use the same image patch size.
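A sketch of how the numeric metrics can be computed with scikit-learn; `y_true`, `y_score`, and `y_pred` below are small placeholder arrays standing in for the test-set labels and classifier outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                       # placeholder ground truth
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])      # placeholder scores
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "AUC": roc_auc_score(y_true, y_score),
    "ACC": accuracy_score(y_true, y_pred),
    "SEN": tp / (tp + fn),                      # sensitivity (recall of positives)
    "SPE": tn / (tn + fp),                      # specificity
    "F-Score": f1_score(y_true, y_pred),
    "MCC": matthews_corrcoef(y_true, y_pred),
}
print(metrics)
```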

8.5.3 Results of AD Diagnosis

In the first group of experiments, the task of AD versus NC classification is performed, with the model trained on ADNI-1 and tested on ADNI-2 and MIRIAD, respectively. Table 8.2 and Fig. 8.8a, b report the experimental results on the ADNI-2 and MIRIAD datasets.

Table 8.2 Results of AD classification and MCI conversion prediction on the ADNI-2 and MIRIAD datasets, with classifiers trained on the ADNI-1 dataset

AD versus NC on ADNI-2
           ROI     VBM     CLM     LDSIL   LDMIL
AUC        0.867   0.841   0.881   0.957   0.959
ACC        0.792   0.769   0.822   0.906   0.911
SEN        0.786   0.692   0.774   0.874   0.881
SPE        0.796   0.830   0.861   0.930   0.935
F-Score    0.769   0.726   0.794   0.891   0.897
MCC        0.580   0.530   0.638   0.808   0.819

AD versus NC on MIRIAD
           ROI     VBM     CLM     LDSIL   LDMIL
AUC        0.918   0.921   0.954   0.958   0.972
ACC        0.870   0.884   0.899   0.913   0.928
SEN        0.913   0.913   0.978   0.957   0.935
SPE        0.783   0.826   0.739   0.826   0.913
F-Score    0.903   0.913   0.928   0.936   0.945
MCC        0.704   0.739   0.770   0.802   0.839

pMCI versus sMCI on ADNI-2
           ROI     VBM     CLM     LDSIL   LDMIL
AUC        0.638   0.593   0.636   0.645   0.776
ACC        0.661   0.643   0.686   0.700   0.769
SEN        0.474   0.368   0.395   0.368   0.421
SPE        0.690   0.686   0.732   0.753   0.824
F-Score    0.277   0.221   0.256   0.252   0.333
MCC        0.120   0.040   0.097   0.095   0.207



Fig. 8.8 The ROC curves achieved by five different methods in a AD versus NC classification on ADNI-2, b AD versus NC classification on MIRIAD, and c pMCI versus sMCI classification on ADNI-2. The classification models are trained on ADNI-1

From Table 8.2, one can observe that the LDMIL method generally outperforms the competing methods in AD versus NC classification on ADNI-2 and MIRIAD. For instance, on the ADNI-2 dataset, the AUC value achieved by LDMIL is 0.959, which is much better than those yielded by ROI, VBM, and CLM (i.e., 0.867, 0.841, and 0.881, respectively). It is worth noting that the sMR images in ADNI-2 were acquired with 3T scanners, while those in ADNI-1 were acquired with 1.5T scanners. Although the sMR images in the training set (i.e., ADNI-1) and the testing set (i.e., ADNI-2) have different signal-to-noise ratios, the classification model learned by LDMIL can still reliably distinguish AD patients from NCs. This implies that the LDMIL method has strong robustness and generalization ability, which is particularly important for handling multi-center sMRIs in practical applications. As shown in Fig. 8.8a, b, the three landmark-based methods (i.e., CLM, LDSIL, and LDMIL) consistently outperform the ROI-based and voxel-based approaches (i.e., ROI and VBM) in AD classification. A possible reason is that the landmarks identified in this work have a stronger discriminative ability to capture the differences in structural brain changes between AD and NC subjects than the pre-defined ROIs and isolated voxels. In addition, it can be seen from Fig. 8.8a, b that the AUC values achieved by LDSIL (the single-instance variant of LDMIL) are comparable to those yielded by LDMIL in AD versus NC classification.

8.5.4 Results of MCI Conversion Prediction

The results of MCI conversion prediction (pMCI vs. sMCI classification) are reported in Table 8.2 and Fig. 8.8c, with the classifiers trained on ADNI-1 and tested on ADNI-2. It can be observed from Table 8.2 that, in most cases, the LDMIL method achieves better results than the other four methods in MCI conversion prediction. On the other hand, as shown in Fig. 8.8, the superiority of LDMIL over LDSIL is particularly obvious in pMCI versus sMCI classification, even though such superiority is not as distinct in AD versus NC classification. The reason could be that


LDMIL models both the local patch-level and the global bag-level structural information of the brain, while LDSIL captures only local patch-level information. Since the structural abnormalities caused by AD are obvious compared to NCs, even a few landmarks can be discriminative enough to distinguish AD from NC subjects. In contrast, the structural changes in MCI brains may be very subtle and distributed across multiple regions of the brain, so it is difficult to determine whether an MCI subject will convert to AD using only one or a few landmarks. In such a case, the global information conveyed by multiple landmarks could be crucial for classification. Moreover, because each landmark defines only a potentially (rather than a certainly) atrophic location (especially for MCI), it is unreasonable to assign the same subject-level class label to all patches extracted from a specific landmark location, as LDSIL does. Different from LDSIL, LDMIL models both the local information of image patches and the global information of multiple landmarks, by assigning class labels at the subject level rather than the patch level. This explains why LDMIL outperforms LDSIL in pMCI versus sMCI classification, although both methods yield similar results in AD versus NC classification.

8.6 Discussion

This section first investigates the influence of the essential parameters on the diagnostic performance, and then discusses the limitations of the current study and possible future research directions.

8.6.1 Influence of Parameters

To investigate the influence of the two parameters involved in the LDMIL method (i.e., the number of landmarks and the image patch size) on the classification performance, a group of experiments is performed by varying the patch size in the set {8 × 8 × 8, 12 × 12 × 12, 24 × 24 × 24, 36 × 36 × 36, 48 × 48 × 48, 60 × 60 × 60}. The AUC values of AD versus NC classification on the ADNI-2 dataset are reported in Fig. 8.9a. From Fig. 8.9a, one can see that the best results are obtained by LDMIL using a patch size of 48 × 48 × 48. Also, LDMIL is not very sensitive to the size of the image patch within the range of [24 × 24 × 24, 48 × 48 × 48]. When using patches of size 8 × 8 × 8, the AUC value (0.814) is not satisfactory. This implies that very small local patches are not capable of capturing enough informative brain structure. Similarly, the results are not good when using very large patches (e.g., 60 × 60 × 60), since subtle structural changes within a large patch can be dominated by uninformative normal regions. In addition, using large patches brings a huge computational burden, thus affecting the utility of LDMIL in practical applications.

Fig. 8.9 AUC achieved by the LDMIL method in a AD versus NC classification using different patch sizes on ADNI-2, and b AD classification (i.e., AD vs. NC) and MCI conversion prediction (i.e., pMCI vs. sMCI) tasks using different numbers of landmarks

In addition, the AUC values achieved by LDMIL using different numbers of landmarks are reported in Fig. 8.9b. From Fig. 8.9b, one can observe that the overall performance increases as the number of landmarks grows. In particular, in pMCI versus sMCI classification, LDMIL using fewer than 15 landmarks cannot yield satisfactory results. This implies that the global information conveyed by multiple landmarks can help boost the learning performance, especially for MCI subjects without obvious disease-induced structural changes. On the other hand, when the number of landmarks is larger than 35, the growth of the AUC values slows down and the results become relatively stable. Hence, it is reasonable to choose the number of landmarks in the range of [30, 50], since using more landmarks would only increase the number of network weights to be optimized in LDMIL.

8.6.2 Limitations and Future Research Directions

There are still several technical issues to be considered in the future. First, the number of training subjects is limited (i.e., hundreds), even though one can extract hundreds of thousands of image patches from multiple landmark locations for classifier training. Second, the pre-selection of local patches based on anatomical landmarks is still independent of feature extraction and classifier construction, which may hamper the diagnostic performance. Third, in the current implementation, the size of the image patches is fixed for all locations in the brain, while structural changes caused by dementia may vary across locations. Finally, data from different datasets are treated equally in the current study, without considering the differences in data distribution among datasets, which may negatively affect the generalization capability of the learned network. Accordingly, the anatomical-landmark-based deep learning framework for AD/MCI diagnosis can be further studied in the following directions. First, using the large number of longitudinal sMR images in the three datasets (i.e., ADNI-1, ADNI-2, and MIRIAD) could further improve the robustness of the learned model [48, 49]. Second, it is desirable to automatically identify both patch- and region-level discriminative locations in whole-brain sMRI, upon which both patch- and region-level feature representations can be jointly learned and fused in a data-driven manner to construct disease classification models [50]. Third, it is reasonable to extend the current framework by using multi-scale image patches to capture richer structural information of brain sMRIs for disease diagnosis [13, 14]. Finally, it is interesting to design domain adaptation methods [51] to deal with the challenge caused by different data distributions, which is expected to further improve the diagnostic performance.

8.7 Conclusion

Brain morphometric pattern analysis using sMRIs has been widely investigated for the automatic diagnosis of AD and MCI. Existing sMRI-based studies can be categorized into voxel-, patch-, ROI- and whole-image-level approaches. Patch-level methods provide intermediate-scale representations for brain sMRIs and have recently been used in AD/MCI diagnosis. To select informative patches from each sMR image, an anatomical landmark detection algorithm is discussed in this chapter, which identifies locations with statistically significant group differences between AD patients and NC subjects in the local structures of sMRIs. Based on these identified anatomical landmarks, this chapter further introduces a landmark-based deep learning framework that not only learns a local-to-global representation for sMRIs but also integrates feature learning and classifier training into a unified model for AD/MCI diagnosis. The boosted performance in disease classification achieved by this approach indicates that anatomical-landmark-based deep learning methods are possible alternatives for the clinical diagnosis of brain alterations associated with cognitive impairment/decline.

Acknowledgements This study was partly supported by NIH grants (EB006733, EB008374, EB009634, MH100217, AG041721, AG042599, AG010129, and AG030514). Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report, with details shown online.

References 1. Wolz, R., Aljabar, P., Hajnal, J.V., Hammers, A., Rueckert, D.: LEAP: learning embeddings for atlas propagation. NeuroImage 49(2), 1316–1325 (2010) 2. Zhang, J., Gao, Y., Gao, Y., Munsell, B., Shen, D.: Detecting anatomical landmarks for fast Alzheimer’s disease diagnosis. IEEE Trans. Med. Imaging 35(12), 2524–2533 (2016) 3 https://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.

pdf.


3. Ashburner, J., Friston, K.J.: Voxel-based morphometry: the methods. NeuroImage 11(6), 805– 821 (2000) 4. Jack, C., Petersen, R.C., Xu, Y.C., O’Brien, P.C., Smith, G.E., Ivnik, R.J., Boeve, B.F., Waring, S.C., Tangalos, E.G., Kokmen, E.: Prediction of AD with MRI-based hippocampal volume in mild cognitive impairment. Neurology 52(7), 1397 (1999) 5. Atiya, M., Hyman, B.T., Albert, M.S., Killiany, R.: Structural magnetic resonance imaging in established and prodromal Alzheimer disease: a review. Alzheimer Dis. Assoc. Disord. 17(3), 177–195 (2003) 6. Dubois, B., Chupin, M., Hampel, H., Lista, S., Cavedo, E., Croisile, B., Tisserand, G.L., Touchon, J., Bonafe, A., Ousset, P.J., et al.: Donepezil decreases annual rate of hippocampal atrophy in suspected prodromal Alzheimer’s disease. Alzheimer’s Dement. 11(9), 1041–1049 (2015) 7. Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehéricy, S., Habert, M.O., Chupin, M., Benali, H., Colliot, O.: Automatic classification of patients with Alzheimer’s disease from structural MRI: a comparison of ten methods using the ADNI database. NeuroImage 56(2), 766–781 (2011) 8. Lötjönen, J., Wolz, R., Koikkalainen, J., Julkunen, V., Thurfjell, L., Lundqvist, R., Waldemar, G., Soininen, H., Rueckert, D.: Fast and robust extraction of hippocampus from MR images for diagnostics of Alzheimer’s disease. NeuroImage 56(1), 185–196 (2011) 9. Liu, M., Zhang, D., Shen, D.: View-centralized multi-atlas classification for Alzheimer’s disease diagnosis. Hum. Brain Mapp. 36(5), 1847–1865 (2015) 10. Montagne, A., Barnes, S.R., Sweeney, M.D., Halliday, M.R., Sagare, A.P., Zhao, Z., Toga, A.W., Jacobs, R.E., Liu, C.Y., Amezcua, L., et al.: Blood-brain barrier breakdown in the aging human hippocampus. Neuron 85(2), 296–302 (2015) 11. Liu, M., Zhang, J., Yap, P.T., Shen, D.: View-aligned hypergraph learning for Alzheimer’s disease diagnosis with incomplete multi-modality data. Med. Image Anal. 36, 123–134 (2017) 12. Liu, M., Zhang, D., Shen, D.: Relationship induced multi-template learning for diagnosis of Alzheimer’s disease and mild cognitive impairment. IEEE Trans. Med. Imaging 35(6), 1463– 1474 (2016) 13. Lian, C., Zhang, J., Liu, M., Zong, X., Hung, S.C., Lin, W., Shen, D.: Multi-channel multi-scale fully convolutional network for 3D perivascular spaces segmentation in 7T MR images. Med. Image Anal. 46, 106–117 (2018) 14. Liu, M., Zhang, J., Adeli, E., Shen, D.: Joint classification and regression via deep multi-task multi-channel learning for Alzheimer’s disease diagnosis. IEEE Trans. Biomed. Eng. (2018) 15. Liu, M., Zhang, D., Adeli, E., Shen, D.: Inherent structure-based multiview learning with multitemplate feature representation for Alzheimer’s disease diagnosis. IEEE Trans. Biomed. Eng. 63(7), 1473–1482 (2016) 16. Liu, M., Zhang, J., Nie, D., Yap, P.T., Shen, D.: Anatomical landmark based deep feature representation for MR images in brain disease diagnosis. IEEE J. Biomed. Health Inform. 22(5), 1476–1485 (2018) 17. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, Berlin (2001) 18. Small, G.W., Ercoli, L.M., Silverman, D.H., Huang, S.C., Komo, S., Bookheimer, S.Y., Lavretsky, H., Miller, K., Siddarth, P., Rasgon, N.L., et al.: Cerebral metabolic and cognitive decline in persons at genetic risk for Alzheimer’s disease. Proc. Natl. Acad. Sci. 97(11), 6037–6042 (2000) 19. 
Lian, C., Ruan, S., Denœux, T., Jardin, F., Vera, P.: Selecting radiomic features from FDG-PET images for cancer treatment outcome prediction. Med. Image Anal. 32, 257–268 (2016) 20. Wolz, R., Aljabar, P., Hajnal, J.V., Lötjönen, J., Rueckert, D.: Nonlinear dimensionality reduction combining MR imaging with non-imaging information. Med. Image Anal. 16(4), 819–830 (2012) 21. Tong, T., Wolz, R., Gao, Q., Guerrero, R., Hajnal, J.V., Rueckert, D.: Multiple instance learning for classification of dementia in brain MRI. Med. Image Anal. 18(5), 808–818 (2014) 22. Coupé, P., Eskildsen, S.F., Manjón, J.V., Fonov, V.S., Pruessner, J.C., Allard, M., Collins, D.L.: Scoring by nonlocal image patch estimator for early detection of Alzheimer’s disease. NeuroImage: Clin. 1(1) (2012) 141–152


23. Lian, C., Ruan, S., Denœux, T., Li, H., Vera, P.: Spatial evidential clustering with adaptive distance metric for tumor segmentation in FDG-PET images. IEEE Trans. Biomed. Eng. 65(1), 21–30 (2017) 24. Lian, C., Ruan, S., Denœux, T., Li, H., Vera, P.: Joint tumor segmentation in PET-CT images using co-clustering and fusion based on belief functions. IEEE Trans. Image Process. 28(2), 755–766 (2019) 25. Liu, M., Zhang, J., Adeli, E., Shen, D.: Landmark-based deep multi-instance learning for brain disease diagnosis. Med. Image Anal. 43, 157–168 (2018) 26. Jack, C.R., Bernstein, M.A., Fox, N.C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P.J., L Whitwell, J., Ward, C.: The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging 27(4), 685–691 (2008) 27. Malone, I.B., Cash, D., Ridgway, G.R., MacManus, D.G., Ourselin, S., Fox, N.C., Schott, J.M.: MIRIAD-Public release of a multiple time point Alzheimer’s MR imaging dataset. NeuroImage 70, 33–36 (2013) 28. Cheng, B., Liu, M., Suk, H.I., Shen, D., Zhang, D.: Multimodal manifold-regularized transfer learning for MCI conversion prediction. Brain Imaging Behav. 1–14 (2015) 29. Sled, J.G., Zijdenbos, A.P., Evans, A.C.: A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging 17(1), 87–97 (1998) 30. Mardia, K.: Assessment of multinormality and the robustness of Hotelling’s T2 test. Appl. Stat. 163–171 (1975) 31. Holmes, C.J., Hoge, R., Collins, L., Woods, R., Toga, A.W., Evans, A.C.: Enhancement of MR images using registration for signal averaging. J. Comput. Assist. Tomogr. 22(2), 324–333 (1998) 32. Ashburner, J., Friston, K.J.: Why voxel-based morphometry should be used. NeuroImage 14(6), 1238–1243 (2001) 33. Zhang, J., Liang, J., Zhao, H.: Local energy pattern for texture classification using self-adaptive quantization thresholds. IEEE Trans. Image Process. 22(1), 31–42 (2013) 34. Leung, T., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. Int. J. Comput. Vis. 43(1), 29–44 (2001) 35. De Jong, L., Van der Hiele, K., Veer, I., Houwing, J., Westendorp, R., Bollen, E., De Bruin, P., Middelkoop, H., Van Buchem, M., Van Der Grond, J.: Strongly reduced volumes of putamen and thalamus in Alzheimer’s disease: an MRI study. Brain 131(12), 3277–3285 (2008) 36. Liu, M., Zhang, D., Shen, D.: Ensemble sparse classification of Alzheimer’s disease. NeuroImage 60(2), 1106–1116 (2012) 37. Yan, Z., Zhan, Y., Peng, Z., Liao, S., Shinagawa, Y., Zhang, S., Metaxas, D.N., Zhou, X.S.: Multi-instance deep learning: discover discriminative local anatomies for bodypart recognition. IEEE Trans. Med. Imaging 35(5), 1332–1343 (2016) 38. Amores, J.: Multiple instance classification: review, taxonomy and comparative study. Artif. Intell. 201, 81–105 (2013) 39. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004) 40. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, vol. 16, pp. 265–283 (2016) 41. Zhang, D., Shen, D.: Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2), 895–907 (2012) 42. 
Cheng, B., Liu, M., Zhang, D., Munsell, B.C., Shen, D.: Domain transfer learning for MCI conversion prediction. IEEE Trans. Biomed. Eng. 62(7), 1805–1817 (2015) 43. Liu, M., Zhang, D., Chen, S., Xue, H.: Joint binary classifier learning for ECOC-based multiclass classification. IEEE Trans. Pattern Anal. Mach. Intell. 38(11), 2335–2341 (2016) 44. Zhang, Y., Brady, M., Smith, S.: Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Trans. Med. Imaging 20(1), 45–57 (2001)


45. Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., Mazoyer, B., Joliot, M.: Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage 15(1), 273– 289 (2002) 46. Shen, D., Davatzikos, C.: HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans. Med. Imaging 21(11), 1421–1439 (2002) 47. Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405(2) (1975) 442–451 48. Wang, M., Zhang, D., Shen, D., Liu, M.: Multi-task exclusive relationship learning for Alzheimer’s disease progression prediction with longitudinal data. Med. Image Anal. 53, 111– 122 (2019) 49. Jie, B., Liu, M., Liu, J., Zhang, D., Shen, D.: Temporally constrained group sparse learning for longitudinal data analysis in Alzheimer’s disease. IEEE Trans. Biomed. Eng. 64(1), 238–249 (2017) 50. Lian, C., Liu, M., Zhang, J., Shen, D.: Hierarchical fully convolutional network for joint atrophy localization and Alzheimer’s disease diagnosis using structural MRI. IEEE Trans. Pattern Anal. Mach. Intell. (2019) 51. Wang, M., Zhang, D., Huang, J., Shen, D., Liu, M.: Low-rank representation for multi-center autism spectrum disorder identification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 647–654. Springer (2018)

Chapter 9

Multi-scale Deep Convolutional Neural Networks for Emphysema Classification and Quantification Liying Peng, Lanfen Lin, Hongjie Hu, Qiaowei Zhang, Huali Li, Qingqing Chen, Dan Wang, Xian-Hua Han, Yutaro Iwamoto, Yen-Wei Chen, Ruofeng Tong and Jian Wu Abstract In this work, we aim at classification and quantification of emphysema in computed tomography (CT) images of lungs. Most previous works are limited to extracting low-level features or mid-level features without enough high-level information. Moreover, these approaches do not take the characteristics (scales) of different emphysema into account, which are crucial for feature extraction. In contrast to previous works, we propose a novel deep learning method based on multi-scale deep convolutional neural networks. There are three contributions for this paper. First, we propose to use a base residual network with 20 layers to extract more highlevel information. Second, we incorporate multi-scale information into our deep neural networks so as to take full consideration of the characteristics of different emphysema. A 92.68% classification accuracy is achieved on our original dataset. Finally, based on the classification results, we also perform the quantitative analysis of emphysema in 50 subjects by correlating the quantitative results (the area percent-

L. Peng · L. Lin (B) · R. Tong · J. Wu College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang 310000, China e-mail: [email protected] L. Peng e-mail: [email protected] H. Hu · Q. Zhang · H. Li · Q. Chen · D. Wang Department of Radiology, Sir Run Run Shaw Hospital, Zhejiang University, Hangzhou, Zhejiang 310000, China e-mail: [email protected] Q. Zhang e-mail: [email protected] H. Li e-mail: [email protected] Q. Chen e-mail: [email protected] D. Wang e-mail: [email protected] © Springer Nature Switzerland AG 2020 Y.-W. Chen and L. C. Jain, Deep Learning in Healthcare, Intelligent Systems Reference Library 171, https://doi.org/10.1007/978-3-030-32606-7_9


age of each class) with pulmonary functions. We show that centrilobular emphysema (CLE) and panlobular emphysema (PLE) have strong correlation with the pulmonary functions and the sum of CLE and PLE can be used as a new and accurate measure of emphysema severity instead of the conventional measure (sum of all subtypes of emphysema). The correlations between the new measure and various pulmonary functions are up to |r| = 0.922 (r is correlation coefficient).

9.1 Introduction Emphysema is a major component of chronic obstructive pulmonary disease (COPD), which is a leading cause of morbidity and mortality throughout the world [1]. The condition causes shortness of breath by excessive expansion of the alveoli. In general, emphysema can be subcategorized into three major subtypes on autopsy: centrilobular emphysema (CLE), paraseptal emphysema (PSE), and panlobular emphysema (PLE) [2]. They have different pathophysiological significance [3, 4]. For example, CLE is commonly associated with cigarette smoking. PSE is often not associated with significant symptoms or physiological impairments. PLE is generally associated with α1 -antitrypsin deficiency (AATD). Therefore, classification and quantification of emphysema are important. Computed tomography (CT) is currently considered to be the most accurate imaging technique for detecting emphysema, determining its subtype, and evaluating its severity [5]. In CT, the subtypes of emphysema have distinct radiological appearances [6]. Figure 9.1 shows typical examples of normal tissue and three subtypes of emphysema indicated by red arrows or red curves. It can be seen that, CLE generally appears as scattered small low attenuation areas. PSE is shown as low attenuation areas aligned in a row along a visceral pleura [3]. PLE usually manifests as a wide range low attenuation region with fewer and smaller lung vessels [7]. Many studies have been conducted to classify emphysema in CT images, which can be classified into unsupervised [6, 8–11] and supervised methods [7, 12–20, 28, 29]. Unsupervised methods aim at discovering new emphysema subtypes that go beyond the standard subtypes identified on autopsy. Binder et al. built a generative model for the discovery of disease subtypes within emphysema and of X.-H. Han Yamaguchi University, Yamaguchi, Japan e-mail: [email protected] Y. Iwamoto Graduate School of Information Science and Engineering, Ritsumeikan University, Kyoto, Japan e-mail: [email protected] Y.-W. Chen Zhejiang Lab, Hangzhou City, China e-mail: [email protected]


Fig. 9.1 Typical examples of different lung tissue patterns indicated by red arrows or red curves. a Normal tissue (NT). b Centrilobular emphysema (CLE). c Paraseptal emphysema (PSE). d Panlobular emphysema (PLE)

patient clusters that are characterized by distinct distributions of such subtypes [8]. In [6, 9], the authors proposed to generate unsupervised lung texture prototypes according to texture appearance, and to encode lung CT scans with prototype histograms. Song et al. use a variant of the Latent Dirichlet Allocation model to discover lung macroscopic patterns in an unsupervised way from lung regions that encode emphysematous areas [10]. Additionally, Yang et al. presented an unsupervised framework for integrating spatial and textural information to discover localized lung texture patterns of emphysema [11]. Compared with unsupervised approaches for emphysema classification, supervised methods focus on classifying the standard emphysema subtypes identified on autopsy, which have different pathophysiologic importance [3, 4]. One common way of characterizing patterns of emphysema is based on the local intensity distribution, such as adaptive intensity histogram [7] and kernel density estimation (KDE) [12]. Another class of approaches describes the morphology of emphysema using texture analysis techniques [7, 13–20]. Uppaluri et al. was the first to use texture features for classifying emphysema in CT images [13]. Since then, many approaches were proposed to classify emphysematous patterns using this idea, such as the adaptive multiple feature method (AMFM) [14], gradient magnitude [15], and the gray-level difference method [16]. More recently, several methods with higher accuracy were proposed. Sørensen et al. designed a combined model of local binary patterns (LBP) and intensity histogram for characterizing emphysema lesions [7]. In [17], the author put forward the joint Weber-based rotation invariant uniform local ternary pattern (JWRIULTP) for classifying emphysema, which allows for a much richer representation and also takes the comprehensive information of the image into account. Furthermore, some of these latest studies adopted learned schemes for extracting features, such as texton-based methods [18, 19] and sparse representation models [20]. In the last years, some attempts have revealed the potential of deep learning techniques on lung disease classification. For example, in the well-known lung nodule analysis (LUNA16) challenge, convolutional neural network (CNN) architectures were used by all the top performing systems [21, 22]. Moreover, some CNN systems were designed for a specific task. For lung nodule classification, vessels may be classified as nodules when the system only processes one of its views. Setio et


al. proposed a multi-view CNN for false positive reduction of pulmonary nodules [23]. For each candidate, the authors extracted multiple 2D views in fixed planes. Each 2D view was then processed by one CNN stream, and the CNN features were integrated to calculate a final score. Unlike arbitrary objects in natural images, which contain complicated structures with specific orientations, the patterns of interstitial lung disease (ILD) are characterized by local textural features. Anthimopoulos et al. designed a CNN for the classification of ILD patterns, which can capture the low-level textural features of lung tissue [24]. In another work on ILD classification, to effectively extract texture and geometric features of lung tissue, the authors in [25] designed a CNN with rotation-invariant Gabor-LBP representations of lung tissue patches as inputs. Finally, in [26, 27], established CNNs (i.e., AlexNet, GoogLeNet) with transfer learning were used to classify ILD. Although deep learning is widely used in the classification of lung diseases, it has been applied in only two studies [28, 29] for emphysema classification. The networks in these two studies used two or three convolutional layers, so they are not able to capture high-level features. Besides [28, 29], most existing methods for classifying emphysema are limited to extracting low-level or mid-level features, which have limited ability to distinguish different patterns. In contrast to previous works, we present a deep learning approach based on multi-scale deep convolutional neural networks (DCNN). The features learned from data involve more high-level information that humans cannot easily discover. Moreover, different subtypes of emphysema have their own distinct characteristics (scales), but most existing approaches for the classification of emphysema do not take these characteristics into account. In this work, we incorporate multi-scale information into our deep neural networks for the following reasons: (1) The size of CLE lesions is usually much smaller than that of PLE (diffuse regions), so we have to make a trade-off between local and global information. (2) Because PSE is always adjacent to a pleural margin, context information, which can be captured by a network with larger-scale inputs, is important in defining PSE. Normal tissue, which is also a target pattern we need to classify, has more pulmonary vessels than emphysema lesions, and a network with smaller-scale inputs is more suitable for capturing such detailed information. This motivates a multi-scale method that captures multi-scale information from the input image.

9.2 Methods

In this section, we first show how patches are generated from the annotations (Sect. 9.2.1). Subsequently, we present our multi-scale DCNN for emphysema classification. For simplicity, we first introduce the architecture in the single-scale scenario (Sect. 9.2.2) and then propose the multi-scale models (Sect. 9.2.3). Figure 9.2 shows an overview of the proposed approach.


Fig. 9.2 Overview of the proposed method

9.2.1 Patch Preparation

Before extracting patches, we first extract the lung field. As shown in Fig. 9.2 (right), for each annotated pixel, patches of different scales (27 × 27, 41 × 41, and 61 × 61) are extracted from its neighborhood. In this chapter, different scales mean different input sizes; the inputs are used at their original size, without resizing. The label assigned to each patch is the same as the label of the central pixel. All patches are 2D samples extracted from slice images of the 3D scans. Note that we ensure all categories are balanced when splitting the dataset, i.e., each category has the same number of labelled patches in the training, validation, and test sets, respectively. The classification accuracy is evaluated at the patch level. Because the dataset is divided by patient, patches extracted from the same patient, or even the same scan, cannot exist in both the training set and the test set. During the quantification stage, each pixel in a test image is scanned once, and the patch is extracted from the neighborhood of the focused pixel.
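As a concrete illustration, the following sketch shows one way the multi-scale patch extraction described above could be implemented. The array layout, the variable names, and the rule for skipping pixels whose patches would cross the image border are illustrative assumptions, not the authors' code.

```python
import numpy as np

PATCH_SIZES = (27, 41, 61)  # the three input scales used in this chapter

def extract_multiscale_patches(ct_slice, label_map, row, col, sizes=PATCH_SIZES):
    """Extract square patches of several sizes centred on one annotated pixel.

    ct_slice  : 2D array holding one CT slice.
    label_map : 2D array of the same shape with per-pixel class labels.
    Returns ({size: patch}, label of the central pixel), or None if any
    patch would cross the image border (an assumption for this sketch).
    """
    patches = {}
    for size in sizes:
        half = size // 2
        r0, r1 = row - half, row - half + size
        c0, c1 = col - half, col - half + size
        if r0 < 0 or c0 < 0 or r1 > ct_slice.shape[0] or c1 > ct_slice.shape[1]:
            return None
        patches[size] = ct_slice[r0:r1, c0:c1]
    return patches, int(label_map[row, col])

# Toy usage with random data standing in for a 512 x 512 CT slice.
rng = np.random.default_rng(0)
ct_slice = rng.normal(size=(512, 512)).astype(np.float32)
label_map = rng.integers(0, 4, size=(512, 512))   # 0: NE, 1: CLE, 2: PLE, 3: PSE
result = extract_multiscale_patches(ct_slice, label_map, row=256, col=300)
if result is not None:
    patches, label = result
    print({s: p.shape for s, p in patches.items()}, "label:", label)
```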

9.2.2 Single Scale Architecture

The base network is built on a 20-layer ResNet [30], which has achieved excellent performance in image classification. We first briefly review ResNet. ResNet employs processing blocks called residual blocks to ease the training of deeper networks. The residual block is expressed as

H(x) = F(x) + x    (9.1)

where x is the input of a convolutional layer and F(x) is the residual function. A basic residual unit consists of two convolutional layers, and batch normalization (BN) is adopted after every convolution and before the activation (ReLU). By stacking such structures, 20-layer, 32-layer, 44-layer, 56-layer, 110-layer, and 1202-layer networks can be constructed. In order to adapt the network to our problem (small inputs and only 4 classes), we remove the pooling layer and change the number of filters in some layers. Figure 9.2 (left) shows the details. We explain why we choose a 20-layer structure in the experimental section.
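To make the single-scale architecture concrete, the sketch below builds a small residual network with the Keras API of TensorFlow (the framework the authors report using). The filter counts, the number of residual units per stage, and the use of global average pooling before the 4-way softmax are placeholders chosen for illustration; the chapter's exact 20-layer configuration is only given in Fig. 9.2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic residual unit: two 3x3 convolutions, BN after each convolution
    and before the ReLU, plus a shortcut connection (Eq. 9.1)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:          # 1x1 projection when widths differ
        shortcut = layers.Conv2D(filters, 1, padding="same", use_bias=False)(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.ReLU()(y)

def build_single_scale_net(patch_size, num_classes=4,
                           widths=(16, 32, 64), blocks_per_stage=3):
    """A ResNet-style classifier for one input scale: no intermediate pooling,
    small inputs, global average pooling, and a 4-way softmax output."""
    inputs = tf.keras.Input(shape=(patch_size, patch_size, 1))
    x = layers.Conv2D(widths[0], 3, padding="same", use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for w in widths:
        for _ in range(blocks_per_stage):
            x = residual_block(x, w)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_single_scale_net(41)
model.summary()
```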

9.2.3 Multi-scale Architecture

As shown in Fig. 9.2 (middle), two approaches for fusing information from different scales are investigated:

9.2.3.1 Multi-scale Early Fusion (MSEF)

Due to the nature of the emphysema classification problem, one target category tends to be identified on a certain scale and the most suitable scales for different target categories may vary. Namely, we cannot find the best scale for all situations. Hence, it is necessary to incorporate information from various scales into our deep neural network. As shown in Fig. 9.2 (right), the convolutional layers for each scale are independent. We combine the outputs produced by average pooling layers and feed them into a 4-way shared fully connected layer with softmax to calculate a cross entropy classification loss [31], which can be formulated as

Loss(y, z) = −∑_{k=1}^{K} y_k log(z_k)    (9.2)

where K is the number of classes, z is the probability vector yielded by softmax layers, and y is the ground truth label.
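A sketch of the early-fusion idea in Keras is given below: each scale has its own independent convolutional branch, the average-pooled branch outputs are concatenated, and one shared 4-way softmax layer is trained with the cross-entropy loss of Eq. (9.2) (Keras's categorical cross-entropy). The branch encoder is deliberately simplified and stands in for the per-scale residual networks of Fig. 9.2; it is an illustrative assumption rather than the authors' exact model.

```python
import tensorflow as tf
from tensorflow.keras import layers

def branch(patch_size, width=32, name=None):
    """Simplified stand-in for one per-scale convolutional stream."""
    inputs = tf.keras.Input(shape=(patch_size, patch_size, 1), name=name)
    x = layers.Conv2D(width, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(width, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)      # pooled features of this scale
    return inputs, x

def build_msef_model(scales=(27, 41, 61), num_classes=4):
    inputs, features = [], []
    for s in scales:
        inp, feat = branch(s, name=f"scale_{s}")
        inputs.append(inp)
        features.append(feat)
    fused = layers.Concatenate()(features)      # early fusion of pooled outputs
    outputs = layers.Dense(num_classes, activation="softmax")(fused)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",   # Eq. (9.2)
                  metrics=["accuracy"])
    return model

msef = build_msef_model()
msef.summary()
```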

9.2.3.2 Multi-scale Late Fusion (MSLF)

Another way to fuse the multi-scale information is to train separate networks, each of which focuses on one scale. Note that the cross-entropy loss function is also used when training these networks. In the fusion stage, we sum the probability vectors and take their average. Figure 9.2 (right) shows the principle of our MSLF model. The final output probabilities can be expressed as

P = (1/N) ∑_{i=1}^{N} p_i    (9.3)

where N is the number of streams and p_i is the output of each stream.
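The late-fusion rule of Eq. (9.3) amounts to averaging the softmax outputs of the independently trained single-scale networks. A minimal sketch, assuming the trained per-scale models and the matching test patches are already available, is:

```python
import numpy as np

def mslf_predict(models, patch_sets):
    """Average the class-probability vectors of N single-scale models (Eq. 9.3).

    models     : list of trained Keras models, one per scale.
    patch_sets : list of input arrays, patch_sets[i] matching models[i]
                 (same samples, different patch sizes).
    Returns the fused class predictions and the averaged probabilities.
    """
    probs = [m.predict(x, verbose=0) for m, x in zip(models, patch_sets)]
    fused = np.mean(probs, axis=0)          # P = (1/N) * sum_i p_i
    return np.argmax(fused, axis=1), fused
```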

9.3 Experiments

This section presents and discusses the experimental results. Before that, we describe the dataset used in this study.

9.3.1 Dataset

We have two datasets (see Table 9.1). All the data came from two hospitals and were acquired using seven types of CT scanners (the scans in the first dataset were produced by three CT scanners and those in the second dataset by another four) with a slice collimation of 1–2 mm, a matrix of 512 × 512 pixels, and an in-plane resolution of 0.62–0.71 mm. The images were reconstructed at 1 mm thickness. The radiation dose ranges from 2 to 10 mSv. The first dataset includes 91 high-resolution computed tomography (HRCT) volumes annotated manually by two experienced radiologists and checked by one experienced chest radiologist. Since emphysema is a diffuse lung disease (emphysema lesions spread over a wide area of the lungs), fully annotating the lesions requires considerable time and effort from experts. We estimated that the average time for fully annotating one case was approximately 36 man-hours; another 3–5 h were then required to check and refine the annotations. Hence, the radiologists randomly selected about one-tenth to half of the lesions in each case to annotate. This partial annotation significantly reduced the radiologists' workload. Four types of patterns were annotated: CLE, PLE, PSE, and non-emphysema (NE), which corresponds to tissue without emphysema (but possibly with other lung diseases). Considering the clinical applicability of our task, the radiologists annotated almost all the clinically common cases. The annotated lesions range from mild, moderate, and severe CLE to mild and substantial PSE and PLE. The radiologists annotated the dataset by manually drawing masks for each type of pattern. Figure 9.3 shows an example of annotated data. The first dataset was used for the evaluation of classification accuracy reported in Sect. 9.3.2. Since it does not include complete pulmonary function evaluations, we collected an additional 50 HRCT volumes from patients with complete pulmonary function evaluations for the quantitative analysis of emphysema reported in Sect. 9.3.3.

Table 9.1 Details of our dataset. N is the number of subjects acquired from one CT scanner

  Dataset          Manufacturer   Model name         N
  First dataset    SIEMENS        Sensation 16       65
  First dataset    GE             LightSpeed VCT     16
  First dataset    TOSHIBA        Aquilion ONE       10
  Second dataset   SIEMENS        Definition AS 40   15
  Second dataset   SIEMENS        Definition AS 20    6
  Second dataset   SIEMENS        Definition Flash    8
  Second dataset   SIEMENS        Force              21

Fig. 9.3 Example of annotated data. Red mask: non-emphysema tissue. Green mask: CLE lesions. a axial view. b coronal view


9.3.2 Evaluation of Classification Accuracy

9.3.2.1 Experimental Setup

Our classification experiments were conducted on the 91 annotated subjects of the first dataset: 59 subjects (about 720,000 patches) for training, 14 subjects (about 140,000 patches) for validation, and 18 subjects (about 160,000 patches) for testing. The Adam optimization algorithm [32] was applied to learn the parameters of our networks. The learning rate was exponentially decayed from 0.01 to 1e-4. The weights were initialized using the normalized initialization proposed by Glorot and Bengio [33]. Training was stopped when the accuracy on the validation set did not improve for 3 epochs. The batch size was set to 50. The proposed approach was implemented in Python using the TensorFlow framework. All experiments were performed on a machine with an Intel Core i7-7700K CPU @ 4.2 GHz, an NVIDIA GeForce Titan X GPU, and 16 GB of RAM.
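The training recipe described above can be written in TensorFlow roughly as follows. The schedule below is one way to realise an exponential decay from 0.01 towards 1e-4; the number of decay steps is an assumption, since the chapter does not report it, and the commented-out compile/fit lines are only a usage sketch around hypothetical data arrays.

```python
import tensorflow as tf

# Exponential learning-rate decay from 0.01 toward 1e-4 (decay_steps is assumed).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10_000, decay_rate=0.5)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# Stop training when validation accuracy has not improved for 3 epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=3, restore_best_weights=True)

# Glorot ("normalized") initialization is the Keras default for Conv2D/Dense.
# 'model', 'train_patches', 'train_labels', etc. are placeholders for the
# pipeline of Sect. 9.2.1:
# model.compile(optimizer=optimizer,
#               loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_patches, train_labels,
#           validation_data=(val_patches, val_labels),
#           batch_size=50, epochs=100, callbacks=[early_stopping])
```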

9.3.2.2 Parameter Optimization

Our proposed networks have several hyper-parameters to be optimized. The most critical choices for the architecture are the number of layers and scales of the inputs. The number of layers affects the classification accuracy of the neural networks and needs to be selected carefully. In order to study how the performance of our system changes as we modify the number of layers, we fix the size of input and compare the classification accuracy at different number of layers. As shown in Fig. 9.4, the classification accuracy increases with the number of layers and plateaus at n = 20. Though the ResNet with 56 layers is widely used in many applications, there is no significant difference between the ResNet-56 and ResNet-20 in our experiments as

Fig. 9.4 Classification accuracy for different number of layers on fixed patch (input) size. Each color represents one patch (input) size


Fig. 9.5 The effect of scales (patch/input size) on the accuracy of each category. a NE. b CLE. c PLE. d PSE

shown in Fig. 9.4. Since the ResNet-20 has fewer filters and lower complexity than ResNet-56, we choose the 20-layer ResNet in this work. Figure 9.5 demonstrates the effect of scales (patch/input size) on the accuracy of each category. The most suitable scales for different target categories are different: for non-emphysema tissue, the inputs of 27 × 27 generate the best result; for CLE, the best scale is 41 × 41; for PLE and PSE, the highest classification accuracy is obtained with inputs of 61 × 61. It suggests that non-emphysema tissue and CLE tend to be identified on a smaller scale and a larger scale is more suitable for PLE and PSE. However, if the input size is too small or too large, the system is unable to model the classification problem precisely. Therefore, patches of sizes 27 × 27, 41 × 41, and 61 × 61 are selected as inputs of the multi-scale neural networks.

9.3.2.3 Single Scale Versus Multiple Scales

In this subsection, we examine the effect of fusing multi-scale information. The relevant results are listed in Table 9.2. Note that both the MSEF and MSLF models outperform every single-scale (SS) model (27 × 27, 41 × 41, and 61 × 61). Specifically, the accuracy of the MSEF model is higher than that of the other models for three classes (all except PLE). Since incorporating multi-scale information leads to significantly higher accuracy, we conclude that the multi-scale method is highly effective in comparison with the single-scale setting.


Table 9.2 The comparison between single-scale models and multi-scale models

           27 × 27 (%)   41 × 41 (%)   61 × 61 (%)   MSEF (%)   MSLF (%)
  NE          93.19         91.77         86.04        94.05       91.98
  CLE         86.85         88.87         86.50        91.17       89.02
  PLE         83.61         92.18         95.06        89.48       93.78
  PSE         87.35         89.52         95.52        95.89       92.36
  Avg.        87.77         90.58         90.81        92.68       91.80

9.3.2.4 Comparison to the State-of-the-art

In this subsection, we compare our approach with three state-of-the-art methods for emphysema classification and one deep learning method proposed for interstitial lung disease classification: (1) RILBP+INT: joint rotation-invariant local binary patterns and intensity histograms for classifying emphysema, published in [7]; (2) JWRIULTP+INT: joint 3-D histograms for the classification of emphysema, published in [17]; (3) Texton-based: a texton-based approach for classifying emphysema, published in [18]; and (4) Anthimopoulos's method: a deep learning method for interstitial lung disease classification, published in [24]. The results are presented in Fig. 9.6. Our method significantly outperforms the other approaches when the training size is larger than 5,000 patches. Furthermore, the accuracy of the three emphysema classification methods plateaus near 79% when the training size exceeds 80,000 patches, while the accuracy of our method keeps growing.

9.3.3 Emphysema Quantification

In this section, based on the classification results, we quantify the whole lung area of 50 subjects (the second dataset, with complete pulmonary function evaluations) by calculating the area percentage of each class (CLE%, PLE%, and PSE%, respectively), and correlate the quantitative results (area percentages) with various pulmonary function indices used for diagnosing subjects with COPD [34]. Some visual results of full-lung classification are shown in Fig. 9.7. It can be seen that the auto-annotations (classification results) of the proposed method are similar to the annotations of the radiologists (manual annotations). Correlations of the quantitative results with various pulmonary function indices are shown in Table 9.3. The pulmonary function indices include the forced expiratory volume in one second divided by its predicted value (FEV1%), the forced expiratory volume in one second/forced vital capacity (FEV1/FVC), peak expiratory flow (PEF), forced expiratory flow at 25–75% of forced vital capacity (FEF25, FEF50, FEF75), maximum voluntary ventilation (MVV), and the forced expiratory volume in one second/the volume change at the mouth between the positions of full inspiration and complete expiration (FEV1/VCmax) [34, 35]. We found that PLE% and CLE% correlate significantly with the pulmonary function indices listed in the table, achieving correlation coefficients ranging from |r| = 0.629 to |r| = 0.889.
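The area percentages and their correlation with a pulmonary function index can be computed, for example, as sketched below. The label codes, variable names, toy data, and the use of Pearson's correlation from SciPy are illustrative assumptions consistent with the r and p values reported in Table 9.3, not the authors' analysis script.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical label codes for the per-pixel classification result of one subject.
NE, CLE, PLE, PSE = 0, 1, 2, 3

def area_percentages(pred_volume, lung_mask):
    """Percentage of the lung area predicted as each emphysema subtype."""
    lung = pred_volume[lung_mask]
    total = lung.size
    return {"CLE%": 100.0 * np.sum(lung == CLE) / total,
            "PLE%": 100.0 * np.sum(lung == PLE) / total,
            "PSE%": 100.0 * np.sum(lung == PSE) / total}

rng = np.random.default_rng(0)

# Toy per-subject example: a small labelled volume and a full lung mask.
pred_volume = rng.integers(0, 4, size=(20, 64, 64))
lung_mask = np.ones(pred_volume.shape, dtype=bool)
print(area_percentages(pred_volume, lung_mask))

# Correlating a quantitative measure with a pulmonary function index
# over all 50 subjects (synthetic values in place of the real measurements).
cle_plus_ple = rng.uniform(0, 60, size=50)                      # CLE% + PLE%
fev1_percent = 100 - 1.2 * cle_plus_ple + rng.normal(0, 5, 50)  # FEV1%
r, p = pearsonr(cle_plus_ple, fev1_percent)
print(f"r = {r:.3f}, p = {p:.4f}")
```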

Fig. 9.6 The comparison of the proposed method with the state-of-the-art methods

Fig. 9.7 Examples of the classification results. Each row represents a subject. a, e, i Classification results in coronal view. b, f, j Typical original HRCT slices from the subjects of a, e, i, respectively. c, g, k Auto-annotated masks of our proposed method. d, h, l Manually annotated masks of radiologists. Green mask: CLE lesions. Blue mask: PLE lesions. Yellow mask: PSE lesions


Table 9.3 Correlation between quantitative results and various pulmonary function indices. Each cell shows the correlation coefficient r with its p value in parentheses. "**" means the correlation is statistically significant. |r| ≥ 0.8 means two variables are highly correlated; 0.8 > |r| ≥ 0.5 means there is a moderate correlation; 0.5 > |r| ≥ 0.3 means two variables are weakly correlated; |r| < 0.3 means there is no correlation between two variables

                CLE%               PLE%               PSE%              Emphy%             CLE%+PLE%
  FEV1%         −0.791** (0.000)   −0.889** (0.000)   −0.061 (0.698)    −0.879** (0.000)   −0.922** (0.000)
  FEV1/FVC      −0.781** (0.000)   −0.805** (0.000)   −0.042 (0.790)    −0.814** (0.000)   −0.873** (0.000)
  PEF           −0.629** (0.000)   −0.775** (0.000)   −0.002 (0.988)    −0.748** (0.000)   −0.771 (0.000)
  FEF25         −0.638** (0.000)   −0.762** (0.000)    0.094 (0.556)    −0.763** (0.000)   −0.794** (0.000)
  FEF50         −0.672** (0.000)   −0.866** (0.000)    0.140 (0.371)    −0.740** (0.000)   −0.806** (0.000)
  FEF75         −0.663** (0.000)   −0.849** (0.000)    0.096 (0.540)    −0.716** (0.000)   −0.785** (0.000)
  MVV           −0.666** (0.000)   −0.852** (0.000)   −0.047 (0.766)    −0.788** (0.000)   −0.796** (0.000)
  FEV1/VCmax    −0.796** (0.000)   −0.757** (0.000)   −0.059 (0.710)    −0.779** (0.000)   −0.851** (0.000)

Specifically, the correlation between PLE% and FEV1% achieves |r | = 0.889. We also find that there is no correlation between PSE% and any pulmonary function index listed in Table 9.3. It suggests that CLE and PLE are the main factors resulting in poor lung functions, especially PLE. PSE has almost no effect on lung functions. According to the literature [3], PSE is often not associated with significant symptoms or physiological impairments, which is in close agreement with our experimental results. This demonstrates the accuracy of our proposed method. Based on our quantitative analysis results, we combine CLE% with PLE% (sum of CLE% and PLE%) as a new and accurate measure of emphysema severity, in contrast to a common measure Emphy% (sum of all subtypes of emphysema) [36, 37]. As shown in Table 9.3, the new measure (CLE%+PLE%) correlates more strongly with pulmonary function indices than Emphy%, which suggests that our proposed measure may be a better indicator of the severity of emphysema.

9.4 Conclusion

In this work, we presented a novel deep learning method for the classification of pulmonary emphysema based on multi-scale deep convolutional neural networks. The results showed that (1) the multi-scale method was highly effective compared to the single-scale setting; (2) our approach achieved better performance than the state-of-the-art methods; and (3) the measured emphysema severity was in good agreement with various pulmonary function indices, achieving correlation coefficients of up to |r| = 0.922 in 50 subjects. Our approach may be readily extended to classify other types of lesions in various medical imaging applications that face the same challenges as our task.

Acknowledgements This work was supported in part by the Zhejiang Lab Program under Grant No. 2018DG0ZX01, in part by the Key Science and Technology Innovation Support Program of Hangzhou under Grant No. 20172011A038, and in part by a Grant-in-Aid for Scientific Research from the Japanese Ministry for Education, Science, Culture and Sports (MEXT) under Grant No. 18H03267 and No. 17H00754.

References 1. Mannino, D.M., Kiri, V.A.: Changing the burden of COPD mortality. Int. J. Chron. Obstruct. Pulmon. Dis. 1, 219–233 (2006) 2. Takahashi, M., Fukuoka, J., Nitta, N., Takazakura, R., Nagatani, Y., Murakami, Y., Murata, K.: Imaging of pulmonary emphysema: a pictorial review. Int. J. Chron. Obstruct. Pulmon. Dis. 3, 193–204 (2008) 3. Lynch, D.A., Austin, J.H., Hogg, J.C., Grenier, P.A., Kauczor, H.U., Bankier, A.A., Coxson, H.O.: CT-definable subtypes of chronic obstructive pulmonary disease: a statement of the Fleischner Society. Radiology 277, 192–205 (2015) 4. Smith, B.M., Austin, J.H., Newell Jr., J.D., D’Souza, B.M., Rozenshtein, A., Hoffman, E.A., Barr, R.G.: Pulmonary emphysema subtypes on computed tomography: the MESA COPD study. Am. J. Med. 127, 94–e7 (2014) 5. Shaker, S.B., von Wachenfeldt, K.A., Larsson, S., Mile, I., Persdotter, S., Dahlbäck, M., Fehniger, T.E.: Identification of patients with chronic obstructive pulmonary disease (COPD) by measurement of plasma biomarkers. Clin. Respir. J. 2, 17–25 (2008) 6. Yang, J., Angelini, E.D., Smith, B.M., Austin, J.H., Hoffman, E.A., Bluemke, D.A., Laine, A.F.: Explaining radiological emphysema subtypes with unsupervised texture prototypes: MESA COPD study. In: Medical Computer Vision and Bayesian and Graphical Models for Biomedical Imaging, pp. 69–80. Springer, Cham (2016) 7. Sorensen, L., Shaker, S.B., De Bruijne, M.: Quantitative analysis of pulmonary emphysema using local binary patterns. IEEE Trans. Med. Imaging 29, 559–569 (2010) 8. Binder, P., Batmanghelich, N.K., Estépar, R.S.J., Golland, P.: Unsupervised discovery of emphysema subtypes in a large clinical cohort. In: International Workshop on Machine Learning in Medical Imaging, pp. 180–187. Springer, Cham (2016) 9. Häme, Y., Angelini, E.D., Parikh, M.A., Smith, B.M., Hoffman, E.A., Barr, R.G., Laine, A.F.: Sparse sampling and unsupervised learning of lung texture patterns in pulmonary emphysema: MESA COPD study. In: Proceedings of the IEEE International Symposium on Biomedical Imaging, pp. 109–113 (2015) 10. Song, J., Yang, J., Smith, B., Balte, P., Hoffman, E.A., Barr, R.G., Angelini, E.D.: Generative method to discover emphysema subtypes with unsupervised learning using lung macroscopic patterns (LMPS): the MESA COPD study. In: Proceedings of the IEEE International Symposium on Biomedical Imaging. pp. 375–378 (2017) 11. Yang, J., Angelini, E.D., Balte, P.P., Hoffman, E.A., Austin, J.H., Smith, B.M., Laine, A.F.: Unsupervised discovery of spatially-informed lung texture patterns for pulmonary emphysema: the MESA COPD study. In: Proceedings of the MICCAI, pp. 116–124 (2017)


12. Mendoza, C.S., Washko, G.R., Ross, J.C., Diaz, A.A., Lynch, D.A., Crapo, J.D., Estépar, R.S.J.: Emphysema quantification in a multi-scanner HRCT cohort using local intensity distributions. In: Proceedings of the IEEE International Symposium on Biomedical Imaging, pp. 474–477 (2012) 13. Uppaluri, R., Mitsa, T., Sonka, M., Hoffman, E.A., McLennan, G.: Quantification of pulmonary emphysema from lung computed tomography images. Amer. J. Respir. Crit. Care Med. 156, 248–254 (1997) 14. Xu, Y., Sonka, M., McLennan, G., Guo, J., Hoffman, E.A.: MDCT-based 3D texture classification of emphysema and early smoking related lung pathologies. IEEE Trans. Med. Imaging 25, 464–475 (2006) 15. Park, Y.S., Seo, J.B., Kim, N., Chae, E.J., Oh, Y.M., Do Lee, S., Kang, S.H.: Texture-based quantification of pulmonary emphysema on high-resolution computed tomography: comparison with density-based quantification and correlation with pulmonary function. Invest. Radiol. 43, 395–402 (2008) 16. Prasad, M., Sowmya, A., Wilson, P.: Multi-level classification of emphysema in HRCT lung images. Pattern Anal. Appl. 12, 9–20 (2009) 17. Peng, L., Lin, L., Hu, H., Ling, X., Wang, D., Han, X., Chen, Y.W.: Joint weber-based rotation invariant uniform local ternary pattern for classification of pulmonary emphysema in CT images. In: Proceedigs of the International Conference on Image Processing, pp. 2050–2054 (2017) 18. Gangeh, M.J., Sørensen, L., Shaker, S.B., Kamel, M.S., De Bruijne, M., Loog, M.: A textonbased approach for the classfication of lung parenchyma in CT images. In: Proceedings of the MICCAI, pp. 595–602 (2010) 19. Asherov, M., Diamant, I., Greenspan, H.: Lung texture classification using bag of visual words. In: Proceedings of the SPIE Medical Imaging (2014) 20. Yang, J., Feng, X., Angelini, E.D., Laine, A.F.: Texton and sparse representation based texture classification of lung parenchyma in CT images. In: Proceedings of the EMBC, pp. 1276–1279 (2016) 21. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Sánchez, C.I.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017) 22. Dou, Q., Chen, H., Yu, L., Qin, J., Heng, P.A.: Multilevel contextual 3-d cnns for false positive reduction in pulmonary nodule detection. IEEE Trans. Biomed. Eng. 64, 1558–1567 (2017) 23. Setio, A.A.A., Ciompi, F., Litjens, G., Gerke, P., Jacobs, C., Van Riel, S.J., Van, G.: Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 35, 1160–1169 (2016) 24. Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A., Mougiakakou, S.: Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans. Med. Imaging 35, 1207–1216 (2016) 25. Wang, Q., Zheng, Y., Yang, G., Jin, W., Chen, X., Yin, Y.: Multiscale rotation-invariant convolutional neural networks for lung texture classification. IEEE J. Biomed. Health Inform. 1–1 (2017) 26. Hoo-Chang, S., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016) 27. Gao, M., Xu, Z., Lu, L., Harrison, A.P., Summers, R.M., Mollura, D.J.: Holistic interstitial lung disease detection using deep convolutional neural networks: multi-label learning and unordered pooling. arXiv preprint arXiv:1701.05616 (2017) 28. 
Karabulut, E.M., Ibrikci, T.: Emphysema discrimination from raw HRCT images by convolutional neural networks. In: Proceedings of the ELECO, pp. 705–708 (2015) 29. Pei, X: Emphysema classification using convolutional neural networks. In: Proceedings of the ICIRA, pp. 455–461 (2015) 30. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the CVPR, pp. 770–778 (2016)


31. Heaton, J.: Ian Goodfellow, Yoshua Bengio, and Aaron Courville: deep learning. In: Genetic Programming and Evolvable Machines, pp. 305–307 (2017) 32. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Represents, pp. 1–13 (2015) 33. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010) 34. Crapo, R.O. et al.: American thoracic society. Standardization of spirometry 1994 update. Am. J. Respir. Crit. Care Med. 152, 1107–1136 (1995) 35. Sverzellati, N., Cademartiri, F., Bravi, F., Martini, C., Gira, F.A., Maffei, E., Rossi, C.: Relationship and prognostic value of modified coronary artery calcium score, FEV1, and emphysema in lung cancer screening population: the MILD trial. Radiology 262, 460–467 (2012) 36. Ceresa, M., Bastarrika, G., de Torres, J.P., Montuenga, L.M., Zulueta, J.J., Ortiz-de-Solorzano, C., Muñoz-Barrutia, A.: Robust, standardized quantification of pulmonary emphysema in low dose CT exams. Acad. Radiol. 18, 1382–1390 (2011) 37. Hame, Y.T., Angelini, E.D., Hoffman, E.A., Barr, R.G., Laine, A.F.: Adaptive quantification and longitudinal analysis of pulmonary emphysema with a hidden markov measure field model. IEEE Trans. Med. Imaging 33, 1527–1540 (2014)

Chapter 10

Opacity Labeling of Diffuse Lung Diseases in CT Images Using Unsupervised and Semi-supervised Learning Shingo Mabu, Shoji Kido, Yasuhi Hirano and Takashi Kuremoto Abstract Research on computer-aided diagnosis (CAD) for medical images using machine learning has been actively conducted. However, machine learning, especially deep learning, requires a large number of training data with annotations. Deep learning often requires thousands of training data, but it is tough work for radiologists to give normal and abnormal labels to many images. In this research, aiming the efficient opacity annotation of diffuse lung diseases, unsupervised and semi-supervised opacity annotation algorithms are introduced. Unsupervised learning makes clusters of opacities based on the features of the images without using any opacity information, and semi-supervised learning efficiently uses the small number of training data with annotation for training classifiers. The performance evaluation is carried out by clustering or classification of six kinds of opacities of diffuse lung diseases in computed tomography (CT) images: consolidation, ground-glass opacity, honeycombing, emphysema, nodular and normal, and the effectiveness of the proposed methods is clarified.

10.1 Introduction Research on computer-aided diagnosis (CAD) has been actively conducted. Although radiologists diagnose medical images based on their own knowledge and experiences, CAD would aid in providing the second opinions to support the decisions of radiologists [12]. CAD for diffuse lung diseases dealt with in this chapter has been S. Mabu (B) · S. Kido · Y. Hirano · T. Kuremoto Yamaguchi Univeristy, Tokiwadai 2-16-1, Ube, Yamaguchi 755-8611, Japan e-mail: [email protected] S. Kido e-mail: [email protected] Y. Hirano e-mail: [email protected] T. Kuremoto e-mail: [email protected] © Springer Nature Switzerland AG 2020 Y.-W. Chen and L. C. Jain, Deep Learning in Healthcare, Intelligent Systems Reference Library 171, https://doi.org/10.1007/978-3-030-32606-7_10


also studied. For example, normal and abnormal opacities can be classified into six kinds of pulmonary textures using a bag-of-features approach [12]. In CAD research, deep learning [9] is attracting attention because of its automatic feature extraction ability and high accuracy. In [15], lung texture classification and airway detection using a convolutional restricted Boltzmann machine are proposed, showing that the combination of generative and discriminative learning gives better classification accuracy than either of them alone. However, deep learning based on supervised learning basically needs a large number of samples with correct annotations for training. Since it is hard work for radiologists to annotate thousands of images to build training samples, unsupervised and semi-supervised learning are practical and useful ways to reduce the cost of annotation. Unsupervised learning trains classifiers without using annotated samples, and semi-supervised learning uses only a small number of annotated samples [3]. Moreover, unsupervised and semi-supervised learning can be used to build annotated databases for supervised learning, which contributes to enhancing the applicability of deep learning. The authors previously proposed an unsupervised learning algorithm for opacity annotation using evolutionary data mining [11] that does not need any correct annotations for learning. However, there was still large room for improvement in the annotation accuracy. Therefore, this chapter introduces deep-learning-based unsupervised learning and semi-supervised learning to enhance the clustering or classification accuracy. The aim of the unsupervised learning in this chapter is to build a feature representation for distinguishing different opacities, and the aim of the semi-supervised learning is to improve learning efficiency, where only a small number of training samples are used to train a classifier. The annotations given to the testing samples by semi-supervised learning are also used as new training samples to retrain and improve the classifier. The proposed unsupervised learning method consists of three steps: (1) feature extraction using a deep autoencoder (DAE) [9], (2) histogram representation of the CT data using bag-of-features [4], and (3) k-means clustering [7] to make groups of opacities. There are many conventional feature extraction methods in image processing, such as the scale-invariant feature transform (SIFT) [10], speeded-up robust features (SURF) [1], and histograms of oriented gradients (HOG) [5]. However, the best features for classification depend on the problem. Therefore, step (1) of the proposed method uses a DAE to automatically extract important features to represent opacities. Note that the DAE is trained by unsupervised learning, which is suitable for the problem of this chapter. In step (2), a bag-of-features method is used to represent each opacity as a combination of many local features, which does not need annotated samples either. In step (3), k-means clustering is applied to make groups of opacities that correspond to class labels such as consolidation, ground-glass opacity, etc. In the feature extraction from CT images of diffuse lung diseases, texture analysis is one of the important processes because various types of opacity patterns exist in the images. Therefore, the proposed method combines the DAE and bag-of-features to extract texture features in small regions of interest (ROIs). The proposed semi-supervised learning also consists of three steps.
Steps (1) and (2) are the same as in the unsupervised learning, but in step (3), iterative semi-supervised learning using a support vector machine (SVM) [2] is proposed. An SVM classifier


is trained using only a small number of training samples, and after the trained SVM gives annotations to the other samples (the testing samples), some annotated testing samples with high confidence are used as new training samples, which is called self-training. In addition, the proposed method asks the user to give annotations to a very small number of testing samples with low confidence, which is called active learning [14]. The proposed method gradually improves the classifier by iterating self-training and active learning.

10.2 Materials and Methods

Figure 10.1 shows the flow of the proposed methods. We used 406 cases of lung CT images taken at Yamaguchi University Hospital, Japan (SOMATOM Sensation 64, SIEMENS). The CT images were divided into non-overlapping 32 × 32 pixel region-of-interest (ROI) images (0.65 mm/pixel). In this research, one expert radiologist manually segmented the CT images, delineating the regions of normal opacity and five kinds of abnormal opacities, and the segmentation images were regarded as the ground truth. Some sample ROI images of normal and the five kinds of abnormal opacities (consolidation, ground-glass opacity (GGO), honeycombing, emphysema, nodular) are shown in Fig. 10.2. The aim of the proposed methods is to classify each ROI and assign a correct opacity label. In the clustering or classification process, however, each ROI is further divided into 8 × 8 pixel patch images in order to analyze the texture patterns in small local regions (Fig. 10.3), and the classification of each ROI is executed by combining the features of the 8 × 8 pixel patches; a small sketch of this division is given below.
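The following is an illustrative sketch (not the authors' code) of the division into non-overlapping 32 × 32 ROIs and 8 × 8 patches; whether the patches overlap is not stated in the text, so non-overlapping blocks are assumed here.

```python
import numpy as np

def split_into_blocks(image, block):
    # image: 2-D numpy array; block: edge length of the square blocks
    h, w = image.shape
    blocks = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            blocks.append(image[y:y + block, x:x + block])
    return blocks

# rois = split_into_blocks(ct_slice, 32)                            # 32x32 ROIs
# patches = [p.reshape(-1) for p in split_into_blocks(rois[0], 8)]  # 64-dim patch vectors
```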

10.2.1 Unsupervised Learning

First, feature extraction is executed by the DAE using all the 8 × 8 pixel patches as inputs. The DAE has an encoding function that maps the input data to feature values, and a decoding function that reconstructs the input data from those feature values. When a patch with 64 values (pixels) is inputted to the DAE designed in this chapter, 12 feature values (a feature vector) are obtained by the encoding function. Figure 10.4 shows the structure of the 13-layer DAE used in this chapter. The number of inputs is 64, corresponding to the number of pixels of each patch, and the number of outputs is also 64. The DAE is trained so that the original image given to the first layer is reconstructed in the output layer. Since the number of units in each middle layer is smaller than the number of units in the input layer (64), the DAE has to generate effective values in the middle layers to reconstruct the images in the output layer. The feature vectors used in the next bag-of-features step are the values obtained in the seventh layer with 12 units; therefore, the original 64 input values are compressed to 12 feature values. The activation function is the rectified linear unit (ReLU) [13], and the weights of the DAE are learned by stochastic gradient descent with Adam (adaptive moment estimation) [8], which adapts the learning rate online. A minimal sketch of such a DAE is given below.
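As a rough illustration, the sketch below builds a 13-layer autoencoder in PyTorch with a 64-dimensional input/output, a 12-unit seventh (bottleneck) layer, ReLU activations, and the Adam optimizer, as described above; the intermediate layer widths and the learning rate are assumptions, since the chapter does not specify them.

```python
import torch
import torch.nn as nn

# Only the 64-dim input and the 12-unit bottleneck (7th layer) are specified
# in the chapter; the intermediate widths below are guesses.
widths = [64, 48, 40, 32, 24, 16, 12]

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    return nn.Sequential(*layers)

encoder = mlp(widths)                     # 64 -> ... -> 12 (bottleneck, 7th layer)
decoder = mlp(list(reversed(widths)))     # 12 -> ... -> 64 (reconstruction)
dae = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)
criterion = nn.MSELoss()

def train_step(patches):                  # patches: (N, 64) float tensor
    optimizer.zero_grad()
    loss = criterion(dae(patches), patches)   # reconstruct the input patches
    loss.backward()
    optimizer.step()
    return loss.item()

def encode(patches):                      # 12-dim features for the BoF step
    with torch.no_grad():
        return encoder(patches)
```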


Fig. 10.1 Flow of the proposed unsupervised and semi-supervised learning


Fig. 10.2 Samples of 32 × 32 images of normal/abnormal opacities

Fig. 10.3 Patch image generation

Fig. 10.4 Structure of deep autoencoder

Second, a bag-of-features method is applied to represent each ROI as a histogram of the occurrence frequencies of keypoints, as shown in Fig. 10.5. In the feature extraction by the DAE, each patch is represented by 12 feature values, i.e., a feature vector. Then, k-means clustering is applied to these feature vectors to make clusters whose centroids serve as the keypoints in the feature space. Figure 10.6 shows a simple example of the clusters generated by k-means, where patch p is assigned to the nearest cluster, i.e., the cluster c whose centroid minimizes the Euclidean distance to patch p. The cluster to which patch p is assigned is determined by


Fig. 10.5 Histogram representation of an ROI

Fig. 10.6 Example of clusters and keypoints generated by k-means in bag-of-features

\mathrm{Cluster\_patch}(p) = \arg\min_{c} \sqrt{\sum_{i=1}^{12} \bigl(v_p(i) - v_c(i)\bigr)^2}, \qquad (10.1)

where v_p(i) denotes the i-th element of the feature vector of patch p, and v_c(i) denotes that of the centroid of cluster c. In the experiment, the number of clusters (keypoints) was set to 1024, which gave a good balance between the calculation time and the accuracy of the subsequent clustering. After generating the clusters, each ROI is represented by a histogram of the occurrence frequencies of the keypoints, as shown in Fig. 10.5; in other words, the occurrence frequency is the number of patches belonging to each cluster. Finally, k-means clustering is applied to the histograms of the ROIs to make groups of ROIs, where ROIs with similar histogram patterns are assigned to the same cluster. Note that the k-means clustering in this step aims at ROI clustering, while that in the previous step aims at keypoint generation. The cluster to which ROI r is assigned is determined by


\mathrm{Cluster\_ROI}(r) = \arg\min_{c} \sqrt{\sum_{i=1}^{1024} \bigl(h_r(i) - h_c(i)\bigr)^2}, \qquad (10.2)

where h_r(i) denotes the i-th element of the histogram of ROI r, and h_c(i) denotes that of the centroid of cluster c.
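The following sketch illustrates steps (2) and (3) with scikit-learn's k-means; the function names are illustrative, and the patch feature arrays are assumed to come from the DAE encoder of the earlier sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_keypoints(all_patch_features, n_keypoints=1024):
    # all_patch_features: (num_patches, 12) DAE features of every 8x8 patch
    return KMeans(n_clusters=n_keypoints, random_state=0).fit(all_patch_features)

def roi_histogram(roi_patch_features, keypoint_km):
    # Assign each patch of one ROI to its nearest keypoint (Eq. 10.1) and count.
    assignments = keypoint_km.predict(roi_patch_features)
    return np.bincount(assignments, minlength=keypoint_km.n_clusters).astype(float)

def cluster_rois(roi_histograms, n_clusters=64):
    # Step (3): group ROIs with similar histograms (Eq. 10.2).
    return KMeans(n_clusters=n_clusters, random_state=0).fit_predict(roi_histograms)
```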

10.2.2 Semi-supervised Learning

The semi-supervised learning and the unsupervised learning proposed in this chapter use the same methods in steps (1) and (2), as shown in Fig. 10.1. However, step (3) of the semi-supervised learning differs from the unsupervised learning: the following iterative semi-supervised learning is executed (Fig. 10.7). First, the initial training set D_train is prepared. Here, 1% of all the samples (185 samples (ROIs)) are used as the initial training samples and added to D_train. Second, an SVM is trained using D_train, and the trained SVM annotates the testing samples (18385 samples) in the testing set D_test. The SVM can also calculate the class membership probability of each testing sample [16]; the testing samples with a class membership probability above 99.0% are regarded as new training samples with correct annotations (self-training) and added to D_train. Next, the entropy of the classification result of each testing sample d ∈ D_test is calculated by



H(d) = \sum_{c \in C} p(c|d) \log \frac{1}{p(c|d)}, \qquad (10.3)

Fig. 10.7 Training phase of semi-supervised learning


where C is the set of class labels and p(c|d) is the class membership probability of sample d for class c. High entropy indicates low confidence in the annotation and low entropy indicates high confidence; therefore, some of the testing samples with low confidence are selected for active learning (human annotation). In this chapter, the top 10% of low-confidence samples are first selected from D_test, and then 18 samples (0.1% of all samples) are randomly selected from these low-confidence samples for active learning. The samples annotated by active learning are added to D_train. We did not directly select the top 0.1% of the low-confidence samples but selected randomly from the lowest 10% because the variation of the training samples is important for building a robust classifier; if only the top 0.1% were selected for active learning, the selected samples might have similar features that would not contribute to accelerating the learning. The important point of this method is to efficiently provide annotations to build training samples during the learning phase; therefore, we did not prepare many annotated training samples before the training. By iterating the above process, the number of training samples increases. Of course, in the performance evaluation, the ROIs newly annotated by active learning are excluded from the testing set so that the classification accuracy is evaluated using only the samples without human annotation. A sketch of one iteration of this loop is given below.
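The sketch below outlines one iteration of the described self-training and active-learning loop using scikit-learn's SVC with probability estimates; the helper name, the oracle array standing in for human annotation, and the random selection details are illustrative assumptions, while the 99.0% threshold, the 10% pool, and the 18 queried samples follow the text.

```python
import numpy as np
from sklearn.svm import SVC

def semi_supervised_iteration(X_train, y_train, X_test, y_test_oracle,
                              prob_thresh=0.99, pool_frac=0.10, n_query=18):
    clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)
    probs = clf.predict_proba(X_test)                 # class membership probabilities [16]

    # Self-training: fix the annotations of samples with >99% confidence.
    conf = probs.max(axis=1)
    self_idx = np.where(conf > prob_thresh)[0]
    self_labels = clf.classes_[probs[self_idx].argmax(axis=1)]

    # Active learning: entropy (Eq. 10.3) as the uncertainty measure.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    pool = np.argsort(entropy)[-int(len(X_test) * pool_frac):]   # lowest-confidence 10%
    pool = np.setdiff1d(pool, self_idx)
    query_idx = np.random.choice(pool, size=n_query, replace=False)
    query_labels = y_test_oracle[query_idx]           # annotations given by a human

    # Move the newly annotated samples from the testing set to the training set.
    new_idx = np.concatenate([self_idx, query_idx])
    new_lab = np.concatenate([self_labels, query_labels])
    X_train = np.vstack([X_train, X_test[new_idx]])
    y_train = np.concatenate([y_train, new_lab])
    keep = np.setdiff1d(np.arange(len(X_test)), new_idx)
    return clf, X_train, y_train, X_test[keep], y_test_oracle[keep]
```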

10.3 Results

10.3.1 Results of Unsupervised Learning

The clustering accuracy was evaluated using 10094 ROIs extracted from the CT images. In this experiment, the number of clusters was set to 64. The reason the number of clusters was set to 64, which is larger than the original number of classes (six), is that various opacity patterns exist even within the same type of opacity, and a sufficient number of clusters is needed to distinguish the various patterns clearly. After executing the proposed method, a clustering accuracy of 72.8% was obtained, and the 64 generated clusters comprised 23 NOR, five CON, 17 GGO, 18 EMP, one HCM, and zero NOD clusters. The clustering accuracy for each opacity is shown in Table 10.1: the clustering for consolidation and emphysema shows more than 80% accuracy, while that for honeycombing is 53.5%, and no nodular cluster was generated. In fact, most of the nodular ROIs were included in normal clusters because it is difficult to distinguish nodular ROIs based only on local texture information. We executed two other methods for comparison: one without the DAE (using only k-means and bag-of-features) and one using k-means clustering with HOG features, which showed clustering accuracies of 69.5% and 36.6%, respectively. Therefore, the effectiveness of the combination of the DAE and bag-of-features is confirmed. Figure 10.8 shows examples of the generated clusters.

Table 10.1 Clustering accuracy of each kind of opacity

Class label       Accuracy (%)
Normal            63.1
Consolidation     83.7
GGO               78.5
Honeycombing      53.5
Emphysema         84.4
Nodular           –
Total             72.8

Fig. 10.8 Example of generated clusters


Figure 10.8a is defined as a normal cluster because the number of normal ROIs is the largest among all the kinds of opacities. We can see many normal ROIs in this cluster, but some ROIs belong to other kinds of opacities. Figure 10.8b is a consolidation cluster; consolidation has a very clear opacity pattern, so other types of opacities are not contained in this cluster. Figure 10.8c is a honeycombing cluster, where only a part of the ROIs belonging to this cluster is shown due to space limitations, and the values in the parentheses show the total number of ROIs assigned to this cluster. We can see that 242 honeycombing ROIs are assigned to this cluster; however, eight normal, three consolidation, 135 GGO, 37 emphysema, and 27 nodular ROIs are also assigned to it. From this result, it is found that the clustering of honeycombing is more difficult than that of the other opacities, i.e., normal, consolidation, GGO, and emphysema; nevertheless, the texture patterns of the ROIs in the honeycombing cluster are very similar. Therefore, the feature extraction basically works, but an enhancement of the feature extraction is necessary to emphasize the differences between the opacities more clearly.

10.3.2 Results of Iterative Semi-supervised Learning

The classification accuracy of the proposed semi-supervised learning method was compared to that of the method without semi-supervised learning (called the conventional method). Figure 10.9 shows the improvement of the classification accuracy obtained by the proposed and conventional methods. Both methods used the same feature extraction based on the DAE and bag-of-features; after the feature extraction, the proposed method used the SVM with iterative semi-supervised learning, whereas the conventional method used the SVM without semi-supervised learning, that is, the training data were given randomly. Note that the number of training data with correct annotations was the same for the two methods. At the first iteration, 1% (185) of the samples were given with annotations as the training samples. Then, 0.1% (18) of the samples were added with annotations to the training set at every iteration; that is, the number of annotated training samples at iteration i was 18 × (i − 1) + 185.

Fig. 10.9 Improvement of the classification accuracy (classification accuracy versus iteration for the proposed and conventional methods)


Table 10.2 Classification result at the 20th iteration (number of training samples = 527 (2.8%))

Actual class \ Predicted class   NOR    CON    GGO    HCM    EMP    NOD    Total   Recall (%)
Normal (NOR)                    2509     10     31     36     11    392    2989    83.9
Consolidation (CON)                1   2942     15      0     86      3    3047    96.6
Ground-glass opacity (GGO)       170     59   2195      4    337    245    3010    72.9
Honeycombing (HCM)               454      2     15   2151     79    303    3004    71.6
Emphysema (EMP)                    2    112    120     45   2624    107    3010    87.1
Nodular (NOD)                    720      2    135     43     68   2015    2983    67.5
Total                           3856   3127   2511   2279   3205   3065   18043
Precision (%)                   65.1   94.1   87.4   94.4   81.9   65.7

For example, at the 20th iteration (when 2.8% (527) of the samples were in the training set with annotations), the conventional method showed 78.7% accuracy, while the proposed method showed 80.0% accuracy. At the 507th iteration (when 50% (9293) of the samples were in the training set with annotations), the accuracy of the conventional method was 85.9%, and that of the proposed method was 98.5%.

Table 10.2 shows the classification results (confusion matrix) at the 20th iteration of the proposed method. The number of training samples at the 20th iteration was 527 (2.8%), and the classification accuracy was 80.0%. Each row corresponds to the actual class, and each column corresponds to the predicted class. For example, in the column of Normal (NOR), we can see NOR = 2509, Consolidation (CON) = 1, GGO = 170, Honeycombing (HCM) = 454, Emphysema (EMP) = 2, and Nodular (NOD) = 720. This means that 2509 normal, one consolidation, 170 GGO, 454 honeycombing, two emphysema, and 720 nodular samples are classified as normal. Thus, the precision of NOR is 65.1% (= 2509/(2509 + 1 + 170 + 454 + 2 + 720)). In the row of NOR, we can see the numbers of NOR samples classified as NOR, CON, GGO, HCM, EMP, and NOD, respectively, and the recall of NOR is calculated as 83.9% (= 2509/(2509 + 10 + 31 + 36 + 11 + 392)). A small helper that computes these values from the confusion matrix is sketched below.

Table 10.3 shows the classification results at the 156th iteration, when the proposed method achieved 90.0% accuracy. We can see that most of the precision and recall values are improved. Table 10.4 shows the classification results with 98.5% accuracy at the 507th iteration. The precision of all the opacities is very high (95.1–99.7%), and the recall of all opacities except NOR is also very high (96.3–99.5%). The recall of NOR is 70.7%, which is lower than the recall at the 259th iteration; the reason for this result is discussed later.

Figure 10.10 shows the ratio of the accumulated number of samples whose annotations were fixed at each iteration. In the proposed method, the annotations were fixed when they were given by self-training or active learning. The self-training of the SVM selected samples with a class membership probability above 99.0% to fix their annotations and moved them to the training set. The fixed annotations were not changed in later iterations.
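For illustration only, here is a small generic helper (not the authors' code) that computes per-class precision and recall from such a confusion matrix, with rows as actual classes and columns as predicted classes.

```python
import numpy as np

def precision_recall(confusion):
    # confusion[i, j] = number of samples of actual class i predicted as class j
    confusion = np.asarray(confusion, dtype=float)
    precision = np.diag(confusion) / confusion.sum(axis=0)   # column-wise
    recall = np.diag(confusion) / confusion.sum(axis=1)      # row-wise
    return precision, recall

# Example with the NOR figures of Table 10.2:
# precision_NOR = 2509 / 3856 = 0.651, recall_NOR = 2509 / 2989 = 0.839
```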


Table 10.3 Classification result at the 156th iteration (number of training samples = 2975 (16.0%))

Actual class \ Predicted class   NOR    CON    GGO    HCM    EMP    NOD    Total   Recall (%)
Normal (NOR)                    2163      0     35     54      2     25    2279    94.9
Consolidation (CON)                0   2950      5      0     36      0    2991    98.6
Ground-glass opacity (GGO)        69     32   2258      7    209     37    2612    86.4
Honeycombing (HCM)               175      0     18   2352     57     21    2623    89.7
Emphysema (EMP)                    0     70     85     29   2549     30    2763    89.7
Nodular (NOD)                    380      0    109     69     20   1749    2327    75.2
Total                           2787   3052   2510   2511   2873   1862   15595
Precision (%)                   77.6   96.7   90.0   93.7   88.7   93.9





Table 10.4 Classification result at the 507th iteration (number of training samples = 9293 (50.0%))

Actual class \ Predicted class   NOR    CON    GGO    HCM    EMP    NOD    Total   Recall (%)
Normal (NOR)                      58      0     10     14      0      0      82    70.7
Consolidation (CON)                0   2821      1      0     17      0    2839    99.4
Ground-glass opacity (GGO)         0      6   1404      1      6      0    1417    99.1
Honeycombing (HCM)                 0      0      1   1941      8      0    1950    99.5
Emphysema (EMP)                    0     20      5      9   1816      3    1853    98.0
Nodular (NOD)                      3      0     25     14      0   1094    1136    96.3
Total                             61   2847   1446   1979   1847   1097    9277
Precision (%)                   95.1   99.1   97.1   98.1   98.3   99.7





Fig. 10.10 The accumulated ratio of samples whose annotations have been fixed, versus iteration (proposed method: annotation by human and SVM self-training; conventional method: annotation by human only)


From Fig. 10.10, we can see that the number of annotated samples of the proposed method increases as the iterations proceed, and the ratio reaches 99.95% at the 682nd iteration; that is, the annotations of most of the samples are completed. The line of the conventional method shows the ratio of annotated samples when 18 samples are annotated by a human at every iteration, and the ratio at the 682nd iteration is 67.01%. The difference in the ratios between the proposed and conventional methods (99.95 − 67.01 = 32.94%) is obtained by the self-training with the learned knowledge, not by human annotation. Note that both methods received the same number of human-annotated samples at every iteration. These results show that the automatic annotation of the proposed method accelerates the opacity labeling and reduces the cost of human annotation.

Several issues concerning the proposed semi-supervised learning are discussed below. First, the advantage of the proposed method is explained as follows. As shown in Fig. 10.9, the classification accuracy of the proposed method is better than that of the conventional method. This is effective in actual use because only a small number of training samples needs to be provided at first, and the proposed annotation system efficiently becomes smarter as the self-training and active learning are iterated. If a sufficient number of training samples is available, standard supervised learning should be used to obtain high accuracy, whereas the proposed method is suitable for situations where we cannot prepare a large number of training samples or cannot spare the time to make annotations, e.g., at local clinics.

Second, the confusion matrices shown in Tables 10.2, 10.3, and 10.4 are discussed. In Table 10.2, the precision of CON and HCM are 94.1% and 94.4%, respectively, which are the first and second highest values among the six opacities. The precision of NOR (65.1%) and NOD (65.7%) are the first and second lowest values, respectively, and many misclassifications between NOR and NOD are found. For example, in the column of NOR, 720 of the 3856 samples predicted as NOR are actually NOD, which shows the difficulty of classification between NOR and NOD. In Table 10.3, the precision of CON, GGO, HCM, and NOD are more than 90.0%, that of EMP is 88.7%, and that of NOR is 77.6%. Compared to Table 10.2, the precision of NOR and NOD are improved because of the additional training samples. The tendency of precision and recall is almost the same between Tables 10.2 and 10.3. In Table 10.4, NOR shows 95.1% precision, which is much better than in Tables 10.2 and 10.3. This improvement is obtained because the number of NOR training samples increased and the misclassifications between NOR and NOD decreased. Table 10.4 also shows that the number of testing NOR samples is only 82, which means that the other 3013 samples had already been annotated and fixed by active learning. Because the samples to be annotated by active learning are selected based only on the confidence, many NOR samples, whose confidence was lower than that of the other opacities, were mainly selected for active learning in this case. In other words, most of the NOR samples were around the decision boundaries of the classes, and thus active learning was applied mainly to NOR. As a result, the precision of NOR and NOD improved to 95.1% and 99.7%, respectively. On the other hand, the recall of NOR in Table 10.4 is worse than in Tables 10.2 and 10.3 because some NOR samples that are difficult to classify remained in the testing set.
Such a situation occurs when the proposed method misclassifies some NOR images as other kinds of opacities with relatively high confidence, so that these NOR images are not selected for active learning. Therefore, the remaining problem


Fig. 10.11 Example of ROIs classified as nodular


is to enhance the feature extraction mechanism so that NOR can be classified accurately with a limited number of training samples.

Third, the misclassification of NOD is analyzed. Figure 10.11 shows an example of ROIs classified as nodular; the squares around some ROIs indicate misclassification examples. Figure 10.11 contains seven misclassified ROIs; however, their texture patterns are very similar to those of the correctly classified nodular ROIs. According to some expert radiologists, it is difficult even for radiologists to make annotations based only on the ROI images. Radiologists diagnose opacities based not only on the local images but also on the whole scan, that is, they consider the context. Therefore, in future research, it is necessary to combine the information of the ROIs and their surrounding areas to make annotations.

This study has the following limitations. First, the proposed method is suitable for cases where only a small amount of annotated data can be prepared initially and the aim is to increase the annotated data. Therefore, if a sufficient amount of training data can be prepared, state-of-the-art deep learning techniques would show better classification accuracy. Second, we used one set of CT images for the performance evaluation. To verify the general annotation ability of the proposed method, it is necessary to evaluate it using other datasets, e.g., [6].

10.4 Conclusions

In this chapter, unsupervised and semi-supervised learning methods for annotating opacities of diffuse lung diseases were proposed. The objective of the proposed methods is to reduce the cost of making annotations.


From the results, it was clarified that the clustering and classification accuracy of the proposed methods are better than those of the conventional methods. In the future, the proposed method will be enhanced by using transfer learning or by applying multi-channel autoencoders that consider the information of the ROIs and their surrounding regions.

Acknowledgements This work was financially supported by the JSPS Grant-in-Aid for Scientific Research on Innovative Areas, Multidisciplinary Computational Anatomy, JSPS KAKENHI Grant Number 26108009, and the JSPS Grant-in-Aid for Scientific Research for Young Scientists (B), JSPS KAKENHI Grant Number 16K16116.

References

1. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
2. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)
3. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press, Cambridge, MA (2006)
4. Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, vol. 1, pp. 1–2. Prague (2004)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, pp. 886–893. IEEE (2005)
6. Depeursinge, A., Vargas, A., Platon, A., Geissbuhler, A., Poletti, P.A., Müller, H.: Building a reference multimedia database for interstitial lung diseases. Comput. Med. Imaging Graph. 36(3), 227–238 (2012)
7. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam (2011)
8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. LeCun, Y., Bengio, Y., Hinton, G.E.: Deep learning. Nature 521, 436–444 (2015)
10. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
11. Mabu, S., Obayashi, M., Kuremoto, T., Hashimoto, N., Hirano, Y., Kido, S.: Unsupervised class labeling of diffuse lung diseases using frequent attribute patterns. Int. J. Comput. Assist. Radiol. Surg. 12(3), 519–528 (2017)
12. Moghbel, M., Mashohor, S.: A review of computer assisted detection/diagnosis (CAD) in breast thermography for breast cancer detection. Artif. Intell. Rev. 39(4), 305–313 (2013)
13. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010)
14. Settles, B.: Active learning literature survey. Univ. Wis., Madison 52(55–66), 11 (2010)
15. van Tulder, G., de Bruijne, M.: Combining generative and discriminative representation learning for lung CT analysis with convolutional restricted Boltzmann machines. IEEE Trans. Med. Imaging 35(5), 1262–1272 (2016)
16. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res. 5, 975–1005 (2004)

Chapter 11

Residual Sparse Autoencoders for Unsupervised Feature Learning and Its Application to HEp-2 Cell Staining Pattern Recognition

Xian-Hua Han and Yen-Wei Chen

Abstract Self-taught learning aims at obtaining compact and latent representations from the data themselves, without manual labeling in advance, which would be time-consuming and laborious. This study proposes a novel self-taught learning method for more accurately reconstructing the raw data based on the sparse autoencoder. It is well known that an autoencoder is able to learn latent features by setting the target values to be equal to the input data, and that autoencoders can be stacked to pursue high-level feature learning. Motivated by the natural sparsity of data representation, sparsity has been imposed on the hidden-layer responses of the autoencoder for more effective feature learning. Although conventional autoencoder-based feature learning aims at obtaining the latent representation by minimizing the reconstruction error of the input data, a reconstruction residual error is unavoidably produced, and thus some tiny structures cannot be represented, although they may be essential information for fine-grained image tasks such as medical image analysis. Even with multiple-layer stacking for high-level feature pursuit in the autoencoder-based learning strategy, the tiny structures lost in the former layers can never be recovered. Therefore, this study proposes a residual sparse autoencoder for learning the latent feature representation of more tiny structures in the raw input data. With the unavoidably generated reconstruction residual error, we exploit another sparse autoencoder to pursue the latent features of the residual tiny structures, and this self-taught learning process can continue until the representation residual error is sufficiently small. We evaluate the proposed residual sparse autoencoding for self-taught learning of the latent representations of HEp-2 cell images, and show that promising performance for staining pattern recognition can be achieved compared with the conventional sparse autoencoder and state-of-the-art methods.

X.-H. Han (B) Graduate School of Science and Technology for Innovation, Yamaguchi University, 1677-1, Yoshida, Yamaguchi City, Yamaguchi 753-8511, Japan e-mail: [email protected]

Y.-W. Chen Ritsumeikan University, 1-1-1, NojiHigashi, Kusatsu, Shiga 525-8577, Japan e-mail: [email protected]; College of Science and Technology, ZheJiang University, Hangzhou, China

© Springer Nature Switzerland AG 2020 Y.-W. Chen and L. C. Jain, Deep Learning in Healthcare, Intelligent Systems Reference Library 171, https://doi.org/10.1007/978-3-030-32606-7_11


11.1 Introduction

Medical image analysis plays an important role in assisting medical experts to understand the internal organs of the human body and to recognize the characterization of different tissues. Unlike the frequently used generic images photographed in the real world, whose features are often well defined [1, 2] and familiar to us, medical data are hard to distinguish, and it is hard to define their characterization for specific fine-grained tasks, since visibility of even very tiny structures is required to provide acceptable performance. Furthermore, in contrast to the generally sufficient number of images with ground truth provided for training generalized machine learning models in generic image vision applications, medical images are more difficult to collect, especially patient data with abnormalities arising from disease, and the manual labels require substantially more specialist knowledge to define and are time-consuming to produce, which strongly motivates the extensive research on automated methods that use unlabeled data, generally called unsupervised learning.

Unsupervised learning, ranging from conventional methods such as principal component analysis and sparse coding to neural network-based methods, extracts hidden and compact features from unlabeled training data. Recently, neural network-based unsupervised approaches have shown impressive performance for learning latent features in different vision applications [3–6], and they mainly fall into two categories: data distribution approximating models such as the restricted Boltzmann machine (RBM) [3, 4], and reconstruction error minimization strategies (self-taught learning) such as the autoencoder [5, 6]. The RBM aims at estimating the entropy of candidate features consistent with the data to infer the hidden features, and has been applied to a wide range of vision problems. The autoencoder (AE) is able to learn latent features by setting the target values to be equal to the input data, and is formulated to minimize the reconstruction error of the input data. Motivated by the natural sparsity of data representation, sparsity has been imposed on the hidden-layer responses of the AE for more effective feature learning, which is called a sparse autoencoder (SAE). A recent line of work on neural network-based unsupervised learning stacks several layers to build a much deeper framework pursuing high-level latent features, which has been validated to further improve performance in several applications. In addition, unsupervised learning can be used as a pre-training step for subsequent supervised learning in deep networks. Thus, for more effectively using the pre-training knowledge, understanding unsupervised learning is of fundamental importance.

This study aims at exploring a novel unsupervised learning (self-taught learning) framework for medical image analysis. The target data in medical image processing tasks are usually a specific organ or regions of interest from different patients, which are generally called fine-grained tasks in generic image vision problems, and only a slight difference distinguishes abnormal from normal tissue, compared to the distinct differences between objects in generic vision problems. How to learn latent and compact features for medical data representation without loss of tiny structures, which are likely useful to the specific medical task, is an essential issue for fine-grained medical tasks. Although the conventional


AE-based feature learning aims at obtaining the latent representation by minimizing the reconstruction error of the input data, a reconstruction residual error of the input data is unavoidably produced, and thus some tiny structures cannot be represented, although they may be essential information for fine-grained medical image tasks. Even with multiple-layer stacking for high-level feature pursuit in the AE-based learning strategy, the tiny structures lost in the former layers can never be recovered. Therefore, this chapter introduces a residual SAE for learning the latent feature representation of more tiny structures in the raw input data. With the unavoidably generated reconstruction residual error, we exploit a further SAE to pursue the latent features of the residual tiny structures, and this self-taught learning process can continue until the representation residual error is sufficiently small. We evaluate the proposed residual SAE for extracting the latent features of HEp-2 cell images [7], and visualize the activation maps of some sparse neurons to understand the learned latent features. In order to generate same-dimensional features for the HEp-2 cell representation, we divide the activation maps of the SAE into donut-shaped spatial regions and aggregate the activations in each region as a mean value to form a representation vector of HEp-2 images. Experimental results for HEp-2 staining pattern recognition with the features learned by the proposed residual SAE show promising performance compared with the conventional SAE and the state-of-the-art methods.

The chapter is organized as follows. Section 11.2 describes the related work, which includes different unsupervised learning methods and the HEp-2 cell classification research explored so far. Section 11.3 introduces a basic neural network-based unsupervised approach, the autoencoder, and its sparsity-constrained extension, the sparse autoencoder. The proposed residual SAE, which can learn the latent feature representation of more tiny structures in the raw input data, is described in Sect. 11.4, and the aggregation of the learned features into a fixed-size vector for image representation follows in Sect. 11.5. The medical context used in our experiments and the experimental results are given in Sect. 11.6, and the summary of this chapter is provided in Sect. 11.7.

11.2 Related Work

Unsupervised Learning: Unsupervised learning as a machine learning technique has been widely explored in applications ranging from computer vision and social media services to medical image analysis, because it does not require labeled data, which is laborious and time-consuming to prepare, and a lot of algorithms and approaches have been proposed [8–31]. The developed unsupervised learning techniques can be mainly divided into three categories: clustering-based, data compression-based, and neural network-based approaches. The most common and simplest clustering algorithm is K-means clustering, which, given a predefined cluster number K for the target database, iteratively calculates the K centers and assigns each data point to the cluster with the closest centroid. For automatically selecting the cluster


number from the given data samples and improving the stability of the calculated centroids, some extensions such as X-means have been developed. K-means and its extensions usually assign a definite cluster to each data item, which may be unrealistic in real applications, where it would be more reasonable to assign data items lying on the border between centroids to more than one cluster. In addition, the clusters in the K-means algorithm are described only by the mean of the data items in the cluster, which cannot provide a complete description of what the items in a cluster are like. In order to solve the above-mentioned issues, the mixture model has been proposed, which models the data as coming from a mixture distribution, with mixture components corresponding to clusters [9, 11]. The most widely used distribution for each mixture component is the Gaussian function, giving the Gaussian mixture model (GMM). As in the K-means algorithm, the implementation of the GMM involves two steps: (1) attribute the probability of individual observations (or weights towards the sub-components) to the postulated sub-components (the parameters describing the sub-components) of the mixture model, which is similar to the assignment of data points to the closest centroid in K-means; and (2) calculate the mean, deviation, and proportion of each Gaussian component with the fixed probabilities of the individual observations for the Gaussian models, which is equivalent to the cluster center computation in K-means. This implementation of the GMM is called the Expectation Maximization (EM) procedure [11]. On the other hand, in text mining and analysis, probabilistic topic models have been proposed for the discovery of hidden semantic structures in a text body, and have been widely applied in image processing for extracting latent and compact topics from input features.

Data compression-based unsupervised methods include principal component analysis (PCA), independent component analysis (ICA) [14, 15], sparse coding [16–20], and so on. PCA is a mathematical procedure that learns an orthogonal transformation of the target signal to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables; by retaining only several principal components, i.e., the linear combinations that communicate most of the variance in the data, PCA can extract a much lower-dimensional vector to approximate the original observation and thus provides data compression. ICA is a method to find a linear non-orthogonal coordinate system in any multivariate data, where the directions of the axes are determined not only by the second-order but also by higher-order statistics of the original data. These two classical learning methods usually produce only a non-overcomplete transformation basis, and thus require all the learned basis vectors to represent the observed signal well, which leads to a dense representation. On the other hand, the understanding of processes in the retina and primary visual cortex (V1) of the human visual system [21] has elucidated that early visual processing compresses the input into a more efficient form by activating only a few receptive fields among millions, a mechanism that can be explained mathematically by sparse coding, i.e., learning an over-complete basis and using only a few basis vectors to represent the observed signal, which is called sparse representation.
Thanks to the success of the sparse coding strategy in representing and compressing high-dimensional data, it has been widely applied in pattern recognition, image representation, and so on.


Recently, the neural network (NN)-based unsupervised learning approach [3–6, 29, 30] has manifested impressive performance for learning latent features in different vision applications [3–6]. NN-based unsupervised learning methods mainly include data distribution approximating models such as the restricted Boltzmann machine [3, 4] and reconstruction error minimization strategies (self-taught learning) such as the autoencoder [5, 6]. The RBM aims at estimating the entropy of candidate features consistent with the data to infer the hidden features, and has been applied to a wide range of vision problems. The autoencoder (AE) is able to learn latent features by setting the target values to be equal to the input data, and is formulated to minimize the reconstruction error of the input data. Motivated by the natural sparsity of data representation, sparsity has been imposed on the hidden-layer responses of the AE for more effective feature learning, which is called a sparse autoencoder (SAE). A recent research effort in NN-based unsupervised learning is to stack several layers to build a much deeper framework, which can extract high-level latent features, and this has been validated to further improve performance in several applications. In addition, unsupervised learning can be used as a pre-training step for subsequent supervised learning in deep networks. Thus, for more effectively using the pre-training knowledge, understanding unsupervised learning is of fundamental importance.

All the above-mentioned unsupervised learning methods generally require input data of the same dimension, and a pre-processing procedure is usually applied to unify the input data. In the image representation field, unsupervised learning methods such as K-means, sparse coding, and GMM have been applied for learning compact representations of local image features such as SIFT descriptors and image patches, and the large number of coded local features in an image are combined into a vector of unified dimension using the bag-of-words (BOF) model [32–40]. The features extracted via the BOF model, as an effective image representation, have manifested impressive performance in different visual categorization applications. This study explores a novel neural network-based unsupervised learning approach, called a residual sparse autoencoder, for pursuing the latent features of more tiny structures in the raw input image patches. With the unavoidably generated reconstruction residual error, we exploit another sparse autoencoder to pursue the latent features of the residual tiny structures, and this self-taught learning process can continue until the representation residual error is sufficiently small. Finally, we aggregate the latent features learned by the proposed residual SAE from the large number of local descriptors (local patches) of an image into a fixed-length vector for image representation.

HEp-2 Cell Recognition: Indirect immunofluorescence (IIF) is widely used as a diagnostic tool via image analysis; it can reveal the presence of autoimmune diseases by finding antibodies in the patient sera. Since it is effective for diagnosing autoimmune diseases [1], the demand for applying IIF image analysis in diagnostic tests is increasing. One research area involving IIF image analysis lies in the identification of HEp-2 staining cell patterns using progressive techniques developed in the computer vision and machine learning fields. Several attempts to achieve the automatic recognition of HEp-2 staining patterns have been made. Perner et al.
[2] proposed the extraction of texture and statistical features for cell image representation and then combined the extraction with a decision tree model for HEp-2 cell image classification.


Soda et al. [3] investigated a multiple expert system (MES) in which an ensemble of classifiers was combined to label the patterns of single cells; however, research in the field of IIF image analysis is still in its early stages, and there is still significant potential for improving the performance of HEp-2 staining cell recognition. Further, although several approaches have been proposed, they have usually been developed and tested on different private datasets under varying conditions, such as image acquisition according to different criteria and different staining patterns; therefore, it is difficult to compare the effectiveness of these different approaches. In our study, we aim to achieve the automatic recognition of six HEp-2 staining patterns in an open HEp-2 dataset, which was recently released as part of the second HEp-2 cells classification contest at ICIP2013. There are many works exploring the recognition performance on this released HEp-2 cell dataset, and they have achieved promising results [41–46]. In the first HEp-2 cells classification contest at ICIP2012, it was shown that an LBP-based descriptor, the rotation invariant co-occurrence LBP (RICLBP), for cell image representation achieved promising HEp-2 cell classification performance [4, 5]. In the second HEp-2 cells classification contest at ICIP2013, it was further shown that the combination of another extended LBP version, the pairwise rotation invariant co-occurrence LBP (PRICoLBP) [44], and BOF [45] with a SIFT descriptor [46] achieved the best recognition results. Manivannan et al. [53] modeled multi-resolution local patterns with sparse coding and a GMM for extracting discriminative features of HEp-2 cell images, and manifested impressive recognition performance on the ICPR2014 contest HEp-2 cell dataset. Han et al. [47] extended LBP to the local ternary pattern and proposed RICLBP and Weber-based RICLBP for further improving the performance of HEp-2 cell classification. The same research group also exploited the stacked Fisher network for encoding the Weber local descriptor [48], which can extract high-level features for image representation, and manifested impressive performance in HEp-2 cell classification. This chapter explores a novel neural network-based unsupervised learning method for extracting latent features for HEp-2 cell image representation, and pursues further performance improvement in HEp-2 cell classification.

11.3 Autoencoder and Its Extension: Sparse Autoencoder

The autoencoder (AE) is a neural network-based unsupervised learning algorithm that aims at automatically learning latent features for best reconstructing the unlabeled input data, typically for the purpose of dimension reduction, i.e., the process of reducing the number of random variables under consideration. An autoencoder mainly consists of two components: an encoder function that creates a hidden layer (or multiple layers) containing a code that describes the input, and a decoder that creates a reconstruction of the input from the hidden layer. By designing the hidden layer to be smaller than the input layer, an autoencoder can extract a compressed representation of the data in the hidden layer by learning correlations in the data. This facilitates the classification, visualization, communication, and storage of data [48]. The basic structure of an AE is shown in Fig. 11.1.


Fig. 11.1 The schematic concept of the autoencoder

In more detail, an AE is a symmetric neural network that learns features by minimizing the reconstruction error between the input data at the encoding layer and its reconstruction at the decoding layer. Given an input sample x ∈ R^d, the encoding procedure applies a linear mapping followed by a nonlinear activation function:

\mathbf{y} = \mathrm{sigm}(\mathbf{W}\mathbf{x} + \mathbf{b}_1), \qquad (11.1)

where W ∈ R^{d_o × d} is a weight matrix with d_o features (d_o neurons in the encoding layer), b_1 ∈ R^{d_o} is the encoding bias, and sigm(·) is the logistic sigmoid function. Decoding of the latent feature y in the encoding layer is performed using a separate decoding matrix:

\hat{\mathbf{x}} = \mathbf{V}^{T}\mathbf{y} + \mathbf{b}_2, \qquad (11.2)

where V ∈ R^{d_o × d} is the decoding matrix and b_2 ∈ R^d is a decoding bias. Given a training sample ensemble X = [x_1, x_2, ..., x_N], the latent features Y in the data are learned by minimizing the reconstruction error L(\mathbf{X}, \hat{\mathbf{X}}) = \|\mathbf{X} - \hat{\mathbf{X}}\|^2 = \sum_{i=1}^{N}\|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2, where \hat{\mathbf{X}} denotes all the reconstructed data, and the parameters W, V, b_1, b_2 are optimized by minimizing L(\mathbf{X}, \hat{\mathbf{X}}). Motivated by the natural sparsity of data representation, sparsity has been imposed on the hidden-layer activations, which is called a sparse autoencoder (SAE) [5, 6], for learning more effective features, and the cost function of an SAE is formulated as:

\langle \mathbf{W}, \mathbf{V}, \mathbf{b} \rangle = \arg\min_{\mathbf{W},\mathbf{V},\mathbf{b}} L(\mathbf{X}, \hat{\mathbf{X}}) + \lambda \sum_{j=1}^{d_o} KL(\rho \,\|\, \hat{\rho}_j), \qquad (11.3)


where λ is the weight of the sparsity penalty, ρ is the target average activation of the latent feature Y, and \hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} y_{ji} is the average activation of the j-th element of the latent vector y over the N training samples. KL(·) denotes the Kullback–Leibler divergence [5], given by:

KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}, \qquad (11.4)

which provides the sparsity constraint on the latent features.
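As a minimal illustration, the following PyTorch sketch implements a single-hidden-layer SAE with the sigmoid encoder of Eq. (11.1), the linear decoder of Eq. (11.2), and the cost of Eq. (11.3) with the KL penalty of Eq. (11.4); the layer sizes, target sparsity ρ, and weight λ are assumed values, not taken from the chapter.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d=49, d_o=100):
        super().__init__()
        self.W = nn.Linear(d, d_o)        # encoder: y = sigm(Wx + b1)
        self.V = nn.Linear(d_o, d)        # decoder: x_hat = V^T y + b2
    def forward(self, x):
        y = torch.sigmoid(self.W(x))
        return self.V(y), y

def kl_sparsity(rho, rho_hat):
    # Eq. (11.4), summed over the hidden units
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

def sae_loss(x, x_hat, y, rho=0.05, lam=1e-3):
    # Eq. (11.3): reconstruction error plus weighted KL sparsity penalty
    rho_hat = y.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # average activation per unit
    return ((x - x_hat) ** 2).sum() + lam * kl_sparsity(rho, rho_hat)
```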

11.4 Residual SAE for Self-taught Learning

Although the AE and SAE optimize the parameters ⟨W, V, b⟩ by minimizing the reconstruction error of the input data, it is unavoidable to produce a residual error X_Res = X − \hat{X}, which cannot be recovered by further processing. The lost residual may be discriminative for the target task, especially in fine-grained image processing. Therefore, this study proposes to stack a further SAE to encode the residual error, instead of the learned hidden feature as in the conventional stacked SAE framework, and the cost function of the proposed residual SAE is formulated as:

\langle \mathbf{W}_{Res}, \mathbf{V}_{Res}, \mathbf{b}_{Res} \rangle = \arg\min L_{Res}(\mathbf{X}_{Res}, \hat{\mathbf{X}}_{Res}) + \lambda \sum_{j=1}^{d_o} KL(\rho_{Res} \,\|\, \hat{\rho}_{j\,Res}),

(11.5)

where W_Res, V_Res, b_Res are the encoding weight matrix, the decoding matrix, and the encoding/decoding biases of the residual SAE. More residual layers can be stacked for learning latent features of very tiny structures, and the global objective function can be formulated as:

\langle \theta, \theta_{Res1}, \theta_{Res2}, \cdots \rangle = \arg\min \; \beta_1 L(\mathbf{X}, \hat{\mathbf{X}}) + \beta_2 L_{Res1}(\mathbf{X}_{Res1}, \hat{\mathbf{X}}_{Res1}) + \cdots + \lambda \sum KL(\rho, \rho_{Res1}, \cdots), \qquad (11.6)

where θ = ⟨W, V, b⟩, θ_Res1 = ⟨W_Res1, V_Res1, b_Res1⟩, and θ_Res2 = ⟨W_Res2, V_Res2, b_Res2⟩ denote the optimized parameters of the raw SAE and the first- and second-level residual SAEs, respectively, and β1, β2, β3, ... are the weights of the reconstruction errors at the different levels of the residual SAE. The activation values in the hidden layers of the raw SAE and the residual SAEs can be combined as the feature representation of the input data. The schematic concept of the proposed residual SAE is shown in Fig. 11.2, where d, d1, d2, ... are the neuron numbers of the input layer and of the hidden layers of the raw SAE and the first- and second-level residual SAEs.
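A minimal sketch of the residual stacking idea, assuming the levels are trained sequentially (train an SAE, compute the reconstruction residual, train the next SAE on that residual) rather than by jointly optimizing Eq. (11.6); it reuses the hypothetical SparseAutoencoder and sae_loss of the previous sketch, and the widths and optimizer settings are assumptions.

```python
import torch

def train_sae(data, d, d_o, epochs=50, lr=1e-3):
    sae = SparseAutoencoder(d, d_o)                 # from the previous sketch
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        x_hat, y = sae(data)
        loss = sae_loss(data, x_hat, y)
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

def train_residual_saes(patches, widths=(100, 100, 100)):
    # Level 0 learns the raw patches; each later level learns the residual
    # X_Res = X - X_hat left by the previous level.
    saes, target = [], patches
    for d_o in widths:
        sae = train_sae(target, target.shape[1], d_o)
        with torch.no_grad():
            x_hat, _ = sae(target)
        saes.append(sae)
        target = target - x_hat                     # residual for the next level
    return saes
```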


Fig. 11.2 The schematic concept of the proposed residual SAE. Several SAEs are stacked to learn the latent features of the residual that is unrecoverable by the former SAE, until the reconstruction error is sufficiently small

In addition, from the target HEp-2 cell images we extracted image patches to form d-dimensional vectors as the training samples of the proposed residual SAE. The learned weights W, W_Res1, and W_Res2 are reshaped to the size of the original image patch for visualization. The visualized weights of the residual SAE are shown in Fig. 11.3, which reveal more detailed structures in the later-level residual SAEs.

11.5 The Aggregated Activation of the Residual SAE for Image Representation

As mentioned above, the input data of the raw SAE are the vectorized l × l local regions, which are extracted from the input image in a sliding manner. Assuming that the numbers of hidden-layer neurons in the raw, first-level, and second-level residual SAE are d1, d2, and d3, respectively, we obtain d1, d2, and d3 activation values for each local region. In general, given an m × n image, we can extract l × l local regions centered at (m − l) × (n − l) focused pixels, and thus the activation values produced by a hidden layer with dk neurons of a (residual) SAE can be rearranged into dk maps of size (m − l) × (n − l). The bottom row of Fig. 11.2 shows the obtained activation maps for the different-level SAEs. We also provide several activation maps of the hidden layers of the three-level residual SAE for two HEp-2 cell images in Fig. 11.4.


(a) The visualized weights from the raw SAE
(b) The visualized weights from the first- and second-level residual SAE

Fig. 11.3 The visualized weights in the raw SAE and in the first- and second-level residual SAE


Fig. 11.4 Several activation maps of the hidden layers in three-level residual SAE for 2 HEp-2 cell images

These maps reveal the detailed structures captured in the later levels of the residual SAE. Since the sizes of the HEp-2 cell images differ, the sizes of the activation maps of the residual SAE change accordingly. To obtain same-length features for HEp-2 image representation, we divide each activation map into the same number of regions and aggregate the activations within each region by averaging them to form the final representation. Since the cell region masks are also provided with the HEp-2 cell images in this dataset, we apply morphological operators (dilation, erosion, etc.) to the cell mask image to form the center, middle, and boundary regions for activation aggregation, which we call the donut-shaped spatial regions-based aggregating method, as shown in Fig. 11.5. With the neuron numbers d1, d2, and d3 of the hidden layers of the three-level residual SAE, a (d1 + d2 + d3) × 3-dimensional feature vector is generated for HEp-2 image representation.
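A rough sketch of the donut-shaped aggregation, assuming a binary cell mask and SciPy's binary erosion; the number of erosion steps that separates the boundary, middle, and center regions is an assumption not given in the text.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def donut_regions(mask, n_erosions=5):
    # Split a binary cell mask into center, middle, and boundary rings.
    inner1 = binary_erosion(mask, iterations=n_erosions)
    inner2 = binary_erosion(mask, iterations=2 * n_erosions)
    boundary = mask & ~inner1
    middle = inner1 & ~inner2
    center = inner2
    return [center, middle, boundary]

def aggregate_activations(activation_maps, mask):
    # activation_maps: array of shape (d1+d2+d3, H, W); mask: (H, W) bool array
    feats = []
    for region in donut_regions(mask):
        if region.any():
            feats.append(activation_maps[:, region].mean(axis=1))
        else:
            feats.append(np.zeros(activation_maps.shape[0]))
    return np.concatenate(feats)        # (d1 + d2 + d3) * 3 -dimensional vector
```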


Fig. 11.5 The divided donut-shaped spatial regions for aggregating the activation values to form a same-length feature vector as the HEp-2 image representation

11.6 Experiments

This section introduces the medical context used and the experimental results obtained using the proposed residual SAE for image representation.

11.6.1 Medical Context

In ANA tests, the HEp-2 substrate is generally applied, and both the fluorescence intensity and the staining pattern need to be classified, which is a challenging task that affects the reliability of IIF diagnosis. For classifying fluorescence intensity, the guidelines established by the Center for Disease Control and Prevention in Atlanta, Georgia (CDC) [49] suggest that semi-quantitative scoring be performed independently by two physician IIF experts. The score ranges from 0 to 4+ according to the intensity: negative (0), very subdued fluorescence (1+), defined pattern but diminished fluorescence (2+), less brilliant green (3+), and brilliant green or maximal fluorescence (4+). The values are relative to the intensity of a negative and a positive control. The cell with positive intensity allows the physician to check the correctness of the preparation process, whereas that with negative intensity represents the auto-fluorescence level of the slide under examination. To reduce the variability of multiple readings, Rigon et al. [50] recently proposed classifying the fluorescence intensity into three classes, named negative, intermediate, and positive, by statistically analyzing the variability among several physicians' fluorescence intensity classifications. The open ICIP2013 HEp-2 dataset includes two intensity types of HEp-2 cells, intermediate and positive, and the purpose of this research is to recognize the staining pattern given the intensity type (intermediate or positive). The studied staining patterns primarily include six classes: (1) Homogeneous: characterized by a diffuse staining of the interphase nuclei and staining of the chromatin of mitotic cells; (2) Speckled: characterized by a granular nuclear staining of the interphase cell nuclei, which consists of fine and coarse speckled patterns;


(3) Nucleolar: characterized by clustered large granules in the nucleoli of interphase cells that tend toward homogeneity, with fewer than six granules per cell; (4) Centromere: characterized by several discrete speckles (∼40–60) distributed throughout the interphase nuclei and characteristically found in the condensed nuclear chromatin during mitosis as a bar of closely associated speckles; (5) Golgi: also called the Golgi apparatus, one of the first organelles to be discovered and observed in detail, composed of stacks of membrane-bound structures known as cisternae; (6) NuMem: abbreviated from nuclear membrane, characterized as a fluorescent ring around the cell nucleus, produced by anti-gp210 and anti-p62 antibodies. In the open ICIP2013 HEp-2 cell dataset, there are more than 10000 images, each showing a single cell, which were obtained from 83 training IIF images by cropping the bounding box of each cell. Detailed information about the different staining patterns is shown in Table 11.1, and some example images for all six staining patterns of the positive and intermediate intensity types are shown in Fig. 11.6. Using the provided HEp-2 cell images and their corresponding patterns, we can extract features that are effective for image representation and learn a classifier (a mapping function) using the extracted features of the cell images and the corresponding staining patterns. Using the constructed classifier (the mapping function), the staining pattern can be predicted automatically for any given HEp-2 cell image.

Table 11.1 Cell image numbers for the different staining patterns and intensity types

              Homogeneous  Speckled  Nucleolar  Centromere  NuMem  Golgi
Positive             1087      1457        934        1387    943    347
Intermediate         1407      1374       1664        1364   1265    377

Fig. 11.6 Example images of six HEp-2 staining patterns of both positive and intermediate intensity types. a Positive intensity type; b Intermediate intensity type


In the classification procedure, the method used to extract the discriminant features for cell image representation has a significant effect on the recognition performance.

11.6.2 Experimental Results

Using HEp-2 cell images of the two intensity types (Intermediate and Positive), we validated the recognition performance of our proposed residual SAE against the conventional SAE. In our experiments, we randomly selected Q (Q = 10, 30, · · · , 310) cell images from each of the six patterns as training images, and the remainder were used as testing images, for both the Positive and Intermediate intensity types. The input to the proposed residual SAE consists of vectorized 7 × 7 local regions (l set to 7) extracted from the HEp-2 cell images. For each HEp-2 cell image, many local regions can be extracted, and each hidden neuron's latent activations over all extracted local regions of an image can be re-arranged into a feature map as in Fig. 11.4. We thus extract (d1 + d2 + d3) feature maps for each HEp-2 cell image and use the donut-shaped spatial-region aggregation method to obtain a fixed-dimensional vector as the HEp-2 cell image representation. To categorize the HEp-2 cell images into the different staining patterns, a linear SVM was used as the classifier because of its effectiveness compared with other classifiers, such as K-nearest neighbor, and its efficiency compared with a nonlinear SVM, which requires much more time to classify a sample. The above procedure was repeated 20 times, and the final results are the average recognition performance over the 20 runs, calculated as the percentage of correctly classified cell images among all test samples.

The comparative recognition rates for the 'Positive' and 'Intermediate' intensity types are shown in Fig. 11.7a and b, respectively, where 'First (Raw SAE)' denotes the image representation obtained with the conventional SAE; 'Second' and 'Third' denote the aggregated features from the feature maps of a single hidden layer (the second and third levels of the residual SAE); 'Second and Third' denotes the features from both the second and third levels; and '2 Layers' and '3 Layers' denote our proposed residual SAE with two and three levels of hidden layers, respectively. It can clearly be seen that the proposed residual SAE outperforms the conventional SAE for both the 'Positive' and 'Intermediate' intensity types.

In addition, we varied the size of the local regions for the SAE and the residual SAE and extracted the aggregated representation to evaluate the recognition performance. Figure 11.8 shows the compared recognition accuracies for different local region sizes (denoted LR size): increasing the local region size improves the performance slightly for the 'Positive' type but causes some degradation for the 'Intermediate' type. Furthermore, we combined the aggregated image representations of the residual SAE with different local region sizes (7, 9, and 11) and conducted the HEp-2 cell recognition experiment with the combined representation. The recognition accuracies for both the 'Positive' and 'Intermediate' types are also given in Fig. 11.8, showing that better performance can be achieved by combining multi-scale residual SAEs for latent feature extraction.
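The evaluation protocol described above can be summarized in a minimal sketch, assuming the aggregated feature vector of every cell image has already been computed with the residual SAE; the array names, the value of Q, and the SVM hyper-parameters are placeholders rather than the settings actually used.

```python
import numpy as np
from sklearn.svm import LinearSVC

def evaluate(features, labels, q=100, n_runs=20, seed=0):
    """Repeat: draw q training cells per staining pattern, train a linear SVM,
    and score on the remaining cells; report mean accuracy over all runs."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_runs):
        train_idx = []
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            train_idx.extend(rng.choice(idx, size=q, replace=False))
        train_idx = np.asarray(train_idx)
        test_mask = np.ones(len(labels), dtype=bool)
        test_mask[train_idx] = False

        clf = LinearSVC(C=1.0, max_iter=10000)   # linear SVM classifier
        clf.fit(features[train_idx], labels[train_idx])
        accuracies.append(clf.score(features[test_mask], labels[test_mask]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```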

Fig. 11.7 Comparison of HEp-2 cell staining pattern recognition accuracy obtained with the raw SAE and the residual SAE for different numbers of training images: a 'Positive' intensity type; b 'Intermediate' intensity type


Fig. 11.8 Comparison of recognition accuracy with different local region sizes (LR size) for the positive and intermediate intensity types

Table 11.2 Comparison of our proposed residual SAE with state-of-the-art methods [48, 52, 53] (recognition accuracy, %)

              GLRL [52]  SGLD [52]  Laws [52]  rSIFT [53]  MP [53]  FN [48]  Our
Positive      77.23      84.37      94.68      91.9        95.29    97.90    98.45
Intermediate  39.33      49.75      81.06      78          86.91    91.93    92.24

Finally, we compare the results of our proposed residual SAE with those of state-of-the-art methods [48, 52, 53] under the same experimental conditions for HEp-2 cell staining pattern recognition; as shown in Table 11.2, promising performance is obtained by our proposed method.

11.7 Conclusion

We have proposed a novel residual SAE network for self-taught learning of the latent features of very small structures in fine-grained medical image analysis tasks. Instead of stacking SAEs to learn high-level features from the output of the previous SAE, we exploit the residual SAE to model the residual reconstruction error, which vanishes in the previous SAE and can no longer be recovered, in order to learn latent representations of smaller structures. We have evaluated the proposed residual SAE for HEp-2 cell image representation and shown that promising staining pattern recognition performance can be achieved.

Acknowledgements This work was supported in part by a Grant-in-Aid for Scientific Research from the Japanese MEXT.

References 1. Conrad, K., Schoessler, W., Hiepe, F., Fritzler, M.J.: Autoantibodies in Systemic Autoimmune Diseases. Pabst Science Publishers, Lengerich (2002) 2. Conrad, K., Humbel, R.L., Meurer, M., Shoenfeld, Y.: Autoantigens and Autoantibodies: Diagnostic Tools and Clues to Understanding Autoimmunity. Pabst Science Publishers, Lengerich (2000) 3. Foggia, P., Percannella, G., Soda, P., Vento, M.: Benchmarking HEp-2 cells classification methods. IEEE Trans. Med. Imaging 32(10), 1878–1889 (2013) 4. Hiemann, R., Hilger, N., Sack, U., Weigert, M.: Objective quality evaluation of fluorescence images to optimize automatic image acquisition. Cytom. Part A 69(3), 182–184 (2006) 5. Soda, P., Rigon, A., Afeltra, A., Iannello, G.: Automatic acquisition of immunofluorescence images: algorithms and evaluation. In: 19th IEEE International Symposium on Computer Based Medical Systems, pp. 386-390 (2006) 6. Huang, Y.L., Chung, C.W., Hsieh, T.Y., Jao, Y.L.: Outline detection for the HEp-2 cells in indirect immunofluorescence images using watershed segmentation. In: IEEE International Conference on Sensor Networks, Ubiquitous and Trustworthy Computing, pp. 423-427 (2008) 7. Huang, Y.L., Jao, Y.L., Hsieh, T.Y., Chung, C.W.: Adaptive automatic segmentation of HEp-2 cells in indirect immunofluorescence images. IEEE International Conference on Sensor Networks, Ubiquitous and Trustworthy Computing, pp. 418-422 (2008) 8. Bach, F.R., Jordan, M.I.: Learning spectral clustering, with application to speech separation. J. Mach. Learn. Res. 7, 1963–2001 (2006) 9. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993) 10. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011) 11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). JRSS-B 39, 1–38 (1977) 12. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and it oracle properties. JASA 96, 1348–1360 (2001) 13. Fraley, C., Raftery, A.E.: MCLUST version 3 for R: normal mixture modeling and model-based clustering. Technical Report no. 504, Department of Statistics, University of Washington (2006) 14. Han, X.-H., Chen, Y.-W., Nakao, Z.: An ICA-Based Method for Poisson Noise Reduction. Lecture Notes in Artificial Intelligence, vol. 2773, pp. 1449–1454. Springer, Berlin (2003) 15. Han, X.-H., Nakao, Z., Chen, Y.-W.: An ICA-domain shrinkage based Poisson-noise reduction algorithm and its application to Penumbral imaging. IEICE Trans. Inf. Syst. E88-D(4), 750–757 (2005) 16. Elad, M., Aharon, M.: Image denoising via learned dictionaries and sparse representation. In: CVPR 06 (2006) 17. Hale, E.T., Yin, W., Zhang, Y.: Fixed-point continuation for l1-minimization: methodology and convergence. SIAM J. Optim. 19, 1107 (2008) 18. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. JMLR 5, 1457– 1469 (2004) 19. Kavukcuoglu, K., Ranzato, M.A. LeCun, Y.: Fast inference in sparse coding algorithms with applications to object recognition, Technical Report CBLL-TR-2008-12-01. Computational and Biological Learning Lab, Courant Institute, NYU (2008) 20. Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: NIPS 06 (2006)


21. Lee, H., Chaitanya, E., Ng, A. Y.: Sparse deep belief net model for visual area v2. In: Advances in Neural Information Processing Systems (2007) 22. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: International Conference on Machine Learning. New York (2009) 23. Li, Y., Osher, S.: Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. Inverse Probl. Imaging 3(3), 487–503 (2009) 24. Mairal, J., Elad, M., Sapiro, G.: Sparse representation for color image restoration. IEEE Trans. Image Process. 17(1), 53–69 (2008) 25. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: ICML09 (2009) 26. Aljalbout, E., Golkov, V., Siddiqui, Y., Cremers, D.: Clustering with deep learning: taxonomy and new methods (2018). arXiv preprint arXiv:1801.07648 27. Chen, D., Lv, J., Yi, Z.: Unsupervised multi-manifold clustering by learning deep representation. In: Workshops at the 31th AAAI Conference on Artificial Intelligence (AAAI), pp. 385-391 (2017) 28. Chen, G.: Deep learning with nonparametric clustering (2015). arXiv preprint arXiv:1501.03084 29. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011) 30. Dizaji, K.G., Herandi, A., Huang, H.: Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization (2017). arXiv preprint arXiv:1704.06327 31. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430 (2015) 32. J’egou, H., Douze, M., Schmid, C.: Packing bag-of-features. In: Proceedings of the 12th IEEE International Conference Computer Vision. Kyoto, Japan, pp. 2357–2364 (2009) 33. Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. Int. J. Comput. Vis. 87(3), 316–336 (2010) 34. Ke, Y., Sukthankar, R.: PCA-SIFT: a more distinctive representation for local image descriptors. In: Proceedings of the IEEE International Conference on Computer Vision Pattern Recognition, pp. 505–513 (2004) 35. Kertesz, C.: Texture-based foreground detection. Int. J. Signal Process. Image Process. Pattern Recognit. 4(4), 51–62 (2011) 36. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local affine regions. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1265–1278 (2005) 37. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Computer Society Conference on Computer Vision Pattern Recognition, pp. 2169–2178 (2006) 38. Liu, L., Wang, L., Liu, X.: In defense of soft-assignment coding. In: Proceedings of the International Conference on Computer Vision, pp. 2486–2493. Barcelona, Spain (2011) 39. Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision, pp. 1150–1157. Kerkyra, Greece (1999) 40. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004) 41. Huang, Y.-L., Chung, C.-W., Hsieh, T.-Y., Jao, Y.-L.: Outline detection for the HEp- 2 cells in indirect immunofluorescence images using watershed segmentation. 
In: IEEE International Conference on Sensor Networks, Ubiquitous and Trustworthy Computing, pp. 423–427 (2008) 42. Perner, P., Perner, H., Muller, B.: Mining knowledge for HEp-2 cell image classification. J. Artif. Intell. Med. 26, 161–173 (2002) 43. Soda, P., Iannello, G., Vento, M.: A multiple experts system for classifying fluorescence intensity in antinuclear autoantibodies analysis. Pattern Anal. Appl. 12(3), 215–226 (2009)


44. Hiemann, R., Buttner, T., Krieger, T., Roggenbuck, D., Sack, U., Conrad, K.: Challenges of automated screening and differentiation of non-organ specific autoantibodies on hep-2 cells. Autoimmun. Rev. 9(1), 17–22 (2009) 45. Hiemann, R., Buttner, T., Krieger, T., Roggenbuck, D., Sack, U., Conrad, K.: Automatic analysis of immunofluorescence patterns of HEp-2 cells. Ann. N. Y. Acad. Sci. 1109(1), 358–371 (2007) 46. Soda, P., Iannello, G.: Aggregation of classifiers for staining pattern recognition in antinuclear autoantibodies analysis. IEEE Trans. Inf. Technol. Biomed. 13(3), 322–329 (2009) 47. Han, X.-H., Chen, Y.-W., Gang, X.: Integration of spatial and orientation contexts in local ternary patterns for HEp-2 cell classification. Pattern Recognit. Lett. 82, 23–27 (2016) 48. Han, X.-H., Chen, Y.-W.: HEp-2 staining pattern recognition using stacked fisher network for encoding weber local descriptor. Xian-Hua Han and Yen-Wei Chen, Pattern Recognition 63, 542–550 (2017) 49. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science (2006) 50. Center for Disease Control: Quality assurance for the indirect immunofluorescence test for autoantibodies to nuclear antigen (IF-ANA): approved guideline. NCCLS I/LA2-A 16(11) (1996) 51. Rigon, A., Soda, P., Zennaro, D., Iannello, G., Afeltra, A.: Indirect immunofluorescence in autoimmune diseases: assessment of digital images for diagnostic purpose. Cytom. B (Clin. Cytom.) 72(3), 472–477 (2007) 52. Agrawal, P., Vatsa, M., Singh, R.: Hep-2 cell image classification: a comparative analysi. In: Machine Learning in Medical Imaging. Lecture Notes in Computer Science, pp. 195–202 (2013) 53. Manivannan, S., Li, W., Akbar, S., Wang, R., Zhang, J., McKenna, S.J.: An automated pattern recognition system for classifying indirect immunofluorescence images of hep-2 cells and specimens. Pattern Recognit. 12–26 (2016)

Part III

Application of Deep Learning in Healthcare

Chapter 12

Dr. Pecker: A Deep Learning-Based Computer-Aided Diagnosis System in Medical Imaging

Guohua Cheng and Linyang He

Abstract This chapter demonstrates the clinical applications of computer-aided diagnosis (CAD) systems based on deep learning algorithms, focusing on their IT infrastructure design. In comparison with traditional CAD systems, which are mostly standalone applications designed to solve a particular task, we explain the design choices of a cloud-based CAD platform that allows computationally intensive deep learning algorithms to run in a cost-efficient way. It also provides off-the-shelf solutions to collect, store, and secure data anywhere and anytime from various data sources, which is essential for training deep learning algorithms. Finally, we show the strong performance of such a CAD platform for analyzing medical imaging data of various modalities, before concluding.

12.1 Introduction

Diagnosis and treatment for various medical conditions largely depend on radiology. However, staff shortages in radiology are occurring while imaging demands are generally increasing [3]. This staffing crisis forces the current workforce to work under high pressure, diminishing the quality of patient care. One attempt to resolve this issue is to implement computer-aided diagnosis (CAD) systems to improve the efficiency of the current work-flow. Such systems are designed to automatically locate, identify, classify, and quantify suspicious patterns in imaging data, both to reduce the reading effort and to provide a potentially better interpretation for the radiologist. As computer vision algorithms constantly improve, there is rapid growth in the number of commercialized CAD systems applied in various clinical environments.


At present, most of these FDA-approved systems are employed alongside clinical experts, e.g., serving as a second reader in a double-reading protocol, while only a few of them [9] are fully automated. The specifications of these systems are usually narrowed down to a particular task in a predefined environment as a standalone application. Underlying reasons include that they are designed on small-scale data sets and that many of them rely on rule-based approaches in their algorithm design. These have been bottlenecks preventing such systems from addressing complex clinical problems that require mixed domain knowledge. As the deep learning ecosystem advances, there is a natural trend of moving standalone CAD systems towards cloud-based solutions, where large-scale data sets can easily be collected and secured from different sources across the world. These clouds also offer high-performance computing units in a distributed environment, which makes computationally intensive imaging analysis jobs possible. These infrastructure improvements provide the foundation for designing algorithms for complex clinical problems.

In this chapter, we demonstrate the design of such a cloud-based CAD solution by introducing an example: Dr. Pecker,1 an award-winning cloud-based medical image analysis solution. Apart from the benefits of its cloud-based design, Dr. Pecker offers versatile and reliable CAD functions that rely heavily on deep learning algorithms. Users are able to remotely access diagnosis services and provide feedback in real time. This feedback, in turn, is transformed into reference standards for the future training of the deep learning algorithms. Meanwhile, Dr. Pecker allows third-party integration to enrich its CAD functions and hence is capable of addressing complex problems in a variety of clinical scenarios through system ensembles. In the following sections, we first introduce the system design of Dr. Pecker. Then we review its success by showcasing several clinical applications before the conclusion.

12.2 System Overview

Dr. Pecker provides cloud-based solutions for diagnosis services based on fast and reliable image analysis using deep learning. On the front-end side, Dr. Pecker offers workstations for accessing diagnosis services and reviewing results, along with a certified DICOM viewer for inspecting medical images. On the back-end side, Dr. Pecker follows a modularized approach to ensure flexibility, in which components can easily be added or modified to cater to changing clinical requirements. It also maintains a dynamically expandable high-performance computing cluster and storage infrastructure to manage increasing hospital IT complexity. Although Dr. Pecker enables security-rich public cloud services, in many cases it can also run in private or even on-premise mode, depending on the design of the hospital IT infrastructure. All APIs provided by Dr. Pecker comply with the RESTful standard, which realizes the platform independence that simplifies integration with other service providers.

1 http://www.jianpeicn.com/category/yuepianjiqiren.
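To make the integration idea concrete, the following sketch shows how a hospital-side client might talk to a RESTful cloud CAD service over HTTPS. The endpoint paths, response fields, and token handling are hypothetical illustrations only and do not describe the actual Dr. Pecker API.

```python
import requests

BASE_URL = "https://cad.example-cloud.org/api/v1"   # hypothetical service URL
TOKEN = "..."                                        # placeholder access token

def submit_study(dicom_archive_path):
    """Upload an anonymized DICOM archive and start an analysis job."""
    with open(dicom_archive_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/studies",
            headers={"Authorization": f"Bearer {TOKEN}"},
            files={"study": f},
            timeout=300,
        )
    resp.raise_for_status()
    return resp.json()["job_id"]          # hypothetical response field

def fetch_report(job_id):
    """Poll for the structured report of a finished job."""
    resp = requests.get(
        f"{BASE_URL}/jobs/{job_id}/report",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

Because the interface is plain HTTP, the same kind of calls could be issued from a workstation plug-in, a PACS gateway, or a third-party service.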


12.2.1 Seamless Integration with Hospital IT

The user experience of a CAD system depends largely on how well it is integrated with the clinical work-flow. Such integration entails two aspects: (1) from the end users' point of view (the users who interact with the CAD), their clinical practice should not be hindered by the introduction of CAD systems; an adequately integrated CAD should lead to less fragmented usage of all relevant clinical applications; (2) from the perspective of hospital IT, CAD systems should be easily maintainable within the existing IT environment. In simple terms, the purpose of CAD is to improve the efficiency of making a diagnosis rather than to create extra costs. To achieve seamless integration, Dr. Pecker deploys a dedicated workstation that meets the requirements of reading images in the target clinical environment. For instance, Dr. Pecker provides a certified DICOM viewer for reading imaging data in radiology. This allows users to rely entirely on Dr. Pecker to work with medical images rather than switching to other systems. Besides, Dr. Pecker shares diagnosis results with hospital data management systems, such as the Picture Archiving and Communication System (PACS).
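As a small illustration of how a workstation can hand imaging data to such a system, the sketch below uses the open-source pydicom library to read an exported DICOM file and pull out the identifiers that let CAD results be matched back to PACS records; the file path is a placeholder.

```python
import pydicom

# Read one slice of an exported study (placeholder path).
ds = pydicom.dcmread("exported_study/slice_001.dcm")

# Unique identifiers that allow results to be linked back to PACS records.
study_uid = ds.StudyInstanceUID
series_uid = ds.SeriesInstanceUID
modality = ds.Modality            # e.g. "CT", "DX", "MR"
print(modality, study_uid, series_uid)
```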

12.2.2 Learning on Users' Feedback

Deep learning is a data-driven approach: deep learning models need to see large quantities of examples to be trained well. By providing friendly and easy-to-use cloud services, Dr. Pecker lets clinical experts interact with the system even from home. They can make annotations or provide feedback on the system's results whenever they are available. In this way, accumulating training examples at large scale becomes tractable, and the performance of the underlying deep learning models can be improved continuously. Meanwhile, Dr. Pecker also innovates in collecting samples by processing users' feedback, so that the quality of the collected samples is well controlled.

12.2.3 Deep Learning Cluster

Many deep learning algorithms involve solving high-dimensional optimization problems over a large number of parameters, which are often computationally intensive. Many deep learning jobs, such as training and inference, nowadays run on GPUs for speed. To allow multiple computationally intensive deep learning tasks to run in parallel and to schedule them efficiently, Dr. Pecker builds a cluster of deep learning machines, each containing multiple GPUs, to fulfill the quality-of-service requirements in terms of concurrency and low latency.


The deep learning cluster uses techniques to unify standard CPU resources with GPU resources to improve the capacity of the whole system. To maximize resource usage, the cluster packages jobs as Docker containers2 and schedules them through Kubernetes engines.3 The deep learning cluster in Dr. Pecker also serves as a separate cloud computing platform that allows third parties to run computationally intensive jobs outside their own physical domain.

2 https://www.docker.com/.
3 https://kubernetes.io/.
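As a rough sketch of how such a scheduler can be driven programmatically, the snippet below uses the official Kubernetes Python client to submit a single-GPU inference job; the container image, namespace, and command are hypothetical placeholders, not the platform's actual configuration.

```python
from kubernetes import client, config

def submit_gpu_job(job_name, image="registry.example.org/cad-inference:latest"):
    """Create a one-off batch job that requests a single GPU."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    container = client.V1Container(
        name=job_name,
        image=image,
        command=["python", "run_inference.py"],           # placeholder command
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    pod_spec = client.V1PodSpec(restart_policy="Never", containers=[container])
    template = client.V1PodTemplateSpec(spec=pod_spec)
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(template=template, backoff_limit=1),
    )
    client.BatchV1Api().create_namespaced_job(namespace="cad-jobs", body=job)
```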

Fig. 12.1 Conceptual work-flow of Dr. Pecker in radiology

12.2.4 Dr. Pecker in Radiology

The work-flow of the Dr. Pecker platform in radiology is shown in Fig. 12.1. In radiology, imaging data are archived in a Picture Archiving and Communication System (PACS), which allows economical image transmission from the site of image acquisition to multiple physically disparate locations. Dr. Pecker works with the PACS to pull imaging data and push results back. Dr. Pecker simply relies on the PACS for accessing multiple modalities (DR, CT, MR, etc.) and for managing the life-cycle of imaging data from the acquisition devices. Once a diagnosis request is made on a deployed workstation, Dr. Pecker runs pre-processing steps such as desensitization, compression, and possible encoding and chunking to ensure sufficient image quality and sanity before uploading to the cloud. On the cloud side, the diagnostic results and the corresponding summary of the uploaded imaging data are archived, while relations with the PACS are maintained through a set of predefined unique identifications. The Dr. Pecker cloud services also keep track of imaging data for the same subject across different modalities and multiple time points for comprehensive studies, such as assessment of treatment response or recurrence using follow-up scans. Figure 12.2 shows the graphical user interface (GUI) of the Dr. Pecker workstation used in radiology practice. The Dr. Pecker workstation is equipped with a certified DICOM viewer, which prevents readers from switching between multiple tools.
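The desensitization step mentioned above can be illustrated with a small pydicom sketch that blanks out patient-identifying tags before a file leaves the hospital network; the tag list is illustrative only and is not a complete de-identification profile.

```python
import pydicom

# Illustrative subset of protected-health-information tags.
SENSITIVE_TAGS = ["PatientName", "PatientID", "PatientBirthDate",
                  "PatientAddress", "InstitutionName"]

def desensitize(in_path, out_path):
    ds = pydicom.dcmread(in_path)
    for tag in SENSITIVE_TAGS:
        if hasattr(ds, tag):
            setattr(ds, tag, "")        # blank out identifying information
    ds.remove_private_tags()            # drop vendor-specific private tags
    ds.save_as(out_path)
```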


Fig. 12.2 The Dr. Pecker platform GUI

The built-in DICOM viewer also reduces the complexity of unnecessary data transmission between multiple workstation client systems. Besides, the workstations serve as annotation tools, in which qualified readers may add graphics or textual descriptions regarding objects of interest. The workstation also supports relevance feedback, where users can give brief opinions on retrieved diagnostic results. Feedback from different readers on the same object is interpreted with different weights depending on the qualifications and experience of the readers. This feedback is eventually stored in the cloud storage and partially selected as potential samples to update the deep neural networks in use.

Standardized Reporting

Dr. Pecker automatically generates diagnosis results in the form of structured medical reports. Compared to radiology reports that are written manually as blocks of text, the structured reports from Dr. Pecker are rich in content, consisting of graphic visualizations for more intuitive interpretation and interactive tools such as links to prior reports or checklists ranked by significance. Such a structured report can easily be standardized in accordance with hospital regulations and radiology practice. Owing to the nature of structured data, Dr. Pecker can perform statistical analysis on its generated results to identify critical information, and such analytics could also be useful for epidemiology research or for department management and planning. In a simple scenario, a department using these analytics may conduct strategic staffing based on an analysis of the reading workload across different units. In terms of lexicon, Dr. Pecker adopts RadLex, a unified language of radiology terms defined by the RSNA, to establish the keywords used in the system. Introducing such widely accepted glossary terms reduces the chance of confusing subsequent unaffiliated readers of the reports and thus further improves readability.
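A minimal sketch of how feedback from readers with different experience levels could be merged into a single training label is shown below; the weighting scheme and vote format are assumptions for illustration, not the platform's actual rule.

```python
def aggregate_feedback(votes):
    """votes: list of (label, reader_weight) tuples, e.g. [("nodule", 3.0), ...]."""
    totals = {}
    for label, weight in votes:
        totals[label] = totals.get(label, 0.0) + weight
    # Return the label with the largest accumulated weight and its support ratio.
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())

# Example: two senior readers outweigh one junior reader's dissenting vote.
label, support = aggregate_feedback([("nodule", 2.0), ("vessel", 1.0), ("nodule", 3.0)])
```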


12.3 Clinical Applications

With the world's largest population, about a fifth of all global cancer cases occur in China, and cancer has become the leading cause of death in the country in recent years [1, 13]. There were 678,842 records of patients with invasive cancer diagnosed between 2003 and 2013 [14]. Cancer screening has been introduced to detect cancers at an early stage. However, a large-scale cancer screening program is almost impossible to run manually, because it requires a considerable amount of manual effort and because inter-reader disagreement is high when detecting early-stage cancers, especially in a multi-center setting where acquisition protocols and devices vary across centers. With Dr. Pecker, however, a large cancer screening trial can operate in a distributed environment. Different sites can communicate over cabled or even mobile internet using multimedia data formats such as video conferencing or voice messages. Besides, the Dr. Pecker platform uses deep learning technology to provide objective automated cancer detection and quantitative analysis, based on training over millions of well-annotated clinical examples. Based on a predefined confidence threshold, a patient with a high score from Dr. Pecker is referred to clinical experts for further reading, while the quantitative analysis also provides valuable information, such as volumes, ratios, and other characteristics of the volume of interest, for manual examination.

12.3.1 Disease Screening: Ophthalmic Screening as an Example

The incidence of retinal diseases has grown in recent years. The situation is as grim as in radiology, with ophthalmologists being understaffed: their numbers currently fall far short of the clinical needs for ophthalmology diagnosis and treatment planning, while at the same time the entire care pathway for retinal diseases depends on the interpretation of ophthalmic images. Therefore, it is natural to introduce CAD systems such as Dr. Pecker into the ophthalmology work-flow. The algorithms in Dr. Pecker for retinal images also mainly use deep learning as their base, while the innovations lie in image-enhancing technologies such as super-resolution reconstruction of low-resolution images, denoising, and vessel enhancement filtering. Many of these approaches are related to recent developments in low-level image processing.

Diabetic Retinopathy (DR) in Dr. Pecker

Diabetic retinopathy (DR) is one of the primary causes of blindness and vision loss in China. This ocular disease is a disorder of the retinal vasculature that eventually develops in nearly all patients with diabetes. Early detection is crucial to avoid permanent eyesight problems caused by DR.

Dr. Pecker offers a CAD system (DR-CAD) for automatic DR screening, which allows the identification of diabetic patients that need to be referred to an ophthalmologist. The system not only performs texture profiling on abnormal tissues but also takes surrounding tissues with abnormalities into account. The DR-CAD system uses an ensemble of deep learning algorithms that are initially trained to recognize various object structures, such as tubular objects, round objects, and diffuse patterns. The DR detection work-flow of Dr. Pecker is shown in Fig. 12.3; the automatically generated report provides a clear indication of whether the patient needs to be referred to an ophthalmologist, and an example DR-CAD report is shown in Fig. 12.4.

Fig. 12.3 Diabetic retinopathy work-flow in Dr. Pecker

Fig. 12.4 Report of DR in Dr. Pecker

For runtime efficiency, the deep learning models in Dr. Pecker run extremely fast: reading the 53,276 high-resolution images provided by the Kaggle Diabetic Retinopathy challenge [5] takes less than 26 min in total, and segmenting all lesions of clinical significance takes at most 5 s per image. At such speed, Dr. Pecker can serve as the first reader in clinical practice to conduct fast screening, with problematic cases further evaluated by clinical experts. The deep learning models also yield good test accuracy on the aforementioned Kaggle challenge, achieving a kappa score of 0.81022.
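The Kaggle challenge is scored with a quadratic weighted kappa, which can be reproduced with scikit-learn as in the short sketch below; the grade arrays are placeholders.

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 2, 4, 1, 0, 3]     # reference DR grades (placeholder values)
y_pred = [0, 2, 3, 1, 0, 3]     # model predictions (placeholder values)

# Quadratic weighting penalizes large grade disagreements more heavily.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"quadratic weighted kappa = {qwk:.4f}")
```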


These models are further retrained with a large home-grown dataset accumulated in multi-center settings. The company that develops the Dr. Pecker system has also conducted large-scale observer studies, whose results show that the collaboration between Dr. Pecker as the second reader and clinical experts can improve the overall reading accuracy by a large margin, outperforming all attending individuals.

Optical Coherence Tomography (OCT) Screening

Optical coherence tomography (OCT) is an imaging technology for biological tissue structure. In ophthalmic treatment, high-resolution images of retinal tissue obtained by OCT provide a precise diagnostic basis for doctors, and OCT has become the most commonly used diagnostic technology. With the rapid development of artificial intelligence, and by combining advanced techniques such as deep learning, transfer learning, and weakly supervised learning, Dr. Pecker offers an OCT assistant diagnosis module that can automatically screen blinding retinal diseases from retinal OCT images and provide diagnosis and treatment suggestions. It has been shown to reach an accuracy of more than 95%. The module can quickly and accurately identify choroidal neovascularization (CNV), diabetic macular edema (DME), drusen (DRUSEN), and normal OCT images, and determine whether patients need further diagnosis and treatment based on the identification results. The diagnostic results are similar to those of experienced human experts. As shown in Fig. 12.5, the OCT assistant diagnosis module can also visualize the location of potential retinal disease foci, intuitively showing doctors the basis for its results, which increases the transparency of the system and strengthens confidence in the diagnostic results.
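A minimal transfer-learning sketch for such a four-class OCT classifier (CNV, DME, drusen, normal) is given below, assuming PyTorch and torchvision; the backbone choice, hyper-parameters, and placeholder batch are illustrative and do not describe the deployed model.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # CNV, DME, DRUSEN, NORMAL

model = models.resnet18(pretrained=True)          # ImageNet-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative fine-tuning step on a batch of preprocessed OCT images.
images = torch.randn(8, 3, 224, 224)              # placeholder batch
targets = torch.randint(0, NUM_CLASSES, (8,))     # placeholder labels
optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```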

Fig. 12.5 Report of OCT in Dr. Pecker

The OCT system in Dr. Pecker has been used in cooperation with the Ophthalmology Hospital affiliated to Wenzhou Medical University to build an artificial-intelligence-based ophthalmic disease screening platform, which has achieved excellent clinical and social benefits. The platform provides free intelligent ophthalmic disease screening for the China Ophthalmology Alliance units operated by the Ophthalmology Hospital affiliated to Wenzhou Medical University, covering nearly 1500 units.

12.3.2 Lesion Detection and Segmentation

Lung cancer is the leading cause of cancer deaths worldwide. The stage of the cancer is highly correlated with the survival rate. Unfortunately, only 16% of lung cancer cases are diagnosed at an early stage. For distant tumors that have spread to other organs, the five-year survival rate is only 5% [12], compared to 55% for cancers diagnosed at an early stage. The lack of early diagnosis of lung cancer arises because similar symptoms, such as coughing, chest pain, or difficulty in breathing, can also occur in many other pulmonary pathologies; the more evident signs become noticeable only when the tumors are already large. To find lung cancer at an early stage, one possible approach is to screen high-risk subjects regularly through medical imaging examinations such as computed tomography (CT) or X-ray.

Lung Nodule Detection CAD System

Based on the results of the National Lung Screening Trial [10], lung cancer mortality in groups with specific high-risk factors can be reduced significantly by annual screening with low-dose computed tomography (LDCT). In practice, the lung cancer screening protocol is defined in two stages. The first is to find all visible nodules in the CT scan, which requires the most manual effort. Based on these findings, it is then essential to assess the malignancy probability of the nodules. Dr. Pecker designs 3D hourglass-shaped deep convolutional neural networks with dense and residual connections for pulmonary nodule detection. This approach abandons the conventional two-phase paradigm and trains a single network for end-to-end detection without an additional false-positive reduction process. In the Lung Nodule Analysis 2016 challenge (LUNA2016) [7], Dr. Pecker set a world record and ranked first in the world. We successfully transformed these technologies into a pulmonary nodule computer-aided detection (CADe) product through engineering. The user interface of lung nodule detection in Dr. Pecker is shown in Fig. 12.6. The system ranks the nodules of a given scan by a set of chosen variables, such as nodule size and solid ratio, so users can easily navigate to the nodules of interest. Dr. Pecker also performs automated lung and lobe segmentation to describe nodule locations in radiology terms.

Nodule detection usually only indicates nodule locations in a CT scan. In many cases, accurate segmentation is needed for quantification of nodule intensity, shape, and texture. Manual delineation of a lesion is expensive, and therefore Dr. Pecker uses a variation of the well-known U-Net [11] for automated segmentation of nodules. Based on the segmentation, Dr. Pecker also measures the severity of pathologies that are highly associated with lung cancer, such as emphysema. In addition, the system generates medical reports from a standard template so that they can be interpreted in consistent terms.
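The kind of quantification that follows nodule segmentation can be sketched as below, computing physical volume, an equivalent diameter, and mean intensity from a binary mask and the CT voxel spacing; the arrays and spacing are placeholders.

```python
import numpy as np

def quantify_nodule(ct_volume_hu, nodule_mask, spacing_mm):
    """ct_volume_hu: 3-D array in Hounsfield units; nodule_mask: boolean array;
    spacing_mm: (z, y, x) voxel spacing in millimetres."""
    voxel_volume_mm3 = float(np.prod(spacing_mm))
    n_voxels = int(nodule_mask.sum())
    volume_mm3 = n_voxels * voxel_volume_mm3
    # Equivalent spherical diameter is a common size surrogate for nodules.
    diameter_mm = 2.0 * ((3.0 * volume_mm3) / (4.0 * np.pi)) ** (1.0 / 3.0)
    mean_hu = float(ct_volume_hu[nodule_mask].mean()) if n_voxels else float("nan")
    return {"volume_mm3": volume_mm3, "diameter_mm": diameter_mm, "mean_hu": mean_hu}
```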


Fig. 12.6 User interface of lung nodule detection in Dr. Pecker

Chest X-Ray CAD System

Chest X-ray examination is the most basic method for screening thoracic diseases. Compared to CT, DR (digital radiography) examination has many advantages, such as a wide application range, few restrictions on use, and low cost. However, DR screening has a higher rate of misdiagnosis and missed diagnosis than CT, since it only presents a single projection of the body. The application of deep learning technology to DR screening can effectively assist doctors in making a more accurate diagnosis. Focusing on the problem of disease detection in X-ray images, Dr. Pecker leverages transfer learning to interpret 2D images. To develop such a system efficiently, Dr. Pecker collects a large-scale data set containing chest CT scans and X-rays from different medical centers. These data are then labeled remotely using its cloud services. Labels on CT imaging are treated as weak supervision, and each labeled CT scan is sliced into 2D images in the axial, coronal, and sagittal views. Finally, Dr. Pecker trains deep learning models in a semi-supervised fashion under the guidance of the manually annotated X-rays, while also taking the CT slices into consideration. Dr. Pecker takes an iterative approach to training and annotation: by taking advantage of existing models to generate temporary references, the labeling process is reduced to marking the regions containing mistakes. Once new labels are available, the deep models can be re-evaluated and updated. The user interface of Dr. Pecker for analyzing X-rays is shown in Fig. 12.7.

The Dr. Pecker platform has been successfully rolled out to 87 Chest Alliance hospitals in Henan Province of China, offering automated screening for chest diseases. It is currently benefiting more than 950 million people in Henan Province and other parts of central China. The Dr. Pecker platform provides complete user manuals and video tutorials to help radiologists get started.
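Slicing a labeled CT volume into the three orthogonal 2D views, as described above, can be done with a few lines of NumPy; the volume array is a placeholder.

```python
import numpy as np

def slice_volume(volume):
    """volume: 3-D array indexed as (z, y, x); returns lists of 2-D slices."""
    axial    = [volume[z, :, :] for z in range(volume.shape[0])]
    coronal  = [volume[:, y, :] for y in range(volume.shape[1])]
    sagittal = [volume[:, :, x] for x in range(volume.shape[2])]
    return axial, coronal, sagittal

ct = np.zeros((120, 512, 512), dtype=np.int16)   # placeholder CT volume
axial, coronal, sagittal = slice_volume(ct)
```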


Fig. 12.7 GUI of chest x-ray in Dr. Pecker

Fig. 12.8 Dr. Pecker (JianPeiCAD Team) won the World LiTS2018

Liver Tumor Segmentation

The liver tumor module provides liver tissue segmentation, liver tumor segmentation, and quantitative analysis functions. Based on the characteristics of CT images of liver tumors, Dr. Pecker designs a hybrid model of two-dimensional and three-dimensional neural networks to overcome a series of problems that cause inaccurate automatic segmentation of liver tumors: blurred tumor boundaries, low contrast with normal tissue, complex structure, gray-level diversity, and other artifacts. Through the use of deep learning in a holistic fashion, Dr. Pecker yields superior segmentation performance compared to standard stage-wise solutions, where mistakes made at earlier steps may propagate to later steps. Dr. Pecker won the World Liver Tumor Segmentation Challenge (LiTS) 2018 with a Dice score of 0.7320 per case. A visualization of the liver segmentation results is shown in Fig. 12.8. The LiTS competition is jointly organized by the Technical University of Munich, Tel Aviv University in Israel, and other universities and research institutes, together with MICCAI, the top annual conference on medical image analysis. At that time, 1524 teams from 35 countries participated.
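The per-case Dice score used in the LiTS evaluation can be computed from a predicted and a reference tumor mask as in the sketch below (boolean arrays assumed).

```python
import numpy as np

def dice_score(pred_mask, ref_mask, eps=1e-7):
    """Dice coefficient between two binary masks; 1.0 means perfect overlap."""
    pred = pred_mask.astype(bool)
    ref = ref_mask.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    return (2.0 * intersection + eps) / (pred.sum() + ref.sum() + eps)
```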


12.3.3 Diagnosis and Risk Prediction

Malignancy Analysis of Lung Nodules

Once the lesions are detected and quantitative analysis is performed, the next step is to make a diagnosis and assess the risk of mortality. Taking pulmonary nodules as an example, nodule characteristics and malignancy-relevant factors should be measured and identified carefully. Follow-up scans are needed to track the progression of suspicious nodules, and additional imaging such as contrast-enhanced CT may be ordered to differentiate tissues with suspicious nodules. Once nodules are found to be highly suspicious on CT examination, a follow-up pathology examination is needed to confirm the malignancy. Dr. Pecker measures the malignancy of detected nodules with ConvNets, which boils down to solving a binary classification problem: benign versus malignant. In addition to the use of ConvNets, it designs a mixture of models that combine the power of ConvNets with conventional rule-based or feature-based predictions of nodule malignancy. The most well-known rule-based predictor of nodule malignancy is the PanCan model [8], which formulates a lung cancer risk model from a set of variables that are highly correlated with nodule malignancy, such as nodule type, nodule location, the number of nodules, the presence of spiculation, and nodule size. In Dr. Pecker, these variables are calculated based on the nodule segmentation and merged with the probabilities of the ConvNet to produce the final risk factor that designates a nodule as malignant.

Breast Cancer Screening

Breast cancer is the most common type of cancer among women. Early diagnosis and screening of breast cancer help improve the 5-year survival rate of patients and have great clinical significance. Since the 1980s, researchers have proposed computer-aided diagnostic methods for molybdenum-target (mammography) images, most of them based on traditional CAD algorithms. Since 2012, deep learning has gradually become the mainstream methodology of computer vision. Dr. Pecker uses the detection algorithm Mask R-CNN [4] as its main algorithm structure. First, multi-level convolution and pooling of the mammograms are performed to extract features automatically; the feature maps are then fed into a region proposal network (RPN) [2] to extract regions of interest (ROIs) automatically, and the ROIs are mapped back onto the feature maps. Finally, the mapped feature maps are used to predict the categorical probabilities and locations of lesions. A visualization of Dr. Pecker's detection results on masses in breast imaging is shown in Fig. 12.9.

Fig. 12.9 GUI of mammography in Dr. Pecker
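One simple way to merge a ConvNet probability with rule-based nodule variables is a logistic meta-model, sketched below; the feature set, example values, and labels are illustrative assumptions and are unrelated to the actual PanCan coefficients or to Dr. Pecker's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-nodule features:
# [cnn_probability, diameter_mm, is_spiculated, is_upper_lobe, nodule_count]
X = np.array([[0.92, 14.0, 1, 1, 2],
              [0.10,  4.5, 0, 0, 5],
              [0.55,  8.0, 0, 1, 1],
              [0.80, 11.0, 1, 0, 3]])
y = np.array([1, 0, 0, 1])            # placeholder confirmed malignancy labels

meta = LogisticRegression().fit(X, y)
risk = meta.predict_proba(X)[:, 1]    # fused malignancy risk per nodule
```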

12.4 Conclusion

In this chapter, we have introduced a variety of clinical applications of the Dr. Pecker platform and briefly explained the design principles of the algorithms used in Dr. Pecker for each application.

It is evident that, with a large amount of training scans available, deep learning algorithms can perform comparably to human observers in many common imaging analysis tasks, and in some particular cases CAD systems are even superior. As the first reader, CAD can pick up relevant abnormalities from a large population and refer only the high-risk cases to clinical experts, saving a large amount of time and manual effort. As the second reader, a CAD system can provide accurate quantitative analysis and prediction based on training over a vast knowledge base (i.e., a large number of training data), eventually improving decision making in clinical practice. The Dr. Pecker platform offers fully automated detection of various diseases in a wide range of imaging modalities. It provides easy integration of algorithms from third-party vendors, enabling its functionality to be enriched rapidly. The company also maintains collaborations with multiple national hospitals and universities throughout China under common research interests. Dr. Pecker has built solid relationships with many industrial partners that provide high-performance computing hardware and software infrastructure. In addition, third parties share the profits from the Dr. Pecker platform, which further motivates them to work together with Dr. Pecker. Dr. Pecker is being used in more than 600 hospitals in China, and more than 100,000 cases per day are read by Dr. Pecker. As a successful example of a cloud-based CAD platform built on deep learning algorithms, Dr. Pecker entered the large-scale science and technology program "Excellent Wisdom" co-sponsored by the Chinese Academy of Sciences and the national television broadcaster CCTV, as shown in Fig. 12.10.


Fig. 12.10 The Dr. Pecker platform shown on CCTV

References

1. Ferlay, J., et al.: Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136(5), E359–E386 (2015)
2. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
3. Gourd, E.: UK radiologist staffing crisis reaches critical levels. Lancet Oncol. 18(11), e651 (2017). ISSN: 1470-2045
4. He, K., et al.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. IEEE (2017)
5. Kaggle Diabetic Retinopathy Challenge. https://www.kaggle.com/c/diabetic-retinopathy-detection
6. Liver Tumor Segmentation Challenge. https://competitions.codalab.org/competitions/17094
7. LUNA 2016 Nodule Detection Grand Challenge. https://luna16.grand-challenge.org/
8. McWilliams, A., et al.: Probability of cancer in pulmonary nodules detected on first screening CT. N. Engl. J. Med. 369(10), 910–919 (2013)
9. Melendez, J., et al.: An automated tuberculosis screening strategy combining X-ray-based computer-aided detection and clinical information. Sci. Rep. 6, 25265 (2016)
10. National Lung Screening Trial Research Team: The national lung screening trial: overview and study design. Radiology 258(1), 243–253 (2011)
11. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, Berlin (2015)
12. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics, 2015. CA: Cancer J. Clin. 65(1), 5–29 (2015)
13. Torre, L.A., et al.: Global cancer statistics, 2012. CA: Cancer J. Clin. 65(2), 87–108 (2015)
14. Zeng, H., et al.: Changing cancer survival in China during 2003–15: a pooled analysis of 17 population-based cancer registries. Lancet Glob. Health 6(5), e555–e567 (2018)

Author Index

C
Chen, Danny Z., 95
Cheng, Guohua, 203
Chen, Qingqing, 33, 149
Chen, Tingting, 95
Chen, Yen-Wei, 33, 53, 79, 149, 181

F
Feng, Ruiwei, 95
Foruzan, Amir Hossein, 79

G
García Ocaña, María Inmaculada, 3, 17
González Ballester, Miguel Ángel, 3, 17
Guo, Ruoqian, 95

H
Han, Xian-Hua, 33, 149, 181
He, Linyang, 203
Hirano, Yasuhi, 165
Hu, Hongjie, 33, 149

I
Iwamoto, Yutaro, 33, 53, 149

K
Kido, Shoji, 111, 165
Kuremoto, Takashi, 165

L
Lete Urzelai, Nerea, 3, 17
Lian, Chunfeng, 127
Liang, Dong, 33
Li, Huali, 149
Lin, Lanfen, 33, 149
Lin, Zhiwen, 95
Liu, Mingxia, 127
Liu, Xuechen, 95
Li, Yinhao, 53
López-Linares Román, Karen, 3, 17
Lu, Yifei, 95

M
Mabu, Shingo, 165
Macía Oliver, Iván, 3, 17
Mohagheghi, Saeed, 79

P
Peng, Liying, 149

S
Sakanashi, Hidenori, 111
Shen, Dinggang, 127
Shouno, Hayaru, 111
Suzuki, Aiga, 111

T
Tong, Ruofeng, 149

W
Wang, Dan, 149
Wang, Weibin, 33
Wang, Wenzhe, 95
Wang, Yanjie, 95
Wu, Jian, 95, 149

Z
Zhang, Qiaowei, 33, 149