Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2017 : 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings. Part III 978-3-319-66179-7, 3319661795, 978-3-319-66178-0

The three-volume set LNCS 10433, 10434, and 10435 constitutes the refereed proceedings of the 20th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2017, held in Quebec City, QC, Canada, in September 2017.


Language: English. Pages: 713 [737]. Year: 2017.


Table of contents :
Front Matter ....Pages I-XXVI
Front Matter ....Pages 1-1
Deep Multi-task Multi-channel Learning for Joint Classification and Regression of Brain Status (Mingxia Liu, Jun Zhang, Ehsan Adeli, Dinggang Shen)....Pages 3-11
Nonlinear Feature Space Transformation to Improve the Prediction of MCI to AD Conversion (Pin Zhang, Bibo Shi, Charles D. Smith, Jundong Liu)....Pages 12-20
Kernel Generalized-Gaussian Mixture Model for Robust Abnormality Detection (Nitin Kumar, Ajit V. Rajwade, Sharat Chandran, Suyash P. Awate)....Pages 21-29
Latent Processes Governing Neuroanatomical Change in Aging and Dementia (Christian Wachinger, Anna Rieckmann, Martin Reuter)....Pages 30-37
A Multi-armed Bandit to Smartly Select a Training Set from Big Medical Data (Benjamín Gutiérrez, Loïc Peter, Tassilo Klein, Christian Wachinger)....Pages 38-45
Multi-level Multi-task Structured Sparse Learning for Diagnosis of Schizophrenia Disease (Mingliang Wang, Xiaoke Hao, Jiashuang Huang, Kangcheng Wang, Xijia Xu, Daoqiang Zhang)....Pages 46-54
An Unbiased Penalty for Sparse Classification with Application to Neuroimaging Data (Li Zhang, Dana Cobzas, Alan Wilman, Linglong Kong)....Pages 55-63
Unsupervised Feature Learning for Endomicroscopy Image Retrieval (Yun Gu, Khushi Vyas, Jie Yang, Guang-Zhong Yang)....Pages 64-71
Maximum Mean Discrepancy Based Multiple Kernel Learning for Incomplete Multimodality Neuroimaging Data (Xiaofeng Zhu, Kim-Han Thung, Ehsan Adeli, Yu Zhang, Dinggang Shen)....Pages 72-80
Liver Tissue Classification in Patients with Hepatocellular Carcinoma by Fusing Structured and Rotationally Invariant Context Representation (John Treilhard, Susanne Smolka, Lawrence Staib, Julius Chapiro, MingDe Lin, Georgy Shakirin et al.)....Pages 81-88
DOTE: Dual cOnvolutional filTer lEarning for Super-Resolution and Cross-Modality Synthesis in MRI (Yawen Huang, Ling Shao, Alejandro F. Frangi)....Pages 89-98
Supervised Intra-embedding of Fisher Vectors for Histopathology Image Classification (Yang Song, Hang Chang, Heng Huang, Weidong Cai)....Pages 99-106
GSplit LBI: Taming the Procedural Bias in Neuroimaging for Disease Prediction (Xinwei Sun, Lingjing Hu, Yuan Yao, Yizhou Wang)....Pages 107-115
MRI-Based Surgical Planning for Lumbar Spinal Stenosis (Gabriele Abbati, Stefan Bauer, Sebastian Winklhofer, Peter J. Schüffler, Ulrike Held, Jakob M. Burgstaller et al.)....Pages 116-124
Pattern Visualization and Recognition Using Tensor Factorization for Early Differential Diagnosis of Parkinsonism (Rui Li, Ping Wu, Igor Yakushev, Jian Wang, Sibylle I. Ziegler, Stefan Förster et al.)....Pages 125-133
Physiological Parameter Estimation from Multispectral Images Unleashed (Sebastian J. Wirkert, Anant S. Vemuri, Hannes G. Kenngott, Sara Moccia, Michael Götz, Benjamin F. B. Mayer et al.)....Pages 134-141
Segmentation of Cortical and Subcortical Multiple Sclerosis Lesions Based on Constrained Partial Volume Modeling (Mário João Fartaria, Alexis Roche, Reto Meuli, Cristina Granziera, Tobias Kober, Meritxell Bach Cuadra)....Pages 142-149
Classification of Pancreatic Cysts in Computed Tomography Images Using a Random Forest and Convolutional Neural Network Ensemble (Konstantin Dmitriev, Arie E. Kaufman, Ammar A. Javed, Ralph H. Hruban, Elliot K. Fishman, Anne Marie Lennon et al.)....Pages 150-158
Classification of Major Depressive Disorder via Multi-site Weighted LASSO Model (Dajiang Zhu, Brandalyn C. Riedel, Neda Jahanshad, Nynke A. Groenewold, Dan J. Stein, Ian H. Gotlib et al.)....Pages 159-167
A Multi-atlas Approach to Region of Interest Detection for Medical Image Classification (Hongzhi Wang, Mehdi Moradi, Yaniv Gur, Prasanth Prasanna, Tanveer Syeda-Mahmood)....Pages 168-176
Spectral Graph Convolutions for Population-Based Disease Prediction (Sarah Parisot, Sofia Ira Ktena, Enzo Ferrante, Matthew Lee, Ricardo Guerrerro Moreno, Ben Glocker et al.)....Pages 177-185
Predicting Future Disease Activity and Treatment Responders for Multiple Sclerosis Patients Using a Bag-of-Lesions Brain Representation (Andrew Doyle, Doina Precup, Douglas L. Arnold, Tal Arbel)....Pages 186-194
Sparse Multi-kernel Based Multi-task Learning for Joint Prediction of Clinical Scores and Biomarker Identification in Alzheimer’s Disease (Peng Cao, Xiaoli Liu, Jinzhu Yang, Dazhe Zhao, Osmar Zaiane)....Pages 195-202
Front Matter ....Pages 203-203
Personalized Diagnosis for Alzheimer’s Disease (Yingying Zhu, Minjeong Kim, Xiaofeng Zhu, Jin Yan, Daniel Kaufer, Guorong Wu)....Pages 205-213
GP-Unet: Lesion Detection from Weak Labels with a 3D Regression Network (Florian Dubost, Gerda Bortsova, Hieab Adams, Arfan Ikram, Wiro J. Niessen, Meike Vernooij et al.)....Pages 214-221
Deep Supervision for Pancreatic Cyst Segmentation in Abdominal CT Scans (Yuyin Zhou, Lingxi Xie, Elliot K. Fishman, Alan L. Yuille)....Pages 222-230
Error Corrective Boosting for Learning Fully Convolutional Networks with Limited Data (Abhijit Guha Roy, Sailesh Conjeti, Debdoot Sheet, Amin Katouzian, Nassir Navab, Christian Wachinger)....Pages 231-239
Direct Detection of Pixel-Level Myocardial Infarction Areas via a Deep-Learning Algorithm (Chenchu Xu, Lei Xu, Zhifan Gao, Shen Zhao, Heye Zhang, Yanping Zhang et al.)....Pages 240-249
Skin Disease Recognition Using Deep Saliency Features and Multimodal Learning of Dermoscopy and Clinical Images (Zongyuan Ge, Sergey Demyanov, Rajib Chakravorty, Adrian Bowling, Rahil Garnavi)....Pages 250-258
Boundary Regularized Convolutional Neural Network for Layer Parsing of Breast Anatomy in Automated Whole Breast Ultrasound (Cheng Bian, Ran Lee, Yi-Hong Chou, Jie-Zhi Cheng)....Pages 259-266
Zoom-in-Net: Deep Mining Lesions for Diabetic Retinopathy Detection (Zhe Wang, Yanxin Yin, Jianping Shi, Wei Fang, Hongsheng Li, Xiaogang Wang)....Pages 267-275
Full Quantification of Left Ventricle via Deep Multitask Learning Network Respecting Intra- and Inter-Task Relatedness (Wufeng Xue, Andrea Lum, Ashley Mercado, Mark Landis, James Warrington, Shuo Li)....Pages 276-284
Scalable Multimodal Convolutional Networks for Brain Tumour Segmentation (Lucas Fidon, Wenqi Li, Luis C. Garcia-Peraza-Herrera, Jinendra Ekanayake, Neil Kitchen, Sebastien Ourselin et al.)....Pages 285-293
Pathological OCT Retinal Layer Segmentation Using Branch Residual U-Shape Networks (Stefanos Apostolopoulos, Sandro De Zanet, Carlos Ciller, Sebastian Wolf, Raphael Sznitman)....Pages 294-301
Quality Assessment of Echocardiographic Cine Using Recurrent Neural Networks: Feasibility on Five Standard View Planes (Amir H. Abdi, Christina Luong, Teresa Tsang, John Jue, Ken Gin, Darwin Yeung et al.)....Pages 302-310
Semi-supervised Deep Learning for Fully Convolutional Networks (Christoph Baur, Shadi Albarqouni, Nassir Navab)....Pages 311-319
TandemNet: Distilling Knowledge from Medical Images Using Diagnostic Reports as Optional Semantic References (Zizhao Zhang, Pingjun Chen, Manish Sapkota, Lin Yang)....Pages 320-328
BRIEFnet: Deep Pancreas Segmentation Using Binary Sparse Convolutions (Mattias P. Heinrich, Ozan Oktay)....Pages 329-337
Supervised Action Classifier: Approaching Landmark Detection as Image Partitioning (Zhoubing Xu, Qiangui Huang, JinHyeong Park, Mingqing Chen, Daguang Xu, Dong Yang et al.)....Pages 338-346
Robust Multi-modal MR Image Synthesis (Thomas Joyce, Agisilaos Chartsias, Sotirios A. Tsaftaris)....Pages 347-355
Segmentation of Intracranial Arterial Calcification with Deeply Supervised Residual Dropout Networks (Gerda Bortsova, Gijs van Tulder, Florian Dubost, Tingying Peng, Nassir Navab, Aad van der Lugt et al.)....Pages 356-364
Clinical Target-Volume Delineation in Prostate Brachytherapy Using Residual Neural Networks (Emran Mohammad Abu Anas, Saman Nouranian, S. Sara Mahdavi, Ingrid Spadinger, William J. Morris, Septimu E. Salcudean et al.)....Pages 365-373
Using Convolutional Neural Networks to Automatically Detect Eye-Blink Artifacts in Magnetoencephalography Without Resorting to Electrooculography (Prabhat Garg, Elizabeth Davenport, Gowtham Murugesan, Ben Wagner, Christopher Whitlow, Joseph Maldjian et al.)....Pages 374-381
Image Super Resolution Using Generative Adversarial Networks and Local Saliency Maps for Retinal Image Analysis (Dwarikanath Mahapatra, Behzad Bozorgtabar, Sajini Hewavitharanage, Rahil Garnavi)....Pages 382-390
Synergistic Combination of Learned and Hand-Crafted Features for Prostate Lesion Classification in Multiparametric Magnetic Resonance Imaging (Davood Karimi, Dan Ruan)....Pages 391-398
Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation (Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, Danny Z. Chen)....Pages 399-407
Deep Adversarial Networks for Biomedical Image Segmentation Utilizing Unannotated Images (Yizhe Zhang, Lin Yang, Jianxu Chen, Maridel Fredericksen, David P. Hughes, Danny Z. Chen)....Pages 408-416
Medical Image Synthesis with Context-Aware Generative Adversarial Networks (Dong Nie, Roger Trullo, Jun Lian, Caroline Petitjean, Su Ruan, Qian Wang et al.)....Pages 417-425
Joint Detection and Diagnosis of Prostate Cancer in Multi-parametric MRI Based on Multimodal Convolutional Neural Networks (Xin Yang, Zhiwei Wang, Chaoyue Liu, Hung Minh Le, Jingyu Chen, Kwang-Ting (Tim) Cheng et al.)....Pages 426-434
SD-Layer: Stain Deconvolutional Layer for CNNs in Medical Microscopic Imaging (Rahul Duggal, Anubha Gupta, Ritu Gupta, Pramit Mallick)....Pages 435-443
X-Ray In-Depth Decomposition: Revealing the Latent Structures (Shadi Albarqouni, Javad Fotouhi, Nassir Navab)....Pages 444-452
Fast Prospective Detection of Contrast Inflow in X-ray Angiograms with Convolutional Neural Network and Recurrent Neural Network (Hua Ma, Pierre Ambrosini, Theo van Walsum)....Pages 453-461
Quantification of Metabolites in Magnetic Resonance Spectroscopic Imaging Using Machine Learning (Dhritiman Das, Eduardo Coello, Rolf F. Schulte, Bjoern H. Menze)....Pages 462-470
Building Disease Detection Algorithms with Very Small Numbers of Positive Samples (Ken C. L. Wong, Alexandros Karargyris, Tanveer Syeda-Mahmood, Mehdi Moradi)....Pages 471-479
Hierarchical Multimodal Fusion of Deep-Learned Lesion and Tissue Integrity Features in Brain MRIs for Distinguishing Neuromyelitis Optica from Multiple Sclerosis (Youngjin Yoo, Lisa Y. W. Tang, Su-Hyun Kim, Ho Jin Kim, Lisa Eunyoung Lee, David K. B. Li et al.)....Pages 480-488
Deep Convolutional Encoder-Decoders for Prostate Cancer Detection and Classification (Atilla P. Kiraly, Clement Abi Nader, Ahmet Tuysuzoglu, Robert Grimm, Berthold Kiefer, Noha El-Zehiry et al.)....Pages 489-497
Deep Image-to-Image Recurrent Network with Shape Basis Learning for Automatic Vertebra Labeling in Large-Scale 3D CT Volumes (Dong Yang, Tao Xiong, Daguang Xu, S. Kevin Zhou, Zhoubing Xu, Mingqing Chen et al.)....Pages 498-506
Automatic Liver Segmentation Using an Adversarial Image-to-Image Network (Dong Yang, Daguang Xu, S. Kevin Zhou, Bogdan Georgescu, Mingqing Chen, Sasa Grbic et al.)....Pages 507-515
Transfer Learning for Domain Adaptation in MRI: Application in Brain Lesion Segmentation (Mohsen Ghafoorian, Alireza Mehrtash, Tina Kapur, Nico Karssemeijer, Elena Marchiori, Mehran Pesteie et al.)....Pages 516-524
Retinal Microaneurysm Detection Using Clinical Report Guided Multi-sieving CNN (Ling Dai, Bin Sheng, Qiang Wu, Huating Li, Xuhong Hou, Weiping Jia et al.)....Pages 525-532
Lesion Detection and Grading of Diabetic Retinopathy via Two-Stages Deep Convolutional Neural Networks (Yehui Yang, Tao Li, Wensi Li, Haishan Wu, Wei Fan, Wensheng Zhang)....Pages 533-540
Hashing with Residual Networks for Image Retrieval (Sailesh Conjeti, Abhijit Guha Roy, Amin Katouzian, Nassir Navab)....Pages 541-549
Deep Multiple Instance Hashing for Scalable Medical Image Retrieval (Sailesh Conjeti, Magdalini Paschali, Amin Katouzian, Nassir Navab)....Pages 550-558
Accurate Pulmonary Nodule Detection in Computed Tomography Images Using Deep Convolutional Neural Networks (Jia Ding, Aoxue Li, Zhiqiang Hu, Liwei Wang)....Pages 559-567
Discriminative Localization in CNNs for Weakly-Supervised Segmentation of Pulmonary Nodules (Xinyang Feng, Jie Yang, Andrew F. Laine, Elsa D. Angelini)....Pages 568-576
Liver Lesion Detection Based on Two-Stage Saliency Model with Modified Sparse Autoencoder (Yixuan Yuan, Max Q.-H. Meng, Wenjian Qin, Lei Xing)....Pages 577-585
Manifold Learning of COPD (Felix J. S. Bragman, Jamie R. McClelland, Joseph Jacob, John R. Hurst, David J. Hawkes)....Pages 586-593
Hybrid Mass Detection in Breast MRI Combining Unsupervised Saliency Analysis and Deep Learning (Guy Amit, Omer Hadad, Sharon Alpert, Tal Tlusty, Yaniv Gur, Rami Ben-Ari et al.)....Pages 594-602
Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram Classification (Wentao Zhu, Qi Lou, Yeeleng Scott Vang, Xiaohui Xie)....Pages 603-611
Segmentation-Free Kidney Localization and Volume Estimation Using Aggregated Orthogonal Decision CNNs (Mohammad Arafat Hussain, Alborz Amir-Khalili, Ghassan Hamarneh, Rafeef Abugharbieh)....Pages 612-620
Progressive and Multi-path Holistically Nested Neural Networks for Pathological Lung Segmentation from CT Images (Adam P. Harrison, Ziyue Xu, Kevin George, Le Lu, Ronald M. Summers, Daniel J. Mollura)....Pages 621-629
Automated Pulmonary Nodule Detection via 3D ConvNets with Online Sample Filtering and Hybrid-Loss Residual Learning (Qi Dou, Hao Chen, Yueming Jin, Huangjing Lin, Jing Qin, Pheng-Ann Heng)....Pages 630-638
CASED: Curriculum Adaptive Sampling for Extreme Data Imbalance (Andrew Jesson, Nicolas Guizard, Sina Hamidi Ghalehjegh, Damien Goblot, Florian Soudan, Nicolas Chapados)....Pages 639-646
Intra-perinodular Textural Transition (Ipris): A 3D Descriptor for Nodule Diagnosis on Lung CT (Mehdi Alilou, Mahdi Orooji, Anant Madabhushi)....Pages 647-655
Transferable Multi-model Ensemble for Benign-Malignant Lung Nodule Classification on Chest CT (Yutong Xie, Yong Xia, Jianpeng Zhang, David Dagan Feng, Michael Fulham, Weidong Cai)....Pages 656-664
Deep Reinforcement Learning for Active Breast Lesion Detection from DCE-MRI (Gabriel Maicas, Gustavo Carneiro, Andrew P. Bradley, Jacinto C. Nascimento, Ian Reid)....Pages 665-673
Pancreas Segmentation in MRI Using Graph-Based Decision Fusion on Convolutional Neural Networks (Jinzheng Cai, Le Lu, Yuanpu Xie, Fuyong Xing, Lin Yang)....Pages 674-682
Modeling Cognitive Trends in Preclinical Alzheimer’s Disease (AD) via Distributions over Permutations (Gregory Plumb, Lindsay Clark, Sterling C. Johnson, Vikas Singh)....Pages 683-691
Does Manual Delineation only Provide the Side Information in CT Prostate Segmentation? (Yinghuan Shi, Wanqi Yang, Yang Gao, Dinggang Shen)....Pages 692-700
Back Matter ....Pages 701-713


LNCS 10435

Maxime Descoteaux · Lena Maier-Hein · Alfred Franz · Pierre Jannin · D. Louis Collins · Simon Duchesne (Eds.)

Medical Image Computing and Computer Assisted Intervention − MICCAI 2017
20th International Conference
Quebec City, QC, Canada, September 11–13, 2017
Proceedings, Part III

Springer

Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, Lancaster, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Zurich, Switzerland
John C. Mitchell, Stanford University, Stanford, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany

10435

More information about this series at http://www.springer.com/series/7412

Maxime Descoteaux · Lena Maier-Hein · Alfred Franz · Pierre Jannin · D. Louis Collins · Simon Duchesne (Eds.)





Medical Image Computing and Computer Assisted Intervention − MICCAI 2017
20th International Conference
Quebec City, QC, Canada, September 11–13, 2017
Proceedings, Part III

Springer

Editors
Maxime Descoteaux, Université de Sherbrooke, Sherbrooke, QC, Canada
Lena Maier-Hein, DKFZ, Heidelberg, Germany
Alfred Franz, Ulm University of Applied Sciences, Ulm, Germany
Pierre Jannin, Université de Rennes 1, Rennes, France
D. Louis Collins, McGill University, Montreal, QC, Canada
Simon Duchesne, Université Laval, Québec, QC, Canada

ISSN 0302-9743 (print)  ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-319-66178-0  ISBN 978-3-319-66179-7 (eBook)
DOI 10.1007/978-3-319-66179-7
Library of Congress Control Number: 2017951405
LNCS Sublibrary: SL6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics

© Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

We are very proud to present the conference proceedings for the 20th Medical Image Computing and Computer Assisted Intervention (MICCAI) conference, which was successfully held at the Quebec City Conference Center, September 11–13, 2017, in Quebec City, Canada. Ce fut un plaisir et une fierté de vous recevoir tous et chacun à Québec, berceau de la culture francophone en Amérique du Nord¹. The MICCAI 2017 conference, alongside 45 satellite events held on September 10th and 14th, attracted hundreds of world-leading scientists, engineers, and clinicians involved in medical image processing, medical image formation, and computer-assisted medical procedures. You will find assembled in this three-volume Lecture Notes in Computer Science (LNCS) publication the proceedings for the main conference, selected after a thoughtful, insightful, and diligent double-blind review process, which was organized in several phases, described below. The preliminary phase of the review process happened before the curtain was raised, so to speak, as the Program Chairs made the decision to move MICCAI towards novel conference management tools in increasingly common use in the computer vision and machine learning community. These included the Conference Management Toolkit for paper submissions and reviews (https://cmt.research.microsoft.com); the Toronto Paper Matching System (http://torontopapermatching.org/) for automatic paper assignment to area chairs and reviewers; and Researcher.CC (http://researcher.cc) to handle conflicts between authors, area chairs, and reviewers.

¹ It was our pleasure and pride to welcome you each and all to Quebec, the cradle of French-speaking culture in North America.


The first phase consisted of the management of abstracts per se. In total, 800 submissions were received, from over 1,150 intentions to submit. As seen in Fig. 1, of those submissions, 80% were considered as pure Medical Image Computing (MIC), 14% as pure Computer Assisted Intervention (CAI), and 6% as MICCAI papers that fitted into both MIC and CAI areas. Of note, 21% of the papers were submitted by a female first author.

Fig. 1. Incoming manuscript distribution

Phase 1 of the review process of each paper was handled by an area chair and three reviewers. There was a total of 52 area chairs selected with expertise as shown in Fig. 2. Noticeably, 50% were from the Americas, 35% from Europe, and 15% from Asia, with 44% women.

Fig. 2. PC distribution

Each area chair had 14 to 20 papers to handle, and each reviewer committed to review from 3 to 6 papers. We had a total of 627 reviewers, with expertise as detailed in Fig. 3, of whom 20% were women. To assign reviewers to each submitted manuscript, we first used the Toronto Paper Matching System to give each paper a ranked list of reviewers. Second, area chairs, blinded to authorship, re-ordered and ranked the reviewers assigned to each paper. Finally, the Conference Management Toolkit made the final assignment of papers automatically, using the Toronto Paper Matching System scores and the rankings from area chairs while balancing the workload among all reviewers.

Fig. 3. Reviewer distribution


Based on the Phase 1 double-blind reviews and rebuttals sent specifically to area chairs, 152 papers were directly accepted and 405 papers were directly rejected, giving the distribution shown in Fig. 4. The remaining 243 borderline papers went into Phase 2 of the review process. The area chair first ranked the remaining Phase 1 papers, and a second area chair then ranked the same papers. Papers on which both area chair rankings agreed (both in the top 50% or both in the bottom 50%) were accepted or rejected accordingly, and the remaining papers were categorized as Phase 2 borderline papers. This process yielded 103 borderline papers, 217 accepted papers, and 471 rejected papers, as shown in Fig. 5. Finally, the reviews, the area chair rankings, and the associated rebuttals were subsequently discussed in person among the Program Committee (PC) members during the MICCAI 2017 PC meeting that took place in Quebec City, Canada, May 10–12, 2017, with 38 out of 52 PC members in attendance. This process led to the acceptance of another 38 papers and the rejection of 65 papers. In total, 255 of the 800 submitted papers were accepted, for an overall acceptance rate of 32% (Fig. 6), with 45 accepted papers (18%) having a female first author (164 papers were submitted by a female first author).

Fig. 4. Phase 1 results
Fig. 5. Phase 2 results
Fig. 6. Final results

For these proceedings, the 255 papers have been organized into 15 groups as follows:
• Volume LNCS 10433 includes: Atlas and Surface-Based Techniques (14 manuscripts), Shape and Patch-Based Techniques (11), Registration Techniques (15), Functional Imaging, Connectivity and Brain Parcellation (17), Diffusion Magnetic Resonance Imaging (MRI) & Tensor/Fiber Processing (20), and Image Segmentation and Modelling (12).
• Volume LNCS 10434 includes: Optical Imaging (18 manuscripts), Airway and Vessel Analysis (10), Motion and Cardiac Analysis (16), Tumor Processing (9), Planning and Simulation for Medical Interventions (11), Interventional Imaging and Navigation (14), and Medical Image Computing (8).
• Volume LNCS 10435 includes: Feature Extraction and Classification Techniques (23 manuscripts) and Machine Learning in Medical Image Computing (56).


In closing, we would like to thank the specific individuals who contributed greatly to the success of MICCAI 2017 and the quality of its proceedings. These include the Satellite Events Committee, led by Tal Arbel with co-chairs Jorge Cardoso, Parvin Mousavi, Kevin Whittingstall, and Leo Grady; the other members of the Organizing Committee, including Mallar Chakravarty (social), Mert Sabuncu (MICCAI 2016), Julia Schnabel (MICCAI 2018), and Caroline Worreth and her team of volunteers and professionals; the MICCAI Society, for its support and insightful comments; and our partners, for their financial support and their presence on site. We are especially grateful to all members of the PC for their diligent work in helping to prepare the technical program, as well as to the reviewers for their support during the entire process. Last but not least, we thank the authors, co-authors, students, and supervisors who toiled away to produce work of exceptional quality that maintains MICCAI as a beacon of savoir-faire and expertise not to be missed. We look forward to seeing you in Granada, Spain – Au plaisir de vous revoir en 2018!

August 2017

Maxime Descoteaux Lena Maier-Hein Alfred Franz Pierre Jannin D. Louis Collins Simon Duchesne

Organization

General Chair
Simon Duchesne, Université Laval, Québec, Canada

Program Chair
Maxime Descoteaux, Université de Sherbrooke, Sherbrooke, Canada

General and Program Co-chair
D. Louis Collins, McGill University, Montreal, Canada

Program Co-chairs
Lena Maier-Hein, German Cancer Research Center, Heidelberg, Germany
Alfred Franz, Ulm University of Applied Sciences, Ulm, Germany
Pierre Jannin, Université de Rennes 1, Rennes, France

Satellite Events Chair
Tal Arbel, McGill University, Montreal, Canada

Satellite Events Co-chairs
Jorge Cardoso (Workshops), University College London, London, UK
Parvin Mousavi (Challenges), Queen's University, Kingston, Canada
Kevin Whittingstall (Tutorials), Université de Sherbrooke, Sherbrooke, Canada
Leo Grady (Tutorials), Heartflow, Redwood City, California

Social Chair
Mallar Chakravarty, McGill University, Montreal, Canada

Past and Future MICCAI Chairs
Mert Sabuncu (MICCAI 2016), Cornell University, Ithaca, USA
Julia Schnabel (MICCAI 2018), King's College London, London, UK


Program Committee
Ismail B. Ayed, Ecoles des Technologies Superieures (ETS) Montreal
Meritxell Bach, Lausanne University and University Hospital
Sylvain Bouix, Brigham and Women's Hospital
Weidong Cai, University of Sydney
Philippe C. Cattin, University of Basel
Elvis Chen, Robarts Research Institute
Jun Cheng, Institute for Infocomm Research
Albert C. Chung, The Hong Kong University of Science and Technology
Marleen de Bruijne, Erasmus MC, The Netherlands/University of Copenhagen, Denmark
Stefanie Demirci, Technical University of Munich
Caroline Essert, University of Strasbourg ICube
Gabor Fichtinger, Queen's University
Alejandro Frangi, University of Sheffield
Stamatia Giannarou, Imperial College London
Junzhou Huang, University of Texas at Arlington
Ivana Isgum, University Medical Center Utrecht
Ameet Jain, Philips Corporate Research
Pierre-Marc Jodoin, Université de Sherbrooke
Samuel Kadoury, Polytechnique Montreal
Marta Kersten, Concordia University
Su-Lin Lee, Imperial College London
Shuo Li, Western University
Rui Liao, Siemens Medical Solutions USA
Tianming Liu, University of Georgia
Herve J. Lombaert, Ecoles des Technologies Superieures (ETS) Montreal
Xiongbiao Luo, INSERM
Klaus Maier-Hein, German Cancer Research Center
Diana Mateus, Technische Universität München
Lauren J. O'Donnell, Brigham and Women's Hospital and Harvard Medical School
Ingerid Reinertsen, SINTEF
Tammy Riklin Raviv, Ben Gurion University
Hassan Rivaz, Concordia University
Clarisa Sanchez, Radboud University Medical Center
Benoit Scherrer, Boston Children's Hospital, Harvard Medical School
Julia A. Schnabel, King's College London
Li Shen, Indiana University
Amber Simpson, Memorial Sloan Kettering Cancer Center
Stefanie Speidel, Karlsruhe Institute of Technology
Ronald M. Summers, National Institutes of Health (NIH)
Raphael Sznitman, University of Bern
Pallavi Tiwari, Case Western Reserve University
Duygu Tosun, University of California San Francisco


Gozde Unal, Istanbul Technical University
Ragini Verma, University of Pennsylvania
Sandrine Voros, INSERM, TIMC-IMAG
Linwei Wang, Rochester Institute of Technology
Qian Wang, Shanghai University
Demian Wassermann, INRIA Sophia Antipolis
Yanwu Xu, Institute for Infocomm Research
Pew-Thian Yap, University of North Carolina at Chapel Hill
Guoyan Zheng, University of Bern
S. Kevin Zhou, Siemens Healthineers Technology Center

Additional Reviewers Aly A. John A. Aly Abdelrahim Ehsan Adeli Iman Aganj Priya Aggarwal Ola Ahmad Shazia Akbar Saad Ullah Akram Amir Alansary Jialin Alansary Shadi Albarqouni Daniel C. Alexander Sharib Ali Riza Alp Guler Guy Amit Elsa Angelini John Ashburner Rahman Attar Paolo Avesani Suyash P. Awate Dogu Aydogan Shekoofeh Azizi Hossein Azizpour Noura Azzabou Ulas Bagci Wenjia Bai Spyridon Bakas Jordan Bano Siqi Bao

Adrian Barbu Anton Bardera Christian Barillot Adrien Bartoli Christian Baumgartner Christoph Baur Maximilian Baust Pierre-Louis Bazin Christos Bergeles Olivier Bernard Boris C. Bernhardt Boris Bernhardt Arnav Bhavsar Marie Bieth Emad M. Boctor Sebastian Bodenstedt Hrvoje Bogunovic Sethu K. Boopathy Jegathambal Louis Borgeat Gerda Bortsova Frédéric Branchaud-Charron Jovan Brankov Joerg Bredno Paul A. Bromiley Michael S. Brown Robert Brown Aurelien Bustin Ryan P. Cabeen Jinzheng Cai Yunliang Cai




Xiaohuan Cao Tian Cao Gustavo Carneiro Isaac Casm M. Emre Celebi Suheyla Cetin Lotfi Chaari Vimal Chandran Pierre Chatelain Alessandro Chen Alvin Chen Antong Chen Chao Chen Geng Chen Hao Chen Jiawei Chen Terrence Chen Xiaobo Chen Li Cheng Jie-Zhi Cheng Erkang Cheng Veronika Cheplygina Gary Christensen Daan Christiaens Chengwen Chu Philippe Cinquin Cedric Clouchoux Toby Collins Olivier Commowick Sailesh Conjeti Tim Cootes Marc-Alexandre Cote Martin Cousineau Juan D. Adrian V. Dalca Sune Darkner Dhritiman Das Benoit M. Dawant Benjamin De Leener Johan Debayle Alperen Degirmenci Herve Delingette Maxime Descoteaux Nishikant Deshmukh Samuel Deslauriers-Gauthier Christian Desrosiers

Jwala Dhamala Meng Ding Christophe Doignon Jose Dolz Pei Dong Xiao Dong Qi Dou Simon Drouin Karen Drukker Lei Du Lixin Duan Florian Dubost Nicolas Duchateau James S. Duncan Luc Duong Meng Duong Nicha C. Dvornek Ahmet Ekin Mohammed S.M. Elbaz Erin Elizabeth Randy E. Ellis Noha El-Zehiry Guray Erus Juan Eugenio Pascal Fallavollita Mohsen Farzi Aaron Fenster Henrique C. Fernandes Enzo Ferrante Patryk Filipiak James Fishbaugh P. Thomas Fletcher Vladimir S. Fonov Denis Fortun Moti Freiman Benjamin Frisch Huazhu Fu Guillermo Gallardo Melanie Ganz Yi Gao Mingchen Gao Xieping Gao Zhifan Gao Amanmeet Garg Mona K. Garvin Romane Gauriau


Bao Ge Guido Gerig Sara Gharabaghi Sandesh Ghimire Ali Gholipour Gabriel Girard Mario Valerio V. Giuffrida Ben Glocker Michael Goetz Polina Golland Alberto Gomez German Gonzalez Miguel A. González Ballester Ali Gooya Shiri Gordon Pietro Gori Matthias Guenther Yanrong Guo Anubha Gupta Benjamin Gutierrez Becker Boris Gutman Séverine Habert Ilker Hacihaliloglu Stathis Hadjidemetriou Benjamin D. Haeffele Justin Haldar Andac Hamamci Ghassan Hamarneh Noura Hamze Rabia Haq Adam P. Harrison Hoda Sadat Hashemi Peter Hastreiter Charles Hatt Mohammad Havaei Dave Hawkes Lei He Tiancheng He Mohamed S. Hefny Tobias Heimann Mattias P. Heinrich Christoph Hennersperger Carlos Hernandez-Matas Matt Higger Byung-Woo Hong Qingqi Hong

Yi Hong Nicolas Honnorat Robert D. Howe Kai Hu Yipeng Hu Heng Huang Xiaolei Huang Yawen Huang Sarfaraz Hussein Juan E. Iglesias Laura Igual Atsushi Imiya Madhura Ingalhalikar Jiro Inoue Vamsi Ithapu Seong Jae Mayoore S. Jaiswal Amir Jamaludin Vincent Jaouen Uditha L. Jayarathne Shuiwang Ji Dongsheng Jiang Menglin Jiang Xi Jiang Xiaoyi Jiang Dakai Jin Marie-Pierre Jolly Anand Joshi Shantanu Joshi Leo Joskowicz Christoph Jud Bernhard Kainz Ioannis Kakadiaris Siva Teja Kakileti Verena Kaynig-Fittkau Guillaume Kazmitcheff Aneurin Kennerley Erwan Kerrien April Khademi Siavash Khallaghi Bishesh Khanal Ron Kikinis Boklye Kim Edward Kim Jaeil Kim Benjamin Kimia




Andrew King Jan Klein Stefan Klein Tobias Kober Simon Kohl Ender Konukoglu Nedialko Krouchev Frithjof Kruggel Elizabeth Krupinski Ashnil Kumar Prashnna Kumar Punithakumar Kumaradevan Takio Kurita Sebastian Kurtek Roland Kwitt Jan Kybic Aymen Laadhari Alexander Ladikos ALain Lalande Pablo Lamata Bennett A. Landman Georg Langs Carole Lartizien Tobias Lasser Toni Lassila Andras Lasso Matthieu Le Chen-Yu Lee Sing Chun Lee Julien Lefevre Boudewijn Lelieveldt Christophe Lenglet Wee Kheng Leow Gang Li Qingyang Li Rongjian Li Wenqi Li Xiaomeng Li Chunfeng Lian Jianming Liang Hongen Liao Ruizhi Liao Ben Lin Jianyu Lin Fujun Liu Jianfei Liu

Kefei Liu Liu Liu Jundong Liu Mingxia Liu Sidong Liu Nicolas Loménie Cristian Lorenz Marco Lorenzi Nicolas Loy Rodas Cheng Lu Le Lu Jianwen Luo Zhiming Luo Kai Ma Anderson Maciel Dwarikanath Mahapatra Gabriel Maicas Sokratis Makrogiannis Anand Malpani Tommaso Mansi Giovanni Maria Oge Marques Stephen Marsland Anne L. Martel Gassan Massarweh Michael McCann Steven McDonagh Stephen McKenna Bjoern H. Menze Kim Minjeong Marc Modat Pim Moeskops Kelvin Mok Mehdi Moradi Rodrigo Moreno Kensaku Mori Agata Mosinska Jayanta Mukhopadhyay Anirban Mukhopadhyay Arrate Munoz-Barrutia Maria Murgasova Arya Nabavi Saad Nadeem Layan Nahlawi Laurent Najman Tim Nattkemper


Peter Neher Dong Ni Dong Nie Marc Niethammer Christophoros Nikou Lipeng Ning Alison Noble Ipek Oguz Arnau Oliver Ee Ping Ong John A. Onofrey Eliza Orasanu Felipe Orihuela-Espina Silas N. Ørting David Owen Danielle F. Pace Blas Pagador Sharath Pankanti Xenophon Papademetris Bartlomiej Papiez Michael Paquette Sarah Parisot Nicolas Passat Gennaro Percannella Sérgio Pereira Loic Peter Igor Peterlik Jens Petersen Caroline Petitjean Simon Pezold Dzung L. Pham Pramod K. Pisharady Stephen Pizer Rosalie Plantefeve Josien Pluim Kilian Pohl JB Poline Philippe Poulin Dipti Prasad Prateek Prasanna Marcel Prastawa Philip Pratt Bernhard Preim Raphael Prevost Jerry L. Prince

Xiaoning Qian Xiang R. Frank R. Mehdi Rahim Yogesh Rathi Nishant Ravikumar Pradeep Reddy Raamana Xiaojun Regis Joseph Reinhardt Islem Rekik Markus Rempfler Mauricio Reyes Gerard R. Ridgway Nicola Rieke Laurent Risser David Robben Emma Robinson Antonio Robles-Kelly Marc-Michel Rohé Robert Rohling Karl Rohr Timo Roine Eduardo Romero James C. Ross Arun Ross Daniel Rueckert Daniel Ruijters Olivier Salvado Ryan Sanford Gerard Sanromà Imari Sato Peter Savadjiev Dustin Scheinost Thomas Schultz Christof Seiler Lama Seoud Abhay Shah Mahsa Shakeri Yeqin Shao Bibo Shi Chaoyang Shi Pengcheng Shi Rakesh Shiradkar Kaleem Siddiqi Viviana Siless




Joseph R. Singapogu Ayushi Sinha Arkadiusz Sitek Jayanthi Sivaswamy Greg Slabaugh Dirk Smeets Ahmed Soliman Stefan Sommer Yang Song Lauge Sorensen Aristeidis Sotiras Lawrence H. Staib Aymeric Stamm Marius Staring Darko Stern Danail Stoyanov Colin Studholme Martin Styner Hai Su Jian Sun Ganesh Sundaramoorthi Ali Taalimi Sylvain Takerkart Toru Tamaki Olena Tankyevych Chris Taylor Philippe Thevenaz Paul Thienphrapa Bertrand Thirion Zhiqiang Tian Hamid R. Tizhoosh Matthew Toews Olivia Tong Yubing Tong Akif Burak Tosun Daniel Toth Emanuele Trucco Sotirios A. Tsaftaris Birkan Tunc Carole Twining Tamas Ungi Martin Urschler Mustafa Uzunbas Régis Vaillant An-An van

Nanda van Koen Van Leemput Gijs van Tulder Theo van Walsum Gael Varoquaux Francisco Vasconcelos Gopalkrishna B. Veni Tom Vercauteren Ujjwal Verma François-Xavier Vialard Satish Viswanath Frans Vos Tomaž Vrtovec Tao Wan Zhangyang Wang Bo Wang Chaohui Wang Hongzhi Wang Hua Wang Junyan Wang Lei Wang Li Wang Manning Wang Xiaosong Wang Zhiyong Wang Simon K. Warfield Stijn Wee Wolfgang Wein Fr Werner Rene Werner Daniel Wesierski Carl-Fredrik Westin Ross T. Whitaker Kevin Whittingstall Matthias Wilms Adam Wittek Paul Wohlhart Jelmer M. Wolterink Ken C.L. Wong Ken Wong Jonghye Woo Pengcheng Xi James J. Xia Wenfeng Xia Lei Xiang


Yiming Xiao Long Xie Yuanpu Xie Fuyong Xing Jing Xiong Daguang Xu Yan Xu Zheng Xu Zhoubing Xu Ziyue Xu Zenglin Xu Jingwen Yan Ke Yan Pingkun Yan Feng Yang Guang Yang Jie Yang Lin Yang Xiao Yang Xing Yang Jiawen Yao Jianhua Yao Chuyang Ye Jinhua Yu Weimin Yu Cheng Yuan Oliver Zettinig Yiqiang Zhan Fan Zhang

Han Zhang Jie Zhang Jiong Zhang Le Zhang Lichi Zhang Lin Zhang Ling Zhang Miaomiao Zhang Shu Zhang Jun Zhang Yu Zhang Liang Zhao Shijie Zhao Yitian Zhao Qingyu Zhao Yinqiang Zheng Jiayu Zhou Luping Zhou Tao Zhou Xiaofeng Zhu Weifang Zhu Xinliang Zhu Yingying Zhu Xiahai Zhuang Aneeq Zia Stephan Zidowitz Lilla Zollei Clement Zotti Reyer Zwiggelaar


Contents – Part III

Feature Extraction and Classification Techniques Deep Multi-task Multi-channel Learning for Joint Classification and Regression of Brain Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mingxia Liu, Jun Zhang, Ehsan Adeli, and Dinggang Shen

3

Nonlinear Feature Space Transformation to Improve the Prediction of MCI to AD Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pin Zhang, Bibo Shi, Charles D. Smith, and Jundong Liu

12

Kernel Generalized-Gaussian Mixture Model for Robust Abnormality Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nitin Kumar, Ajit V. Rajwade, Sharat Chandran, and Suyash P. Awate

21

Latent Processes Governing Neuroanatomical Change in Aging and Dementia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Wachinger, Anna Rieckmann, and Martin Reuter

30

A Multi-armed Bandit to Smartly Select a Training Set from Big Medical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Benjamín Gutiérrez, Loïc Peter, Tassilo Klein, and Christian Wachinger

38

Multi-level Multi-task Structured Sparse Learning for Diagnosis of Schizophrenia Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mingliang Wang, Xiaoke Hao, Jiashuang Huang, Kangcheng Wang, Xijia Xu, and Daoqiang Zhang An Unbiased Penalty for Sparse Classification with Application to Neuroimaging Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Li Zhang, Dana Cobzas, Alan Wilman, and Linglong Kong Unsupervised Feature Learning for Endomicroscopy Image Retrieval. . . . . . . Yun Gu, Khushi Vyas, Jie Yang, and Guang-Zhong Yang Maximum Mean Discrepancy Based Multiple Kernel Learning for Incomplete Multimodality Neuroimaging Data . . . . . . . . . . . . . . . . . . . . Xiaofeng Zhu, Kim-Han Thung, Ehsan Adeli, Yu Zhang, and Dinggang Shen Liver Tissue Classification in Patients with Hepatocellular Carcinoma by Fusing Structured and Rotationally Invariant Context Representation. . . . . John Treilhard, Susanne Smolka, Lawrence Staib, Julius Chapiro, MingDe Lin, Georgy Shakirin, and James S. Duncan

46

55 64

72

81



DOTE: Dual cOnvolutional filTer lEarning for Super-Resolution and Cross-Modality Synthesis in MRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yawen Huang, Ling Shao, and Alejandro F. Frangi

89

Supervised Intra-embedding of Fisher Vectors for Histopathology Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yang Song, Hang Chang, Heng Huang, and Weidong Cai

99

GSplit LBI: Taming the Procedural Bias in Neuroimaging for Disease Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinwei Sun, Lingjing Hu, Yuan Yao, and Yizhou Wang

107

MRI-Based Surgical Planning for Lumbar Spinal Stenosis . . . . . . . . . . . . . . Gabriele Abbati, Stefan Bauer, Sebastian Winklhofer, Peter J. Schüffler, Ulrike Held, Jakob M. Burgstaller, Johann Steurer, and Joachim M. Buhmann Pattern Visualization and Recognition Using Tensor Factorization for Early Differential Diagnosis of Parkinsonism . . . . . . . . . . . . . . . . . . . . . . . . . . . Rui Li, Ping Wu, Igor Yakushev, Jian Wang, Sibylle I. Ziegler, Stefan Förster, Sung-Cheng Huang, Markus Schwaiger, Nassir Navab, Chuantao Zuo, and Kuangyu Shi Physiological Parameter Estimation from Multispectral Images Unleashed . . . Sebastian J. Wirkert, Anant S. Vemuri, Hannes G. Kenngott, Sara Moccia, Michael Götz, Benjamin F.B. Mayer, Klaus H. Maier-Hein, Daniel S. Elson, and Lena Maier-Hein Segmentation of Cortical and Subcortical Multiple Sclerosis Lesions Based on Constrained Partial Volume Modeling . . . . . . . . . . . . . . . . . . . . . Mário João Fartaria, Alexis Roche, Reto Meuli, Cristina Granziera, Tobias Kober, and Meritxell Bach Cuadra Classification of Pancreatic Cysts in Computed Tomography Images Using a Random Forest and Convolutional Neural Network Ensemble. . . . . . Konstantin Dmitriev, Arie E. Kaufman, Ammar A. Javed, Ralph H. Hruban, Elliot K. Fishman, Anne Marie Lennon, and Joel H. Saltz Classification of Major Depressive Disorder via Multi-site Weighted LASSO Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dajiang Zhu, Brandalyn C. Riedel, Neda Jahanshad, Nynke A. Groenewold, Dan J. Stein, Ian H. Gotlib, Matthew D. Sacchet, Danai Dima, James H. Cole, Cynthia H.Y. Fu, Henrik Walter, Ilya M. Veer, Thomas Frodl, Lianne Schmaal, Dick J. Veltman, and Paul M. Thompson

116

125

134

142

150

159


A Multi-atlas Approach to Region of Interest Detection for Medical Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hongzhi Wang, Mehdi Moradi, Yaniv Gur, Prasanth Prasanna, and Tanveer Syeda-Mahmood Spectral Graph Convolutions for Population-Based Disease Prediction . . . . . . Sarah Parisot, Sofia Ira Ktena, Enzo Ferrante, Matthew Lee, Ricardo Guerrerro Moreno, Ben Glocker, and Daniel Rueckert


168

177

Predicting Future Disease Activity and Treatment Responders for Multiple Sclerosis Patients Using a Bag-of-Lesions Brain Representation . . . . . . . . . . Andrew Doyle, Doina Precup, Douglas L. Arnold, and Tal Arbel

186

Sparse Multi-kernel Based Multi-task Learning for Joint Prediction of Clinical Scores and Biomarker Identification in Alzheimer’s Disease . . . . . Peng Cao, Xiaoli Liu, Jinzhu Yang, Dazhe Zhao, and Osmar Zaiane

195

Machine Learning in Medical Image Computing Personalized Diagnosis for Alzheimer’s Disease . . . . . . . . . . . . . . . . . . . . . Yingying Zhu, Minjeong Kim, Xiaofeng Zhu, Jin Yan, Daniel Kaufer, and Guorong Wu GP-Unet: Lesion Detection from Weak Labels with a 3D Regression Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Florian Dubost, Gerda Bortsova, Hieab Adams, Arfan Ikram, Wiro J. Niessen, Meike Vernooij, and Marleen De Bruijne Deep Supervision for Pancreatic Cyst Segmentation in Abdominal CT Scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuyin Zhou, Lingxi Xie, Elliot K. Fishman, and Alan L. Yuille Error Corrective Boosting for Learning Fully Convolutional Networks with Limited Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Abhijit Guha Roy, Sailesh Conjeti, Debdoot Sheet, Amin Katouzian, Nassir Navab, and Christian Wachinger Direct Detection of Pixel-Level Myocardial Infarction Areas via a Deep-Learning Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chenchu Xu, Lei Xu, Zhifan Gao, Shen Zhao, Heye Zhang, Yanping Zhang, Xiuquan Du, Shu Zhao, Dhanjoo Ghista, and Shuo Li Skin Disease Recognition Using Deep Saliency Features and Multimodal Learning of Dermoscopy and Clinical Images . . . . . . . . . . . Zongyuan Ge, Sergey Demyanov, Rajib Chakravorty, Adrian Bowling, and Rahil Garnavi

205

214

222

231

240

250



Boundary Regularized Convolutional Neural Network for Layer Parsing of Breast Anatomy in Automated Whole Breast Ultrasound . . . . . . . . . . . . . Cheng Bian, Ran Lee, Yi-Hong Chou, and Jie-Zhi Cheng Zoom-in-Net: Deep Mining Lesions for Diabetic Retinopathy Detection. . . . . Zhe Wang, Yanxin Yin, Jianping Shi, Wei Fang, Hongsheng Li, and Xiaogang Wang Full Quantification of Left Ventricle via Deep Multitask Learning Network Respecting Intra- and Inter-Task Relatedness . . . . . . . . . . . . . . . . . . . . . . . Wufeng Xue, Andrea Lum, Ashley Mercado, Mark Landis, James Warrington, and Shuo Li Scalable Multimodal Convolutional Networks for Brain Tumour Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lucas Fidon, Wenqi Li, Luis C. Garcia-Peraza-Herrera, Jinendra Ekanayake, Neil Kitchen, Sebastien Ourselin, and Tom Vercauteren Pathological OCT Retinal Layer Segmentation Using Branch Residual U-Shape Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefanos Apostolopoulos, Sandro De Zanet, Carlos Ciller, Sebastian Wolf, and Raphael Sznitman Quality Assessment of Echocardiographic Cine Using Recurrent Neural Networks: Feasibility on Five Standard View Planes . . . . . . . . . . . . . Amir H. Abdi, Christina Luong, Teresa Tsang, John Jue, Ken Gin, Darwin Yeung, Dale Hawley, Robert Rohling, and Purang Abolmaesumi Semi-supervised Deep Learning for Fully Convolutional Networks . . . . . . . . Christoph Baur, Shadi Albarqouni, and Nassir Navab

259 267

276

285

294

302

311

TandemNet: Distilling Knowledge from Medical Images Using Diagnostic Reports as Optional Semantic References . . . . . . . . . . . . . . . . . . . . . . . . . . Zizhao Zhang, Pingjun Chen, Manish Sapkota, and Lin Yang

320

BRIEFnet: Deep Pancreas Segmentation Using Binary Sparse Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mattias P. Heinrich and Ozan Oktay

329

Supervised Action Classifier: Approaching Landmark Detection as Image Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhoubing Xu, Qiangui Huang, JinHyeong Park, Mingqing Chen, Daguang Xu, Dong Yang, David Liu, and S. Kevin Zhou Robust Multi-modal MR Image Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . Thomas Joyce, Agisilaos Chartsias, and Sotirios A. Tsaftaris

338

347


Segmentation of Intracranial Arterial Calcification with Deeply Supervised Residual Dropout Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gerda Bortsova, Gijs van Tulder, Florian Dubost, Tingying Peng, Nassir Navab, Aad van der Lugt, Daniel Bos, and Marleen De Bruijne Clinical Target-Volume Delineation in Prostate Brachytherapy Using Residual Neural Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Emran Mohammad Abu Anas, Saman Nouranian, S. Sara Mahdavi, Ingrid Spadinger, William J. Morris, Septimu E. Salcudean, Parvin Mousavi, and Purang Abolmaesumi Using Convolutional Neural Networks to Automatically Detect Eye-Blink Artifacts in Magnetoencephalography Without Resorting to Electrooculography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prabhat Garg, Elizabeth Davenport, Gowtham Murugesan, Ben Wagner, Christopher Whitlow, Joseph Maldjian, and Albert Montillo Image Super Resolution Using Generative Adversarial Networks and Local Saliency Maps for Retinal Image Analysis . . . . . . . . . . . . . . . . . Dwarikanath Mahapatra, Behzad Bozorgtabar, Sajini Hewavitharanage, and Rahil Garnavi Synergistic Combination of Learned and Hand-Crafted Features for Prostate Lesion Classification in Multiparametric Magnetic Resonance Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Davood Karimi and Dan Ruan Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and Danny Z. Chen Deep Adversarial Networks for Biomedical Image Segmentation Utilizing Unannotated Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yizhe Zhang, Lin Yang, Jianxu Chen, Maridel Fredericksen, David P. Hughes, and Danny Z. Chen Medical Image Synthesis with Context-Aware Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dong Nie, Roger Trullo, Jun Lian, Caroline Petitjean, Su Ruan, Qian Wang, and Dinggang Shen Joint Detection and Diagnosis of Prostate Cancer in Multi-parametric MRI Based on Multimodal Convolutional Neural Networks . . . . . . . . . . . . . Xin Yang, Zhiwei Wang, Chaoyue Liu, Hung Minh Le, Jingyu Chen, Kwang-Ting (Tim) Cheng, and Liang Wang


356

365

374

382

391

399

408

417

426



SD-Layer: Stain Deconvolutional Layer for CNNs in Medical Microscopic Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rahul Duggal, Anubha Gupta, Ritu Gupta, and Pramit Mallick X-Ray In-Depth Decomposition: Revealing the Latent Structures . . . . . . . . . Shadi Albarqouni, Javad Fotouhi, and Nassir Navab

435 444

Fast Prospective Detection of Contrast Inflow in X-ray Angiograms with Convolutional Neural Network and Recurrent Neural Network . . . . . . . Hua Ma, Pierre Ambrosini, and Theo van Walsum

453

Quantification of Metabolites in Magnetic Resonance Spectroscopic Imaging Using Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dhritiman Das, Eduardo Coello, Rolf F. Schulte, and Bjoern H. Menze

462

Building Disease Detection Algorithms with Very Small Numbers of Positive Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ken C.L. Wong, Alexandros Karargyris, Tanveer Syeda-Mahmood, and Mehdi Moradi Hierarchical Multimodal Fusion of Deep-Learned Lesion and Tissue Integrity Features in Brain MRIs for Distinguishing Neuromyelitis Optica from Multiple Sclerosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Youngjin Yoo, Lisa Y.W. Tang, Su-Hyun Kim, Ho Jin Kim, Lisa Eunyoung Lee, David K.B. Li, Shannon Kolind, Anthony Traboulsee, and Roger Tam Deep Convolutional Encoder-Decoders for Prostate Cancer Detection and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Atilla P. Kiraly, Clement Abi Nader, Ahmet Tuysuzoglu, Robert Grimm, Berthold Kiefer, Noha El-Zehiry, and Ali Kamen Deep Image-to-Image Recurrent Network with Shape Basis Learning for Automatic Vertebra Labeling in Large-Scale 3D CT Volumes . . . . . . . . . Dong Yang, Tao Xiong, Daguang Xu, S. Kevin Zhou, Zhoubing Xu, Mingqing Chen, JinHyeong Park, Sasa Grbic, Trac D. Tran, Sang Peter Chin, Dimitris Metaxas, and Dorin Comaniciu Automatic Liver Segmentation Using an Adversarial Image-to-Image Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dong Yang, Daguang Xu, S. Kevin Zhou, Bogdan Georgescu, Mingqing Chen, Sasa Grbic, Dimitris Metaxas, and Dorin Comaniciu

471

480

489

498

507


Transfer Learning for Domain Adaptation in MRI: Application in Brain Lesion Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohsen Ghafoorian, Alireza Mehrtash, Tina Kapur, Nico Karssemeijer, Elena Marchiori, Mehran Pesteie, Charles R.G. Guttmann, Frank-Erik de Leeuw, Clare M. Tempany, Bram van Ginneken, Andriy Fedorov, Purang Abolmaesumi, Bram Platel, and William M. Wells III Retinal Microaneurysm Detection Using Clinical Report Guided Multi-sieving CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ling Dai, Bin Sheng, Qiang Wu, Huating Li, Xuhong Hou, Weiping Jia, and Ruogu Fang Lesion Detection and Grading of Diabetic Retinopathy via Two-Stages Deep Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yehui Yang, Tao Li, Wensi Li, Haishan Wu, Wei Fan, and Wensheng Zhang


516

525

533

Hashing with Residual Networks for Image Retrieval . . . . . . . . . . . . . . . . . Sailesh Conjeti, Abhijit Guha Roy, Amin Katouzian, and Nassir Navab

541

Deep Multiple Instance Hashing for Scalable Medical Image Retrieval . . . . . Sailesh Conjeti, Magdalini Paschali, Amin Katouzian, and Nassir Navab

550

Accurate Pulmonary Nodule Detection in Computed Tomography Images Using Deep Convolutional Neural Networks. . . . . . . . . . . . . . . . . . . . . . . . Jia Ding, Aoxue Li, Zhiqiang Hu, and Liwei Wang

559

Discriminative Localization in CNNs for Weakly-Supervised Segmentation of Pulmonary Nodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinyang Feng, Jie Yang, Andrew F. Laine, and Elsa D. Angelini

568

Liver Lesion Detection Based on Two-Stage Saliency Model with Modified Sparse Autoencoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yixuan Yuan, Max Q.-H. Meng, Wenjian Qin, and Lei Xing

577

Manifold Learning of COPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Felix J.S. Bragman, Jamie R. McClelland, Joseph Jacob, John R. Hurst, and David J. Hawkes Hybrid Mass Detection in Breast MRI Combining Unsupervised Saliency Analysis and Deep Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guy Amit, Omer Hadad, Sharon Alpert, Tal Tlusty, Yaniv Gur, Rami Ben-Ari, and Sharbell Hashoul Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wentao Zhu, Qi Lou, Yeeleng Scott Vang, and Xiaohui Xie

586

594

603



Segmentation-Free Kidney Localization and Volume Estimation Using Aggregated Orthogonal Decision CNNs . . . . . . . . . . . . . . . . . . . . . . Mohammad Arafat Hussain, Alborz Amir-Khalili, Ghassan Hamarneh, and Rafeef Abugharbieh Progressive and Multi-path Holistically Nested Neural Networks for Pathological Lung Segmentation from CT Images . . . . . . . . . . . . . . . . . Adam P. Harrison, Ziyue Xu, Kevin George, Le Lu, Ronald M. Summers, and Daniel J. Mollura Automated Pulmonary Nodule Detection via 3D ConvNets with Online Sample Filtering and Hybrid-Loss Residual Learning . . . . . . . . . Qi Dou, Hao Chen, Yueming Jin, Huangjing Lin, Jing Qin, and Pheng-Ann Heng CASED: Curriculum Adaptive Sampling for Extreme Data Imbalance . . . . . . Andrew Jesson, Nicolas Guizard, Sina Hamidi Ghalehjegh, Damien Goblot, Florian Soudan, and Nicolas Chapados Intra-perinodular Textural Transition (Ipris): A 3D Descriptor for Nodule Diagnosis on Lung CT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mehdi Alilou, Mahdi Orooji, and Anant Madabhushi Transferable Multi-model Ensemble for Benign-Malignant Lung Nodule Classification on Chest CT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yutong Xie, Yong Xia, Jianpeng Zhang, David Dagan Feng, Michael Fulham, and Weidong Cai Deep Reinforcement Learning for Active Breast Lesion Detection from DCE-MRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gabriel Maicas, Gustavo Carneiro, Andrew P. Bradley, Jacinto C. Nascimento, and Ian Reid

612

621

630

639

647

656

665

Pancreas Segmentation in MRI Using Graph-Based Decision Fusion on Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinzheng Cai, Le Lu, Yuanpu Xie, Fuyong Xing, and Lin Yang

674

Modeling Cognitive Trends in Preclinical Alzheimer’s Disease (AD) via Distributions over Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gregory Plumb, Lindsay Clark, Sterling C. Johnson, and Vikas Singh

683

Does Manual Delineation only Provide the Side Information in CT Prostate Segmentation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yinghuan Shi, Wanqi Yang, Yang Gao, and Dinggang Shen

692

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

701

Feature Extraction and Classification Techniques

Deep Multi-task Multi-channel Learning for Joint Classification and Regression of Brain Status

Mingxia Liu, Jun Zhang, Ehsan Adeli, and Dinggang Shen(B)

Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
[email protected]

Abstract. Jointly identifying brain diseases and predicting clinical scores have attracted increasing attention in the domain of computer-aided diagnosis using magnetic resonance imaging (MRI) data, since these two tasks are highly correlated. Although several joint learning models have been developed, most existing methods focus on using human-engineered features extracted from MRI data. Due to the possible heterogeneous property between human-engineered features and subsequent classification/regression models, those methods may lead to sub-optimal learning performance. In this paper, we propose a deep multi-task multi-channel learning (DM2L) framework for simultaneous classification and regression for brain disease diagnosis, using MRI data and personal information (i.e., age, gender, and education level) of subjects. Specifically, we first identify discriminative anatomical landmarks from MR images in a data-driven manner, and then extract multiple image patches around these detected landmarks. A deep multi-task multi-channel convolutional neural network is then developed for joint disease classification and clinical score regression. We train our model on a large multi-center cohort (i.e., ADNI-1) and test it on an independent cohort (i.e., ADNI-2). Experimental results demonstrate that DM2L is superior to the state-of-the-art approaches in brain disease diagnosis.

1 Introduction

For the challenging and interesting task of computer-aided diagnosis of Alzheimer's disease (AD) and its prodromal stage (i.e., mild cognitive impairment, MCI), brain morphometric pattern analysis has been widely investigated to identify disease-related imaging biomarkers from structural magnetic resonance imaging (MRI) [1–3]. Compared with other widely used biomarkers (e.g., cerebrospinal fluid), MRI provides a non-invasive solution to potentially identify abnormal structural brain changes in a more sensitive manner [4,5].

M. Liu and J. Zhang contributed equally to this paper. Electronic supplementary material: the online version of this chapter (doi:10.1007/978-3-319-66179-7_1) contains supplementary material, which is available to authorized users.

Fig. 1. Illustration of our deep multi-task multi-channel learning (DM2L) framework.

While extensive MRI-based studies focus on predicting categorical variables in binary classification tasks, the multi-class classification task remains a challenging problem. Moreover, several pattern regression methods have been developed to estimate continuous clinical scores using MRI [6]. This line of research is very important, since it can help evaluate the stage of AD/MCI pathology and predict its future progression. Different from the classification task, which categorizes an MRI into binary or multiple classes, the regression task needs to estimate continuous values, which is more challenging in practice. Actually, the tasks of disease classification and clinical score regression may be highly associated, since they aim to predict semantically similar targets. Hence, jointly learning these two tasks can exploit the intrinsic correlation between categorical and clinical variables to improve the learning performance [6]. Existing methods generally first extract human-engineered features from MR images, and then feed these features into subsequent classification/regression models. Due to the possibly heterogeneous property between features and models, these methods usually lead to sub-optimal performance, because they simply utilize limited human-engineered features. Intuitively, integrating feature extraction and model learning into a unified framework could improve the diagnostic performance. Also, personal information (e.g., age, gender, and education level) may be related to brain status, and thus can affect the diagnostic performance for AD/MCI. However, it is often not accurate to simultaneously match multiple parameters for different clinical groups.

In this paper, we propose a deep multi-task multi-channel learning (DM2L) framework for joint classification and regression of brain status using MRI. Compared with conventional methods, DM2L can not only automatically learn representations for MRI without requiring any expert knowledge for pre-defining features, but also explicitly embed personal information (i.e., age, gender, and education level) into the learning model. Figure 1 shows a schematic diagram of DM2L. We first process MR images and identify anatomical landmarks in a data-driven manner, followed by a patch extraction procedure. We then propose a multi-task multi-channel convolutional neural network (CNN) to simultaneously perform multi-class disease classification and clinical score regression.

2 Materials and Methods

Data Description: Two public datasets containing 1396 subjects are used in this study, including the Alzheimer's Disease Neuroimaging Initiative-1 (ADNI-1) [7] and ADNI-2 [7]. For independent testing, subjects who participated in both
ADNI-1 and ADNI-2 are simply removed from ADNI-2. Subjects in the baseline ADNI-1 dataset have 1.5T T1-weighted MRI data, while subjects in the baseline ADNI-2 dataset have 3T T1-weighted MRI data. More specifically, ADNI-1 contains 226 normal control (NC), 225 stable MCI (sMCI), 165 progressive MCI (pMCI), and 181 AD subjects. In ADNI-2, there are 185 NC, 234 sMCI, 37 pMCI, and 143 AD subjects. Both sMCI and pMCI are defined based on whether MCI subjects would convert to AD within 36 months after the baseline time. Four types of clinical scores are acquired for each subject in both ADNI-1 and ADNI-2, including the Clinical Dementia Rating Sum of Boxes (CDRSB), the classic Alzheimer's Disease Assessment Scale Cognitive (ADAS-Cog) subscale with 11 items (ADAS11), the modified ADAS-Cog with 13 items (ADAS13), and the Mini-Mental State Examination (MMSE). We process all studied MR images via a standard pipeline, including anterior commissure (AC)-posterior commissure (PC) correction, intensity correction, skull stripping, and cerebellum removal.

Data-Driven Anatomical Landmark Identification: To extract informative patches from MRI for both feature learning and model training, we first identify discriminative AD-related landmark locations using a data-driven landmark discovery algorithm [8,9]. The aim is to identify the landmarks that have statistically significant group differences between AD patients and NC subjects in local brain structures. More specifically, using the Colin27 template, both linear and non-linear registration are performed to establish correspondences among voxels in different MR images. Then, morphological features are extracted from local image patches around the corresponding voxels in the linearly-aligned AD and NC subjects from ADNI-1. A voxel-wise group comparison between the AD and NC groups is then performed in the template space, through which a p-value can be calculated for each voxel. In this way, a p-value map can be obtained, whose local minima are defined as locations of discriminative landmarks in the template. In Fig. 2 (left), we illustrate the identified anatomical landmarks based on AD and NC subjects in ADNI-1. For a new testing MR image, one can first linearly align it to the template space, and then use a pre-trained landmark detector to localize each landmark. In this study, we assume that landmarks with significant differences between AD and NC groups are potential atrophy locations of MCI subjects. Accordingly, all pMCI and sMCI subjects share the same landmarks as those identified from the AD and NC groups.

Fig. 2. Illustration of (left) all identified AD-related anatomical landmarks, and (right) L = 50 selected landmarks, with colors denoting p-values in group comparison [8].

Patch Extraction from MRI: Based on the identified landmarks, we extract image patches from the MR image of a specific subject. As shown in Fig. 2 (left), some landmarks are close to each other. In such a case, patches extracted from these landmark locations will have large overlaps and thus can only provide limited information about the structures of MR images, due to redundant information. To this end, we define a spatial distance threshold (i.e., 20 voxels) to control the distance between landmarks, in order to reduce the overlaps of patches. In Fig. 2 (right), we plot the L = 50 selected landmarks, from which we can see that many of these selected landmarks are located in the bilateral hippocampal, parahippocampal, and fusiform areas. These areas are reported to be related to AD/MCI in previous studies [10,11]. Then, for each subject, we extract L image patches (of size 24 × 24 × 24) based on the L landmarks, with each patch centered at a specific landmark location. We further randomly extract image patches centered at each landmark location with displacements within a 5 × 5 × 5 cube, in order to reduce the impact of landmark detection errors.

Multi-Task Multi-Channel CNN: As shown in Fig. 3, we propose a multi-task multi-channel CNN, which allows the learning model to extract feature representations implicitly from the input MRI patches. This architecture adopts multi-channel input data, where each channel corresponds to a local image patch extracted from a specific landmark location. In addition, we incorporate personal information (i.e., age, gender, and education level) into the learning model, in order to investigate the impact of personal information on the performance of computer-aided disease diagnosis. As shown in Fig. 3, the input of this network includes the L image patches, age, gender, and education level of each subject, while the output contains the class labels and four clinical scores (i.e., CDRSB, ADAS11, ADAS13, and MMSE).
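As a concrete illustration of the landmark-based patch extraction described above, the following minimal numpy sketch (not the authors' code; the array names, patch size, and jitter range are assumptions based on the text) crops one 24 × 24 × 24 patch per landmark, with a random displacement of up to ±2 voxels per axis to mimic the 5 × 5 × 5 jitter cube:

```python
import numpy as np

def extract_landmark_patches(volume, landmarks, patch_size=24, jitter=2, rng=None):
    """Crop one cubic patch per landmark from a 3D MR volume.

    volume    : 3D numpy array (already preprocessed and linearly aligned).
    landmarks : (L, 3) integer array of landmark coordinates (voxels).
    jitter    : maximum random displacement per axis (2 -> 5x5x5 cube).
    Returns an (L, patch_size, patch_size, patch_size) array.
    """
    rng = rng or np.random.default_rng()
    half = patch_size // 2
    patches = np.zeros((len(landmarks), patch_size, patch_size, patch_size),
                       dtype=volume.dtype)
    for i, center in enumerate(landmarks):
        # Random displacement inside the jitter cube, to be robust to
        # landmark-detection errors (and as a form of data augmentation).
        center = np.asarray(center) + rng.integers(-jitter, jitter + 1, size=3)
        # Clamp so the patch stays inside the volume.
        center = np.clip(center, half, np.array(volume.shape) - (patch_size - half))
        x, y, z = center
        patches[i] = volume[x - half:x - half + patch_size,
                            y - half:y - half + patch_size,
                            z - half:z - half + patch_size]
    return patches
```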

Fig. 3. Architecture for the proposed multi-task multi-channel CNN.

Since the appearance of brain MRI is often globally similar and locally different, both global and local structural information could be important for the tasks of classification and regression. To capture the local structural information of MRI, we first develop L-channel parallel CNN architectures. In each channel CNN, there is a sequence of six convolutional layers and two fully connected (FC) layers (i.e., FC7 and FC8). Each convolutional layer is followed by a rectified linear unit (ReLU) activation function, while Conv2, Conv4, and Conv6 are followed by 2 × 2 × 2 max-pooling operations for down-sampling. Note that each channel contains the same number of convolutional layers with the same parameters, while their weights are independently optimized and updated. To model the global structural information of MRI, we then concatenate the outputs of the L FC8 layers and add two additional FC layers (i.e., FC9 and FC10). Moreover, we feed a concatenated representation comprising the output of FC10 and the personal information (i.e., age, gender, and education level) into two FC layers (i.e., FC11 and FC12). Finally, two FC13 layers are used to predict the class probability (via soft-max) and to estimate the clinical scores, respectively.

The proposed network can also be described mathematically as follows. Let $\mathcal{X} = \{X_n\}_{n=1}^N$ denote the training set, with the element $X_n$ representing the $n$-th subject. Denote the labels of the $C$ ($C = 4$) categories as $\mathbf{y}^c = \{y_n^c\}_{n=1}^N$ ($c = 1, 2, \cdots, C$), and the $S$ ($S = 4$) types of clinical scores as $\mathbf{z}^s = \{z_n^s\}_{n=1}^N$ ($s = 1, 2, \cdots, S$). In this study, both the class labels and the clinical scores are used in a back-propagation procedure to update the network weights in the convolutional layers and to learn the most relevant features in the FC layers. The aim of the proposed CNN is to learn a non-linear mapping $\Psi: \mathcal{X} \to \{\{\mathbf{y}^c\}_{c=1}^C, \{\mathbf{z}^s\}_{s=1}^S\}$ from the input space to both the space of class labels and the space of clinical scores, and the objective function is as follows:

$$\arg\min_{\mathbf{W}} \; -\frac{1}{C}\sum_{c=1}^{C}\frac{1}{N}\sum_{X_n\in\mathcal{X}} \mathbf{1}\{y_n^c = c\}\,\log\big(\mathrm{P}(y_n^c = c \mid X_n;\mathbf{W})\big) \;+\; \frac{1}{S}\sum_{s=1}^{S}\frac{1}{N}\sum_{X_n\in\mathcal{X}} \big(z_n^s - \hat{z}_n^s\big)^2, \qquad (1)$$

where the first term is the cross-entropy loss for multi-class classification, and the second one is the mean squared loss for regression, evaluating the difference between the estimated clinical score $\hat{z}_n^s$ and the ground truth $z_n^s$. Note that $\mathbf{1}\{\cdot\}$ is an indicator function, with $\mathbf{1}\{\cdot\} = 1$ if $\{\cdot\}$ is true and $0$ otherwise. In addition, $\mathrm{P}(y_n^c = c \mid X_n; \mathbf{W})$ indicates the probability of the subject $X_n$ being correctly classified as the category $y_n^c$ using the network coefficients $\mathbf{W}$.
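To make the architecture and the joint loss in Eq. (1) concrete, here is a minimal PyTorch-style sketch (not the authors' code: layer widths, channel counts, and variable names are illustrative assumptions; only the overall pattern of parallel patch channels, concatenation with personal information, and a combined cross-entropy plus mean-squared-error objective follows the text):

```python
import torch
import torch.nn as nn

class PatchChannel(nn.Module):
    """One per-landmark CNN branch: six 3D conv layers with ReLU,
    max-pooling after Conv2/Conv4/Conv6, then two FC layers (FC7, FC8)."""
    def __init__(self, out_dim=32):
        super().__init__()
        chans = [1, 8, 8, 16, 16, 32, 32]
        layers = []
        for i in range(6):
            layers += [nn.Conv3d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU()]
            if i in (1, 3, 5):                      # after Conv2, Conv4, Conv6
                layers.append(nn.MaxPool3d(2))
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(32 * 3 * 3 * 3, 64), nn.ReLU(),   # FC7
                                nn.Linear(64, out_dim), nn.ReLU())          # FC8

    def forward(self, x):                            # x: (B, 1, 24, 24, 24)
        return self.fc(self.conv(x))

class DM2LNet(nn.Module):
    def __init__(self, num_patches=50, num_classes=4, num_scores=4):
        super().__init__()
        self.channels = nn.ModuleList(PatchChannel() for _ in range(num_patches))
        self.global_fc = nn.Sequential(nn.Linear(32 * num_patches, 128), nn.ReLU(),  # FC9
                                       nn.Linear(128, 64), nn.ReLU())                # FC10
        self.mixed_fc = nn.Sequential(nn.Linear(64 + 3, 64), nn.ReLU(),              # FC11
                                      nn.Linear(64, 32), nn.ReLU())                  # FC12
        self.cls_head = nn.Linear(32, num_classes)    # FC13 (soft-max applied in the loss)
        self.reg_head = nn.Linear(32, num_scores)     # FC13 (clinical scores)

    def forward(self, patches, personal):
        # patches: (B, L, 1, 24, 24, 24); personal: (B, 3) = age, gender, education
        feats = [ch(patches[:, i]) for i, ch in enumerate(self.channels)]
        g = self.global_fc(torch.cat(feats, dim=1))
        h = self.mixed_fc(torch.cat([g, personal], dim=1))
        return self.cls_head(h), self.reg_head(h)

# Joint objective of Eq. (1): cross-entropy for the class label plus MSE for the scores.
def joint_loss(logits, scores_pred, labels, scores_true):
    return nn.functional.cross_entropy(logits, labels) + \
           nn.functional.mse_loss(scores_pred, scores_true)
```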

3 Experiments

Experimental Settings: We perform both multi-class classification (NC vs. sMCI vs. pMCI vs. AD) and regression of four clinical scores (CDRSB, ADAS11, ADAS13, and MMSE). The performance of multi-class classification is evaluated by the overall classification accuracy (Acc) for the four categories, as well as the accuracy for each category. The performance of regression is evaluated by the correlation coefficient (CC) and the root mean square error (RMSE). To evaluate the generalization ability and the robustness of a specific model, we adopt ADNI-1 as the training dataset and ADNI-2 as the independent testing dataset. We compare our DM2L method with two state-of-the-art methods, including (1) voxel-based morphometry (VBM) [12], and (2) an ROI-based method (ROI). For VBM, we first normalize all MR images to the anatomical automatic labeling (AAL) template using a non-linear image registration technique, and then extract the local GM tissue density in a voxel-wise manner as features. We also perform a t-test to select informative features, followed by a linear support vector machine (LSVM) or a linear support vector regressor (LSVR) for classification or regression. For ROI, the brain MRI is first segmented into three tissue types, i.e., gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). We then align the AAL template with 90 pre-defined ROIs to the native space of each subject using a deformable registration algorithm. Then, the normalized volumes of GM tissue in the 90 ROIs are used as the representation of an MR image, followed by an LSVM or an LSVR for classification or regression. To evaluate the contributions of the two strategies adopted in DM2L (i.e., joint learning, and using personal information), we further compare DM2L with three of its variants. These variants include (1) deep single-task multi-channel learning (DSML) using personal information, (2) deep single-task multi-channel learning without using personal information (denoted as DSML-1), and (3) deep multi-task multi-channel learning without using personal information (denoted as DM2L-1). Note that DSML-1 and DSML employ a similar CNN architecture to that shown in Fig. 3, but perform the tasks of classification and regression separately. In addition, DM2L-1 does not adopt any personal information for the joint learning of classification and regression. The size of the image patches is empirically set to 24 × 24 × 24 in DM2L and its three variants, and they share the same L = 50 landmarks as shown in Fig. 2 (right).

Results: In Table 1, we report the experimental results achieved by the six methods in the tasks of multi-class disease classification (i.e., NC vs. sMCI vs. pMCI vs. AD) and regression of four clinical scores. The confusion matrices for multi-class classification are given in the Supplementary Materials.

Table 1. Results of multi-class disease classification and clinical score regression.

Multi-class disease classification (NC vs. sMCI vs. pMCI vs. AD):

Method | Acc   | AccNC | AccsMCI | AccpMCI | AccAD
VBM    | 0.404 | 0.557 | 0.295   | 0.081   | 0.469
ROI    | 0.431 | 0.589 | 0.269   | 0.027   | 0.594
DSML-1 | 0.467 | 0.784 | 0.295   | 0.189   | 0.413
DSML   | 0.486 | 0.611 | 0.419   | 0.216   | 0.503
DM2L-1 | 0.487 | 0.665 | 0.415   | 0.297   | 0.427
DM2L   | 0.518 | 0.600 | 0.513   | 0.243   | 0.490

Clinical score regression (CC / RMSE):

Method | CDRSB         | ADAS11        | ADAS13         | MMSE
VBM    | 0.278 / 2.010 | 0.290 / 7.406 | 0.327 / 10.322 | 0.289 / 2.889
ROI    | 0.380 / 1.893 | 0.360 / 7.358 | 0.371 / 10.319 | 0.325 / 2.899
DSML-1 | 0.475 / 1.859 | 0.497 / 6.499 | 0.508 / 9.195  | 0.468 / 2.593
DSML   | 0.522 / 1.674 | 0.542 / 6.268 | 0.581 / 8.591  | 0.538 / 2.414
DM2L-1 | 0.481 / 1.817 | 0.516 / 6.529 | 0.554 / 9.771  | 0.492 / 2.643
DM2L   | 0.533 / 1.666 | 0.565 / 6.200 | 0.590 / 8.537  | 0.567 / 2.373

Fig. 4. Scatter plots of estimated scores vs. true scores achieved by different methods.

Figure 4 further shows the scatter plots of the estimated scores vs. the true scores achieved by the six different methods for the four clinical scores. Note that the clinical scores are normalized to [0, 1] during model learning, and we transform the estimated scores back to their original ranges in Fig. 4. From Table 1 and Fig. 4, we can make at least four observations. First, compared with the conventional methods (i.e., VBM and ROI), the four proposed deep-learning-based approaches generally yield better results in both disease classification and clinical score regression. For instance, in terms of the overall accuracy, DM2L achieves an 11.4% and an 8.7% improvement compared with VBM and ROI, respectively. In addition, VBM and ROI can only achieve very low classification accuracies (i.e., 0.081 and 0.027, respectively) for the pMCI subjects, while our deep-learning-based methods achieve much higher accuracies. This implies that the integration of feature extraction into model learning provides a good solution for improving diagnostic performance, since feature learning and model training can be optimally coordinated. Second, in both the classification and regression tasks, the proposed joint learning models are usually superior to the models that learn the different tasks separately. That is, DM2L usually achieves better results than DSML, and DM2L-1 outperforms DSML-1. Third, DM2L and DSML generally outperform their counterparts (i.e., DM2L-1 and DSML-1) that do not incorporate personal information (i.e., age, gender, and education level) into the learning process. This suggests that personal information helps improve the learning performance of the proposed method. Finally, as can be seen from Fig. 4, our DM2L method generally outperforms the five competing
methods in the regression of the four clinical scores. Considering the different signal-to-noise ratios of the MRI in the training set (i.e., ADNI-1 with 1.5T scanners) and the MRI in the testing set (i.e., ADNI-2 with 3T scanners), these results imply that the model learned via our DM2L framework has good generalization capability.

4 Conclusion

We propose a deep multi-task multi-channel learning (DM2L) framework for joint classification and regression of brain status using MRI and personal information. Results on two public cohorts demonstrate the effectiveness of DM2L in both multi-class disease classification and clinical score regression. However, due to differences in data distribution between ADNI-1 and ADNI-2, directly applying a model trained on ADNI-1 to ADNI-2 may degrade the performance. It would be interesting to study a model adaptation strategy to reduce the negative influence of such distribution differences. Besides, studying how to automatically extract informative patches from MRI is also meaningful and will be part of our future work.

References

1. Fox, N., Warrington, E., Freeborough, P., Hartikainen, P., Kennedy, A., Stevens, J., Rossor, M.N.: Presymptomatic hippocampal atrophy in Alzheimer's disease. Brain 119(6), 2001–2007 (1996)
2. Liu, M., Zhang, D., Shen, D.: View-centralized multi-atlas classification for Alzheimer's disease diagnosis. Hum. Brain Mapp. 36(5), 1847–1865 (2015)
3. Liu, M., Zhang, J., Yap, P.T., Shen, D.: View-aligned hypergraph learning for Alzheimer's disease diagnosis with incomplete multi-modality data. Med. Image Anal. 36, 123–134 (2017)
4. Frisoni, G.B., Fox, N.C., Jack, C.R., Scheltens, P., Thompson, P.M.: The clinical use of structural MRI in Alzheimer disease. Nature Rev. Neurol. 6(2), 67–77 (2010)
5. Liu, M., Zhang, D., Shen, D.: Relationship induced multi-template learning for diagnosis of Alzheimer's disease and mild cognitive impairment. IEEE Trans. Med. Imaging 35(6), 1463–1474 (2016)
6. Sabuncu, M.R., Konukoglu, E., Initiative, A.D.N., et al.: Clinical prediction from structural brain MRI scans: a large-scale empirical study. Neuroinformatics 13(1), 31–46 (2015)
7. Jack, C.R., Bernstein, M.A., Fox, N.C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P.J., L. Whitwell, J., Ward, C.: The Alzheimer's disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging 27(4), 685–691 (2008)
8. Zhang, J., Gao, Y., Gao, Y., Munsell, B., Shen, D.: Detecting anatomical landmarks for fast Alzheimer's disease diagnosis. IEEE Trans. Med. Imaging 35(12), 2524–2533 (2016)
9. Zhang, J., Liu, M., An, L., Gao, Y., Shen, D.: Alzheimer's disease diagnosis using landmark-based features from longitudinal structural MR images. IEEE J. Biomed. Health Inform. (2017). doi:10.1109/JBHI.2017.2704614
10. De Jong, L., Van der Hiele, K., Veer, I., Houwing, J., Westendorp, R., Bollen, E., De Bruin, P., Middelkoop, H., Van Buchem, M., Van Der Grond, J.: Strongly reduced volumes of putamen and thalamus in Alzheimer's disease: an MRI study. Brain 131(12), 3277–3285 (2008)
11. Hyman, B.T., Van Hoesen, G.W., Damasio, A.R., Barnes, C.L.: Alzheimer's disease: cell-specific pathology isolates the hippocampal formation. Science 225, 1168–1171 (1984)
12. Baron, J., Chetelat, G., Desgranges, B., Perchey, G., Landeau, B., De La Sayette, V., Eustache, F.: In vivo mapping of gray matter loss with voxel-based morphometry in mild Alzheimer's disease. NeuroImage 14(2), 298–309 (2001)

Nonlinear Feature Space Transformation to Improve the Prediction of MCI to AD Conversion

Pin Zhang (1), Bibo Shi (2), Charles D. Smith (3), and Jundong Liu (1)

(1) School of Electrical Engineering and Computer Science, Ohio University, Athens, OH, USA ([email protected])
(2) Department of Radiology, Duke University, Durham, NC, USA
(3) Department of Neurology, University of Kentucky, Lexington, KY, USA

Abstract. Accurate identification of patients with Mild Cognitive Impairment (MCI) at high risk for conversion to Alzheimer's Disease (AD) offers an opportunity to target the disease process early. In this paper, we present a novel nonlinear feature transformation scheme to improve the prediction of MCI-AD conversion through semi-supervised learning. Utilizing Laplacian SVM (LapSVM) as a host classifier, the proposed method learns a smooth spatially varying transformation that makes the input data more linearly separable. Our approach has a broad applicability to boost the classification performance of many other semi-supervised learning solutions. Using baseline MR images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, we evaluate the effectiveness of the proposed semi-supervised framework and demonstrate the improvements over the state-of-the-art solutions within the same category.

1 Introduction

Alzheimer's Disease (AD), the most common form of dementia, affected more than 34 million people in 2016. Amnestic mild cognitive impairment (MCI) is often regarded as a prodromal stage of AD, where some patients convert to AD over time, and the others remain stable for many years. Identifying the differences between the "converter" and "stable" groups of subjects can offer an opportunity to target the disease early. A number of solutions have been proposed in recent years to tackle the AD/MCI early diagnosis problem. Common practices include the utilization of multi-modality [1,8,15] and longitudinal [7,21] data to exploit complementary information, dimension reduction to diminish data redundancy, and feature selection to extract the most discriminative feature set. Ensembles of classifiers from multiple domains or/and levels have also been employed to improve the overall classification performance [4,7,8,21]. As an accurate diagnosis of MCI-AD conversion is often not available until a later time, semi-supervised learning (SSL), utilizing unlabeled data in conjunction with labeled samples (the valuable gold-standard confirmed cases) to
improve classification performance, is uniquely suitable to predict patients’ clinical trajectories. In [19], MCI subjects were used as unlabeled data to boost the classification accuracy in discriminating AD vs. normal control (NC) subjects. Compared with using AD/NC subjects only, a significant improvement was achieved. Similar approaches were proposed in [5,17] to predict disease labels for MCI subjects. Moradi et al. [9] developed a semi-supervised classifier for AD conversion prediction in MCI patients based on low-density separation (LDS). All these studies demonstrate that label augmentation through unlabeled data samples equips SSL with better predictive power over supervised learning. Despite all the strides made in recent years, insufficient attention has been given to rationally selecting appropriate metrics from the training data that could maximize the power of various SSL solutions. Learning a metric from the training input is equivalent to learning a feature transformation [11,13,16], and such transformations can often significantly boost the performance of many metric-based algorithms, such as kNN, k-means, and even SVMs in various tasks [3,12,20]. In this paper, we propose to enhance the prediction of MCI-AD conversion via a novel nonlinear feature transformation scheme. We take Laplacian SVM (LapSVM), a classic graph-based SSL model, as the host classifier, and generalize it through the application of deformable geometric models to transform the feature space. The Coherent Point Drifting (CPD) method from [10] is chosen as the geometric model in this study, due to its remarkable versatility and representation power in accounting for high-order deformations. In this work, we focus on the classification of progressive MCI (pMCI) vs. stable MCI (sMCI) subjects. The connection to MCI-AD conversion is that: if our solution can differentiate pMCI/sMCI subjects at their baseline time, very accurately and robustly, it should be able to make reliable MCI-to-AD predictions for unseen subjects. The data we used are baseline MR images, obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (www.loni.usc.edu/ADNI).

2 Method

Formulated under the standard support vector machine (SVM) framework, LapSVM solves the classification problem by employing two regularization terms: one for SVM maximal-margin classification, and the other for label smoothness across the data graph – neighboring nodes should have identical or similar labels. Let $\mathcal{X} = \{x_i \mid x_i \in \mathbb{R}^d, i = 1, \cdots, l+u\}$ be the training set, where $\{x_i, y_i\}_{i=1}^{l}$ are labeled samples with labels $y_i \in \{-1, +1\}$, and the remaining $\{x_i\}_{i=l+1}^{l+u}$ have no labels. A graph needs to be established to specify the adjacency relationships among samples, and then LapSVM estimates a membership function $f(x)$ on the graph by solving the following optimization problem:

$$\min_{f \in \mathcal{H}_K} \; J = \frac{1}{l}\sum_{i=1}^{l}\xi_i + C_1\|f\|_K^2 + C_2\sum_{i,j=1}^{l+u} W_{ij}\big(f(x_i) - f(x_j)\big)^2 \quad \text{s.t.}\;\; y_i f(x_i) \ge 1 - \xi_i,\; \xi_i \ge 0,\; \forall i = 1 \dots l, \qquad (1)$$
where $W_{ij}$ is the weight of the edge that connects $x_i$ and $x_j$ in the data adjacency graph, and $\xi_i$ are slack variables to penalize misclassifications. $\|f\|_K^2$ is the squared norm of $f$ in the reproducing kernel Hilbert space (RKHS). $C_1$ and $C_2$ are hyperparameters controlling the contributions of the two regularization terms.
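For intuition, the following small numpy sketch (illustrative only; the variable names and the use of a full heat-kernel graph are assumptions consistent with the paper's later description of full adjacency graphs) builds the edge weights $W_{ij}$ and evaluates the graph-smoothness penalty $\sum_{i,j} W_{ij}(f(x_i)-f(x_j))^2$ for a given vector of function values:

```python
import numpy as np

def heat_kernel_weights(X, alpha=1.0):
    """Full adjacency graph with heat-kernel edge weights
    W_ij = exp(-||x_i - x_j||^2 / (2 * alpha^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * alpha ** 2))

def graph_smoothness(W, f_vals):
    """Evaluate sum_ij W_ij (f_i - f_j)^2, i.e., 2 f^T L f with L the graph Laplacian."""
    diff = f_vals[:, None] - f_vals[None, :]
    return float(np.sum(W * diff ** 2))

# Tiny usage example with random data and an arbitrary membership function f.
X = np.random.randn(10, 5)
f_vals = np.random.randn(10)
W = heat_kernel_weights(X, alpha=2.0)
print(graph_smoothness(W, f_vals))
```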

2.1 Feature Transformation Through Coherent Point Drifting (CPD)

For distance- or dissimilarity-based classification algorithms, the application of a smooth nonlinear transformation across the space is equivalent to assigning spatially varying metrics at different locations. Our goal is to learn such a transformation so that the displaced samples better conform to the data distribution assumed by the ensuing classifier, LapSVM. In this work, the CPD model is chosen as the geometric model to drive the deformations.

CPD was originally designed for landmark matching. For two sets $\mathcal{X}$ and $\mathcal{U}$, each with $n$ points of dimension $d$, CPD seeks a continuous velocity function $v(x): \mathbb{R}^d \to \mathbb{R}^d$ that moves the source dataset $\mathcal{X}$ towards the target $\mathcal{U}$. Estimation of an optimal $v(\cdot)$ is formulated under the Tikhonov regularization framework: $R[v] = \frac{1}{2}\sum_{i=1}^{n}\big[u_i - (x_i + v(x_i))\big]^2 + \frac{1}{2}\lambda\|Dv\|^2$, where $D$ is a linear differentiation operator, $\|\cdot\|$ is the norm operation, and $\lambda$ controls the strength of the regularization term. CPD chooses a particular regularization term whose kernel function is a Gaussian low-pass filter. According to [10], the optimal solution $v(x)$ in CPD can be written in matrix format as

$$v(x_i) = \Psi \begin{pmatrix} G(x_i, x_1) \\ \vdots \\ G(x_i, x_n) \end{pmatrix} = \Psi\, G(x_i, \mathcal{X}), \qquad (2)$$

where $\Psi$ (of size $d \times n$) is the weight matrix for the Gaussian kernel functions, and $G(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$, where $\sigma$ is the width of the Gaussian filter.
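The CPD-parameterized displacement of Eq. (2) is easy to state in code; the sketch below (not the authors' implementation; names and shapes are illustrative) evaluates $v(x) = \Psi\,G(x, \mathcal{X})$ and applies it to move the training samples:

```python
import numpy as np

def gaussian_kernel_vector(x, X, sigma):
    """G(x, X): kernel values between a query point x and all n control points X (n, d)."""
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def cpd_velocity(x, Psi, X, sigma):
    """v(x) = Psi @ G(x, X), with Psi of shape (d, n) as in Eq. (2)."""
    return Psi @ gaussian_kernel_vector(x, X, sigma)

def apply_cpd(X, Psi, sigma):
    """Move every sample: x_i^1 = x_i^0 + v(x_i^0)  (cf. Eq. (3))."""
    return np.array([x + cpd_velocity(x, Psi, X, sigma) for x in X])

# Usage: with Psi = 0 the transformation is the identity.
n, d = 20, 3
X0 = np.random.randn(n, d)
Psi = np.zeros((d, n))
assert np.allclose(apply_cpd(X0, Psi, sigma=1.0), X0)
```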

2.2 CPD Based LapSVM (CPD-LapSVM)

Allowing samples to be moved, our proposed CPD-LapSVM is designed to learn a spatial transformation and a classifier at the same time. Let $x_i^0$ be the original position of a sample $x_i$. Through the motion regulated by CPD, $x_i$ will be moved to a new location $x_i^1$:

$$x_i^1 = x_i^0 + v(x_i^0) = x_i^0 + \Psi\, G(x_i^0, \mathcal{X}^0). \qquad (3)$$

The CPD transformation can be applied both in the original input space and in the feature space after kernelization. The classifier to be learned is a LapSVM defined on the CPD-transformed samples: $f(x_i) = w^T x_i^1 + b$. Under the input space, our linear version of CPD-LapSVM (note: "linear" refers to the decision boundary; the transformation is nonlinear) is built on the LapSVM objective function in Eq. (1). Two modifications are made.
First, quadratically smoothed hinge loss functions, $\xi_i = \max[0, 1 - y_i f(x_i^1)]^2$, are included as slack variables to convert the constrained optimization in Eq. (1) into an unconstrained optimization problem. The choice of the quadratic form is to make the computation of derivatives more convenient. Second, to reduce the chance of overfitting, we add the squared Frobenius norm $\|\Psi\|_F^2$ to penalize non-smoothness in the estimated transformations. With these two added terms, our linear CPD-LapSVM minimizes the following updated objective function to find the optimal transformation and classifier, specified by $\Psi$ and $\{w, b\}$, respectively:

$$\min_{\Psi, w, b} \; J = \frac{1}{l}\sum_{i=1}^{l}\max\big[0,\, 1 - y_i(w^T x_i^1 + b)\big]^2 + C_1\|w\|_K^2 + C_2\|\Psi\|_F^2 + C_3\sum_{j,k=1}^{l+u} W_{jk}\big(w^T x_j^1 - w^T x_k^1\big)^2, \qquad (4)$$

where $C_1$, $C_2$ and $C_3$ are trade-off hyperparameters. In this paper, we choose full graphs as the neighborhood adjacency graphs, where every sample pair is assumed to be connected. The edge weight of each connection is calculated as $W_{jk} = \exp\big(-\frac{1}{2\alpha^2}\|x_j^1 - x_k^1\|^2\big)$, where $\alpha$ is the width of the heat kernel function. To solve Eq. (4), an EM-like iterative strategy is applied to update $\Psi$ and $\{w, b\}$ alternatingly. First, when $\Psi$ is fixed, Eq. (4) reduces to the original LapSVM, operating on the deformed training samples $\mathcal{X}^1$. A standard LapSVM solver, as in [2], can be used to search for the optimal classifier. Second, with $\{w, b\}$ fixed, the classification decision boundary becomes explicit. The updated objective function now only depends on the parameter $\Psi$ (note: $x_i^1$ is also parameterized by $\Psi$):

$$\min_{\Psi} \; J = \frac{1}{l}\sum_{i=1}^{l}\max\big[0,\, 1 - y_i(w^T x_i^1 + b)\big]^2 + C_2\|\Psi\|_F^2 + C_3\sum_{j,k=1}^{l+u} W_{jk}\big(w^T x_j^1 - w^T x_k^1\big)^2. \qquad (5)$$

The objective in Eq. (5) is differentiable w.r.t. $\Psi$. In this work, we used the "fmincon" function in Matlab, a gradient-based nonlinear programming solver, to search for the optimal $\Psi$. Our SSL framework is a general paradigm. While the above derivations are based on a particular classifier, LapSVM, integrating CPD with many other SSL solutions is often straightforward. We can commonly use CPD to parameterize data samples at new locations, and apply the same two-stage EM procedure to estimate an optimal pair of transformation and classifier jointly. For example, CPD can be integrated with another SSL solution, the Transductive SVM (CPD-TSVM), in a fashion similar to CPD-LapSVM. In addition, it should be noted that our model is different from kernel learning methods, where the kernel bases need to be pre-defined and only their weights are learned from the training
samples. Our CPD model realizes a fully deformable nonlinear feature transformation that is directly estimated from the training samples.

Kernelization of CPD-LapSVM. The CPD-LapSVM introduced so far is developed in, and therefore applicable to, the input space. It can be further kernelized to deal with more complicated data. In this paper, we adopt the kernel principal component analysis (KPCA) based framework of [18]. By first projecting all input samples into a kernel feature space introduced by KPCA, we can train CPD-LapSVM in that kernel space to learn both $\Psi$ and $\{w, b\}$, in the same way as in the original input space. No derivation of any new mathematical formula is needed; the detailed procedure can be found in [18].
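To make the alternating strategy concrete, here is a compact, self-contained Python sketch (not the authors' code: a simple ridge-regression fit stands in for the true LapSVM solver of step 1, scipy's L-BFGS-B replaces Matlab's fmincon, and all names and hyperparameter values are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def gauss(XA, XB, s):
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

def deform(X0, Psi, s):
    # x_i^1 = x_i^0 + Psi G(x_i^0, X^0), Eq. (3); Psi has shape (d, n).
    return X0 + gauss(X0, X0, s) @ Psi.T

def eq5(Psi_flat, X0, y, lab, w, b, s, alpha, C2, C3):
    # Objective of Eq. (5) for fixed (w, b): squared hinge + ||Psi||_F^2 + graph term.
    d, n = X0.shape[1], X0.shape[0]
    Psi = Psi_flat.reshape(d, n)
    X1 = deform(X0, Psi, s)
    f = X1 @ w + b
    hinge = np.mean(np.maximum(0.0, 1.0 - y[lab] * f[lab]) ** 2)
    W = gauss(X1, X1, alpha)
    graph = np.sum(W * (f[:, None] - f[None, :]) ** 2)
    return hinge + C2 * np.sum(Psi ** 2) + C3 * graph

def cpd_lapsvm(X0, y, lab, s=1.0, alpha=1.0, C2=0.1, C3=1e-3, iters=5):
    """EM-like alternation: (1) fix Psi and fit a classifier on the deformed samples
    (a ridge-regression stand-in replaces the true LapSVM solver here);
    (2) fix (w, b) and update Psi by minimizing Eq. (5)."""
    n, d = X0.shape
    Psi = np.zeros((d, n))
    for _ in range(iters):
        X1 = deform(X0, Psi, s)
        A = np.c_[X1[lab], np.ones(lab.sum())]            # placeholder classifier fit
        wb = np.linalg.lstsq(A.T @ A + 0.1 * np.eye(d + 1), A.T @ y[lab], rcond=None)[0]
        w, b = wb[:d], wb[d]
        res = minimize(eq5, Psi.ravel(), args=(X0, y, lab, w, b, s, alpha, C2, C3),
                       method="L-BFGS-B")
        Psi = res.x.reshape(d, n)
    return Psi, w, b

# Toy usage: 30 samples, first 10 labeled (+/-1), the rest unlabeled.
X0 = np.random.randn(30, 4)
y = np.sign(X0[:, 0])
lab = np.zeros(30, dtype=bool)
lab[:10] = True
Psi, w, b = cpd_lapsvm(X0, y, lab)
```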

3 Experimental Results

In this section, we evaluate the proposed CPD-LapSVM through two binary classification problems: AD vs. NC with MCI subjects as unlabeled samples, and progressive MCI (pMCI) vs. stable MCI (sMCI) with unknown MCI (uMCI) as unlabeled samples (the definition of "unknown MCI" will be given later). The data used in the experiments were obtained from ADNI. We focus on features extracted from baseline T1-weighted MRIs. Overall, 185 patients with AD, 242 with MCI and 227 with NC (654 subjects in total) were used in our experiments. The features utilized in this study are 113 cortical and sub-cortical regional volumes under the "Cross-Sectional Processing aseg" files, available under ADNI. The anatomical structures include the left/right Hippocampi, left/right Caudates, etc. The full list of structure names can be found in one of our previous studies [12]. All features have been normalized by the corresponding whole-brain volumes obtained from the "Intracranial Volume Brain Mask" files. The performance of the various classification solutions is compared based on four measures: classification accuracy (ACC), sensitivity (SEN), specificity (SPE), and area under the receiver operating characteristic curve (AUC). Three semi-supervised methods, Laplacian regularized least squares (LapRLS), Laplacian SVM (LapSVM) and Optimized LapSVM (OLapSVM) [6], are utilized in all experiments as the competing solutions. For each solution, both linear and RBF Gaussian kernel versions are evaluated. In the end, we also compare our method with five state-of-the-art pMCI/sMCI classification solutions [4,5,9,14,17], which also used baseline T1-weighted MRIs from the ADNI database.
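For reference, the four evaluation measures listed above can be computed as in the following small sketch (illustrative code, not from the paper; it assumes binary labels in {0, 1} and uses scikit-learn only for the AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """ACC, SEN, SPE from hard predictions; AUC from continuous scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn)          # sensitivity (recall of the positive class)
    spe = tn / (tn + fp)          # specificity (recall of the negative class)
    auc = roc_auc_score(y_true, y_score)
    return acc, sen, spe, auc
```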

3.1 AD vs. NC with MCI as Unknown

The first set of experiments classifies AD and NC subjects, with MCI subjects as unlabeled samples. The AD and NC groups, which are used as labeled subjects, were randomly divided into four folds for cross validation (2 for training, 1 for validation and 1 for testing). All MCI subjects were shared as unlabeled data across folds.

Table 1. Performance comparison of CPD-LapSVM with other methods for AD vs. NC classifications. Boldface denotes the best ACC & AUC performance.

AD vs. NC  | Linear kernel                         | RBF kernel
Methods    | ACC (%) | SEN (%) | SPE (%) | AUC (%) | ACC (%) | SEN (%) | SPE (%) | AUC (%)
LapRLSC    | 83.65   | 83.11   | 84.02   | 89.56   | 84.91   | 79.05   | 89.72   | 90.34
LapSVM     | 84.11   | 83.21   | 84.78   | 89.98   | 84.77   | 79.80   | 88.76   | 90.40
OLapSVM    | 85.36   | 84.26   | 86.38   | 90.89   | 86.38   | 81.75   | 90.03   | 91.69
CPD-LapSVM | 86.33   | 85.77   | 86.82   | 91.44   | 87.47   | 81.73   | 91.97   | 92.89

The involved hyper-parameters $C_1$, $C_2$ and $C_3$ are all chosen from $\{2^{-5} \sim 2^{10}\}$ over the cross validations. $C_1$ and $C_2$ are the slackness trade-off and graph regularization parameters used in all models. $C_3$ is the regularization parameter used only in our model. All the RBF kernel versions of the methods have an additional parameter to tune, the RBF Gaussian kernel width $\sigma$, which is also chosen from $\{2^{-5} \sim 2^{10}\}$ in our experiments. Table 1 summarizes the AD vs. NC classification results for all methods, averaged over 50 random repeats. It is evident that our CPD-LapSVM achieves the highest ACC and AUC scores among the competing methods, for both the linear and kernel versions. It is also noteworthy that the linear version of CPD-LapSVM obtains results comparable to the kernel versions of the other competing methods. The standard deviations of the measures were also computed; however, they are not included in the table, mainly due to the page limit.

3.2 pMCI vs. sMCI with uMCI as Unknown

The second set of experiments is designed to predict AD conversion in MCI patients through classification of pMCI vs. sMCI, with uMCI as unlabeled samples. If the initial diagnosis was MCI at baseline, but the follow-up diagnoses are missing or not stable, the patient is categorized as "unknown MCI". Overall, 110 patients with pMCI, 38 with sMCI and 227 with uMCI (242 MCI subjects in total) were used in our experiments. The same experimental setting and hyperparameter selection approach as in the previous AD/NC classifications are adopted here. uMCI subjects were shared as unlabeled data during the 4-fold cross validation. The results are reported in Table 2. Similar trends as in the AD vs. NC classification can be observed. The ACC score of our linear-version CPD-LapSVM is significantly higher than that of all other methods. Compared with LapSVM, which is the host solution of CPD-LapSVM, the ACC improvement is from 67.09% to 69.12%. For the RBF Gaussian kernel versions, the highest ACC and AUC scores were both achieved by our model. In order to investigate the effect of the number of labeled data on our method, a test was performed by decreasing the number of revealed labels. The ratio of revealed labels is decreased as {100%, 80%, 60%, 40%, 20%}. The ACC scores with different labeled-sample ratios are shown in Fig. 1.

Table 2. Performance comparison of CPD-LapSVM with other methods for pMCI vs. sMCI classifications. Boldface denotes the best ACC & AUC performance.

pMCI vs. sMCI | Linear kernel                      | RBF kernel
Methods    | ACC (%) | SEN (%) | SPE (%) | AUC (%) | ACC (%) | SEN (%) | SPE (%) | AUC (%)
LapRLSC    | 67.09   | 62.66   | 80.42   | 79.11   | 76.09   | 92.82   | 29.61   | 76.49
LapSVM     | 66.37   | 61.86   | 79.61   | 79.03   | 76.14   | 88.70   | 40.17   | 76.46
OLapSVM    | 67.40   | 63.69   | 78.26   | 79.60   | 76.54   | 83.28   | 55.34   | 76.79
CPD-LapSVM | 69.12   | 67.92   | 72.03   | 78.07   | 78.27   | 86.38   | 52.32   | 78.58

Fig. 1. ACC score w.r.t. different ratios of revealed labels in pMCI vs. sMCI classifications.

Solid lines and the prefix "r" denote the results from kernelized classifiers, while dashed lines and "l" are for linear classifiers. It is clear that the ACC values of our CPD-LapSVM, with both the linear and the RBF Gaussian kernel, are consistently the best. Finally, we summarize several recent works on pMCI vs. sMCI classification as a comparison in Table 3. To the best of our knowledge, the best result so far was achieved by [14]. Among the methods using baseline MRIs only, our work achieves the best ACC score. However, it should be noted that direct comparisons of published neuroimaging algorithms are often not feasible. When different datasets and experimental setups are utilized, higher accuracy or better results over a competing solution ought to be interpreted more as side evidence of model efficacy than as proof of superiority in a head-to-head competition. As for the experiment and data setup, there could be different approaches to include the unlabeled samples. For example, MCI can be used as unknown for AD/NC [5,19], AD/NC as unknown for pMCI/sMCI [17], and uMCI as unknown for pMCI/sMCI [5,9]. We chose the last scheme, but other settings could certainly be carried out and tested.

Table 3. Comparisons of pMCI vs. sMCI classification solutions using the ADNI database. Boldface denotes the best performance for the measures ACC & AUC.

pMCI vs. sMCI
Methods               | ACC (%) | SEN (%) | SPE (%) | AUC (%)
Ye et al. [17]        | 56.10   | 94.10   | 40.80   | 73.00
Filipovych et al. [5] | –       | 79.40   | 51.7    | 69.00
Moradi et al. [9]     | 74.74   | 88.85   | 51.46   | 76.61
Cheng et al. [4]      | 73.80   | 69.00   | 77.40   | 79.60
Suk et al. [14]       | 74.82   | 70.93   | 78.82   | 75.89
CPD-LapSVM            | 78.27   | 86.38   | 52.32   | 78.58

4 Conclusions

In this paper, we have proposed a nonlinear feature transformation based semi-supervised learning strategy to enhance the prediction of MCI-AD conversion from MR images. The proposed CPD-LapSVM model takes advantage of the space deformations regulated by CPD to push the data samples towards better linear separability, which leads to improved LapSVM classification performance. Exploring more transformation models is one direction of our future efforts. We are also interested in applying the proposed strategy to other neuroimage analysis problems.

References

1. An, L., Adeli, E., Liu, M., Zhang, J., Shen, D.: Semi-supervised hierarchical multimodal feature and sample selection for Alzheimer's disease diagnosis. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 79–87. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_10
2. Belkin, M., Niyogi, P.: Semi-supervised learning on Riemannian manifolds. Mach. Learn. 56(1–3), 209–239 (2004)
3. Bellet, A., Habrard, A., Sebban, M.: A survey on metric learning for feature vectors and structured data. arXiv preprint (2013). arXiv:1306.6709
4. Cheng, B., Liu, M., Shen, D., Li, Z., Zhang, D.: Multi-domain transfer learning for early diagnosis of Alzheimer's disease. Neuroinformatics 15(2), 115–132 (2017). https://doi.org/10.1007/s12021-016-9318-5
5. Filipovych, R., Davatzikos, C., Initiative, A.D.N., et al.: Semi-supervised pattern classification of medical images: application to mild cognitive impairment (MCI). Neuroimage 55(3), 1109–1119 (2011)
6. Gu, Y., Feng, K.: Optimized Laplacian SVM with distance metric learning for hyperspectral image classification. AEORS 6(3), 1109–1117 (2013)
7. Huang, M., Yang, W., Feng, Q., Chen, W.: Longitudinal measurement and hierarchical classification framework for the prediction of Alzheimer's disease. Nature 7, 39880 (2017)
8. Liu, M., Zhang, J., Yap, P.T., Shen, D.: View-aligned hypergraph learning for Alzheimer's disease diagnosis with incomplete multi-modality data. Med. Image Anal. 36, 123–134 (2017)
9. Moradi, E., Pepe, A., Gaser, C., Huttunen, H., Tohka, J., Initiative, A.D.N., et al.: Machine learning framework for early MRI-based Alzheimer's conversion prediction in MCI subjects. Neuroimage 104, 398–412 (2015)
10. Myronenko, A., Song, X.: Point set registration: coherent point drift. TPAMI 32(12), 2262–2275 (2010)
11. Shi, B., Chen, Y., Hobbs, K., Liu, J., Smith, C.D.: Nonlinear metric learning for Alzheimer's disease with integration of longitudinal neuroimaging features. In: BMVC (2015)
12. Shi, B., Chen, Y., Zhang, P., Smith, C.D., Liu, J.: Nonlinear feature transformation and deep fusion for Alzheimer's disease staging analysis. Pattern Recognit. 63, 487–498 (2017)
13. Shi, B., Wang, Z., Liu, J.: Distance-informed metric learning for Alzheimer's disease staging. In: EMBC, pp. 934–937. IEEE (2014)
14. Suk, H.I., Lee, S.W., Shen, D.: Deep ensemble learning of sparse regression models for brain disease diagnosis. Med. Image Anal. 37, 101–113 (2017)
15. Tong, T., Gray, K., Gao, Q., Chen, L., Rueckert, D.: Multi-modal classification of Alzheimer's disease using nonlinear graph fusion. Pattern Recognit. 63, 171–181 (2017)
16. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning with application to clustering with side-information. In: NIPS, vol. 15, p. 12 (2002)
17. Ye, D.H., Pohl, K.M., Davatzikos, C.: Semi-supervised pattern classification: application to structural MRI of Alzheimer's disease. In: PRNI, pp. 1–4. IEEE (2011)
18. Zhang, C., Nie, F., Xiang, S.: A general kernelization framework for learning algorithms based on kernel PCA. Neurocomputing 73(4), 959–967 (2010)
19. Zhang, D., Shen, D.: Semi-supervised multimodal classification of Alzheimer's disease. In: ISBI, pp. 1628–1631. IEEE (2011)
20. Zhang, P., Shi, B., Smith, C.D., Liu, J.: Nonlinear metric learning for semi-supervised learning via coherent point drifting. In: ICMLA, pp. 314–319. IEEE (2016)
21. Zhu, Y., Zhu, X., Kim, M., Shen, D., Wu, G.: Early diagnosis of Alzheimer's disease by joint feature selection and classification on temporally structured support vector machine. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 264–272. Springer, Cham (2016). doi:10.1007/978-3-319-46720-7_31

Kernel Generalized-Gaussian Mixture Model for Robust Abnormality Detection

Nitin Kumar, Ajit V. Rajwade, Sharat Chandran, and Suyash P. Awate(B)

Computer Science and Engineering Department, Indian Institute of Technology (IIT) Bombay, Mumbai, India
[email protected]

Abstract. Typical methods for abnormality detection in medical images rely on principal component analysis (PCA), kernel PCA (KPCA), or their robust invariants. However, typical robust-KPCA methods use heuristics for model fitting and perform outlier detection ignoring the variances of the data within principal subspaces. In this paper, we propose a novel method for robust statistical learning by extending the multivariate generalized-Gaussian distribution to a reproducing kernel Hilbert space and employing it within a mixture model. We propose expectation maximization to fit our kernel generalized-Gaussian mixture model (KGGMM), using solely the Gram matrix and without the explicit lifting map. We exploit the KGGMM, including component means, principal directions, and variances, for abnormality detection in images. The results on 4 large publicly available datasets, involving retinopathy and cancer, show that our method outperforms the state of the art.

Keywords: Abnormality detection · One-class classification · Kernel methods · Robustness · Generalized Gaussian · Mixture model · Expectation maximization

The authors are grateful for funding from Aditya Imaging Information Technologies. S.P. Awate thanks funding from IIT Bombay Seed Grant 14IRCCSG010.

1 Introduction and Related Work

Abnormality detection in medical images [3,10] is a one-class classification problem [13], where training relies solely on data from the normal class. This is motivated by the difficulty of learning a model of abnormal image appearances because of their tremendous variability. Typical methods for abnormality detection rely on principal component analysis (PCA) or kernel PCA (KPCA) [4]. In clinical applications involving large training datasets intended to represent normal images, outliers naturally arise because of errors in specimen preparation (e.g., slicing or staining in microscopy), patient issues (e.g., motion), imaging artifacts, and manual mislabeling of abnormal images as normal. KPCA is very sensitive to outliers in the data, leading to unreliable inference. Some methods for abnormality detection [11] rely on PCA, assuming training sets to be outlier free. Typical robust KPCA (RKPCA) methods [3,5,7–9] are heuristic in their
modeling and inference. For instance, [7–9] employ ad hoc rules for explicitly detecting outliers in the training set. While [2,5] describe RKPCA based on iterative data-weighting using distance to the mean, the weighting functions seem ad hoc. CHLOE [9] also uses rules involving free parameters to weight data based on the kurtosis of individual features. One method [2] distorts data by projecting it onto a sphere (unit norm). In contrast, we propose a method using statistical (mixture) modeling to infer robust estimates of means and covariances. During estimation, our method implicitly, and optimally, reweights the data to reduce the effect of outliers, based on the covariance structure of the data. Typical abnormality detection methods [3,5,7,8] compute robust means and modes of variation, but fail to compute and exploit the variances along the modes. Thus, they perform poorly when the abnormal data lie within the subspace spanned by the normal data. In contrast, our method optimizes, in addition to means and modes, the associated variances to improve performance. Some methods [3,6] for robust PCA model learning rely on Lp norms (p ≥ 1) in the input space. In contrast, our method exploits Lq quasi-norms (q > 0) coupled with Mahalanobis distances in a reproducing kernel Hilbert space (RKHS). Some kernel methods for abnormality detection rely on the support vector machine (SVM), e.g., the one-class SVM [13] and support vector data description (SVDD) [15]. Unlike KPCA, these SVM methods model only a spherical distribution or decision boundary in RKHS and, thus, are inferior to KPCA theoretically and empirically [4]. Also, the SVM methods lack robustness to outliers in the training data. In contrast, our method is robust to outliers and enables us to model arbitrarily curved distributions as well as decision boundaries in RKHS.

We propose a novel method for robust statistical learning by extending the multivariate generalized-Gaussian distribution to a RKHS for mixture modeling. We propose expectation maximization (EM) to fit our kernel generalized-Gaussian mixture model (KGGMM), using solely the Gram matrix, without the explicit lifting map. We model geometric and photometric properties of image texture via standard texton-label histograms [16]. We exploit the KGGMM, including component means, principal directions, and variances, for abnormality detection. The results on 4 large publicly available datasets, involving retinopathy and cancer, show that our method outperforms the state of the art.

2 Methods

In $\mathbb{R}^D$, the generalized Gaussian [12] is parametrized by the mean $\mu \in \mathbb{R}^D$, covariance matrix $C \in \mathbb{R}^{D \times D}$, and shape $\rho \in \mathbb{R}_{>0}$; Gaussian ($\rho = 2$), Laplacian ($\rho = 1$), uniform ($\rho \to \infty$). We extend the generalized Gaussian to a RKHS for mixture modeling. We exploit $\rho < 1$, when the distribution has increased concentration near the mean and heavier tails, for robust fitting amidst outliers.

2.1 Kernel Generalized Gaussian (KGG)

Consider a set of $N$ data points $\{x_n \in \mathbb{R}^D\}_{n=1}^N$ in the input space. Consider a Mercer kernel $\kappa(\cdot,\cdot)$ that implicitly maps the data to a RKHS $H$ such that each datum $x_n$ gets mapped to $\phi(x_n)$.
Consider two vectors in the RKHS: $f := \sum_{i=1}^{I} \alpha_i \phi(x_i)$ and $f' := \sum_{j=1}^{J} \beta_j \phi(x_j)$. The inner product is $\langle f, f' \rangle_H := \sum_{i=1}^{I}\sum_{j=1}^{J} \alpha_i \beta_j \kappa(x_i, x_j)$, and the norm is $\|f\|_H := \sqrt{\langle f, f \rangle_H}$. When $f, f' \in H \setminus \{0\}$, let $f \otimes f'$ be the rank-one operator defined as $f \otimes f'(g) := \langle f', g \rangle_H \, f$. The generalized Gaussian extended to the RKHS is parametrized by the shape $\rho \in \mathbb{R}_{>0}$, mean $\mu \in H$, and covariance operator $C = \sum_{q=1}^{Q} \lambda_q \, v_q \otimes v_q$, where $\lambda_q$ is the $q$-th largest eigenvalue of the covariance $C$, $v_q$ is the corresponding eigenfunction, and $Q < N$ is a regularization parameter. We set $Q$ to the number of principal eigenfunctions that capture 95% of the eigenspectrum energy. For $f \in H$, the squared Mahalanobis distance is $d_M^2(f; \mu, C) := \langle f - \mu, C^{-1}(f - \mu) \rangle_H$, where $C^{-1} = \sum_{q=1}^{Q} (1/\lambda_q) \, v_q \otimes v_q$ is the sample inverse-covariance operator. Then, our generalized Gaussian in the RKHS is

$$P_G(f; \mu, C, \rho) := \frac{\delta(\rho/2)}{2\,|C|^{0.5}} \exp\Big(-\big[\eta(\rho/2)\, d_M^2(f; \mu, C)\big]^{\rho/2}\Big), \qquad (1)$$

where $\delta(r) := r\,\Gamma(2/r) / \big(\pi\,\Gamma(1/r)^2\big)$, $|C| := \prod_{q=1}^{Q} \lambda_q$, and $\eta(r) := \Gamma(2/r) / \big(2\,\Gamma(1/r)\big)$.

We propose to model the distribution of data x := {xn ∈ RD }N n=1 using a Mercer kernel to implicitly map the data to a RKHS, i.e., {φ(xn ) ∈ H}N n=1 , and then representing the distribution in RKHS using a mixture of KGG distributions. Consider a KGG mixture model with K components, where the k-th component is the KGG PG (·; μk , Ck , ρ) coupled with weight ωk ∈ R≥0 , such that ωk ≤ 1 and K k=1 ωk := 1. For each datum xn , let Zn be the hidden (label) random variable indicating the mixture component from which the datum was drawn. . Thus, we repEach mean μk must lie in the span of the mapped data {φ(xi )}N i=1 N resent each mean, using coefficient vector βk ∈ RN , as μk (βk ) := i=1 βki φ(xi ). Estimating μk is then equivalent to estimating βk . We represent each covariance operator Ck using its Q principal eigenvectors {vkq ∈ H}Q q=1 and eigenvalues Q {λkq ∈ R>0 }q=1 . Each eigenvector of Ck must lie in the span of the mapped data. So, we represent the q-th eigenvector of Ck , using coefficient vector αkq ∈ RN , as N vkq (αkq ) := j=1 αkqj φ(xj ). Estimating vkq is equivalent to estimating αkq . Model Fitting. We propose EM to fit the KGGMM to the mapped data to maximize the likelihood function. The prior label probability P (zn = k) := ωk . N The complete-data likelihood P (z, x) := n=1 P (zn )PG (φ(xn ); μzn , Czn , ρ). We show that EM does not need the map φ(·), but only the Gram matrix G, where Gij := φ(xi ), φ(xj )H = κ(xi , xj ). In our framework, ρ is a free parameter (fixed before EM) that we tune using training data; ρ < 1 gives best results. Initialization. We use kernel k-means to initialize the parameters. We initialize (i) mean μk to the k-th cluster center, (ii) weight ωk to the fraction of data assigned to cluster k, and (iii) covariance Ck using KPCA on cluster k.

24

N. Kumar et al.

t E Step. At the t-th iteration, let the set of parameters be θt := {βkt ∈ RN , {αkq ∈ Q N Q t t K R }q=1 , {λkq ∈ R}q=1 , ωk ∈ R}k=1 . Let αk denote a N × Q matrix, representing the Q eigenfunctions of Ck , such that its q-th column is αkq . Let λk denote a Q × Q diagonal matrix, representing the Q eigenvalues of Ck , such that its q-th diagonal element is λkq . Given θt , the E step defines the function Q(θ; θt ) := EP (Z|x,θt ) [log P (Z, x; θ)] that can be simplified to K N   n=1 k=1

 t γnk

     ρ  Q  log(λkq ) ρ φ(xn ) − μk (βk ), vkq (αkq )2H 2 + η log ωk − 2 2 λkq q=1

excluding terms independent of θ, and where the membership of datum xn to mixture component k, given the current parameter estimate θt , is the posterior t := P (Zn = k|xn , θt ) = ωkt PG (φ(xn ); μk , Ck , ρ)/P (xn ; θt ) by Bayes rule. γnk t M Step. The M step updates parameter estimates to θt+1 := arg maxθ Q(θ; θ ) subject to constraints on: (i) weights, such that ωk ≥ 0, k ωk = 1, (ii) eigenvalues, such that λkq > 0, and (iii) coefficients, such that eigenvectors vkq (αkq ) are unit norm (vkq H = 1) and mutually orthogonal (vkq , vkr H = 0, ∀q = r).
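In code, the E step is a standard responsibility computation; the sketch below (illustrative; it reuses a per-component log-density routine such as the one sketched after Eq. (1), and it also includes the closed-form weight update derived in the next paragraph) normalizes log ω_k plus the component log-densities in a numerically stable way:

```python
import numpy as np

def e_step(logpdf_nk, log_omega):
    """Responsibilities gamma_nk from an (N, K) array of component log-densities
    and the (K,) log mixture weights."""
    log_post = logpdf_nk + log_omega[None, :]            # log(omega_k * P_G)
    log_post -= log_post.max(axis=1, keepdims=True)      # stabilize before exponentiating
    gamma = np.exp(log_post)
    return gamma / gamma.sum(axis=1, keepdims=True)

def m_step_weights(gamma):
    """omega_k^{t+1} = sum_n gamma_nk / N (closed form from the Lagrangian)."""
    return gamma.mean(axis=0)
```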

Estimating Weights. The optimal weights $\omega_k^{t+1}$ are given by the solution to $\arg\max_{\omega} \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma_{nk}^t \log \omega_k$, subject to the positivity and sum-to-unity constraints. The method of Lagrange multipliers gives $\omega_k^{t+1} = \sum_{n=1}^{N} \gamma_{nk}^t / N$.

Estimating Means. Given the weights $\omega_k^{t+1}$, the optimal mean $\mu_k^{t+1}(\beta_k^{t+1})$ is given by $\beta_k^{t+1} := \arg\min_{\beta_k} \sum_n \gamma_{nk}^t \big[ \sum_q (G_n^\top \alpha_{kq} - \beta_k^\top G \alpha_{kq})^2 / \lambda_{kq} \big]^{\rho/2}$, where $G_n$ is the $n$-th column of the Gram matrix $G$. We optimize via gradient descent with an adaptive step size (adjusted at each update) to ensure that each update improves the objective function value. When $\rho = 2$, the mean estimate is the (weighted) sample mean, which is affected by outliers. As $\rho$ reduces, the effect of the outliers on the objective function decreases; the gradient term for an outlier $j$ is weighted down far more than for the inliers, leading to robust estimates.

Estimating Eigenvectors. Given the weights $\omega_k^{t+1}$ and means $\mu_k^{t+1}(\beta_k^{t+1})$, the optimal set of eigenfunctions $v_k^{t+1}(\alpha_k^{t+1})$ is given by $\alpha_k^{t+1} := \arg\min_{\alpha_k} \sum_n \gamma_{nk}^t \big[ (\alpha_k^\top G_n - \alpha_k^\top G \beta_k^{t+1})^\top \lambda_k^{-1} (\alpha_k^\top G_n - \alpha_k^\top G \beta_k^{t+1}) \big]^{\rho/2}$, subject to orthonormality constraints on the set of eigenfunctions $\{v_{kq}(\alpha_{kq})\}_{q=1}^Q$. We optimize via projected gradient descent with an adaptive step size, where each step (i) first uses a gradient-descent step to update the matrix $\alpha_k$ to $\tilde{\alpha}_k$, implicitly updating the eigenfunctions to $\{\tilde{v}_{kq}(\tilde{\alpha}_{kq})\}_{q=1}^Q$, and (ii) then updates $\tilde{\alpha}_k$ to $\alpha_k^{t+1}$ by projecting the eigenfunction set $\{\tilde{v}_{kq}(\tilde{\alpha}_{kq})\}_{q=1}^Q$ onto the space of orthogonal eigenfunction bases. In Euclidean space, the projection of a set of $Q$ vectors, represented as the columns of a matrix $M$, onto the space of $Q$ orthogonal vectors is given by $L R^\top$, where the matrices $L$ and $R$ comprise the left and right singular vectors in the singular value decomposition (SVD) of $M$; equivalently, $L R^\top = M (M^\top M)^{-0.5}$. In a RKHS, we replace the SVD by the kernel SVD as follows.


Consider Q functions F := {f_q ∈ H}_{q=1}^Q that are not orthogonal. Let the kernel SVD of F be the operator Σ_{q=1}^Q s_q a_q ⊗ b_q, where the singular values are s_q ∈ R_{≥0}, and the left and right singular vectors are the orthonormal sets {a_q ∈ H}_{q=1}^Q and {b_q ∈ R^Q}_{q=1}^Q, respectively. Consider the Q × Q matrix Y where Y_{ij} := ⟨f_i, f_j⟩_H. The matrix Y also equals (Σ_{q'=1}^Q s_{q'} b_{q'} ⊗ a_{q'})(Σ_{q=1}^Q s_q a_q ⊗ b_q), which reduces to Σ_{q=1}^Q s_q² b_q b_q^⊤ because of the orthogonality of the left singular vectors. Thus, an eigendecomposition of the matrix Y yields the eigenvalues as s_q² and the eigenvectors as b_q. Subsequently, we observe that the required projection of F onto the space of orthogonal functions in RKHS is given by (Σ_{q=1}^Q s_q a_q ⊗ b_q)(Y)^{-0.5} = (Σ_{q=1}^Q s_q a_q ⊗ b_q)(Σ_{q'=1}^Q s_{q'}^{-1} b_{q'} ⊗ b_{q'}) = Σ_{q=1}^Q a_q ⊗ b_q. In practice, when we represent the eigenvectors using the N × Q matrix α̃_k, the matrix Y = α̃_k^⊤ G α̃_k and the projection gives us α_k^{t+1} = α̃_k (Y)^{-0.5}.

Estimating Variances. Given weights ω_k^{t+1}, means μ_k^{t+1}(β_k^{t+1}), and eigenfunctions v_k^{t+1}(α_k^{t+1}), each optimal eigenvalue is given by λ_{kq}^{t+1} := arg min_{λ_{kq}>0} Σ_{n=1}^N γ_{nk}^t [ 0.5 log(λ_{kq}) + ( η (ρ/2) a_{nkq}² / λ_{kq} )^{ρ/2} ], where a_{nkq} := G_n^⊤ α_{kq}^{t+1} − (β_k^{t+1})^⊤ G α_{kq}^{t+1}. We optimize via projected gradient descent.
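A small sketch of the orthogonalizing projection step described above, assuming the Gram matrix G and the intermediate coefficient matrix α̃_k are available; the function name and the numerical guard are illustrative choices.

```python
import numpy as np

def project_onto_orthonormal_basis(alpha_tilde, G, eps=1e-12):
    """Project an (N x Q) coefficient matrix so that the eigenfunctions
    v_q = sum_j alpha[j, q] phi(x_j) it represents become orthonormal in the RKHS.

    Y = alpha_tilde^T G alpha_tilde holds the pairwise inner products <v_i, v_j>_H;
    the update alpha_tilde @ Y^{-1/2} is the kernel analogue of M (M^T M)^{-1/2}.
    """
    Y = alpha_tilde.T @ G @ alpha_tilde
    s2, B = np.linalg.eigh(Y)                  # eigenvalues s_q^2, eigenvectors b_q
    s2 = np.clip(s2, eps, None)                # guard against numerical non-positivity
    Y_inv_sqrt = B @ np.diag(1.0 / np.sqrt(s2)) @ B.T
    return alpha_tilde @ Y_inv_sqrt
```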

KGGMM for Abnormality Detection. We use the KGGMM with a small number of mixture components K, such that each component k models a significant fraction of the data, i.e., ωk are not close to zero and comparable for different components k. After KGGMM fitting, we define a decision boundary B enclosing the normal class by a threshold τ on the minimum Mahalanobis distance across all K mixture components, such that, for a chosen component k, 98.5% of the probability mass lies within B. τ varies with ρ; for the univariate Gaussian (ρ = 2) and variance σ 2 , τ limits the distance to 2.5σ from the component-k mean. For the univariate generalized Gaussian, τ can be computed via the inverse cumulative distribution function that is known analytically. Because τ relies on Mahalanobis distance that is independent of scale, τ naturally extends to the multivariate case. Thus, B is set automatically via ρ and θ.
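A sketch of this decision rule under the coefficient representation used above: the minimum RKHS Mahalanobis distance over components is computed purely from kernel evaluations, and a test point is flagged as abnormal when that distance exceeds the threshold τ. The function and argument names are placeholders, not the authors' implementation.

```python
import numpy as np

def min_mahalanobis(G_test_cols, alphas, betas, lambdas, G):
    """Minimum RKHS Mahalanobis distance of each test point to the K components.

    G_test_cols : (N, M) kernel evaluations kappa(x_i, x_test_m) against training data.
    alphas      : list of K (N, Q) coefficient matrices for the eigenfunctions.
    betas       : list of K (N,) coefficient vectors for the means.
    lambdas     : list of K (Q,) eigenvalue vectors.
    G           : (N, N) training Gram matrix.
    """
    dists = []
    for alpha, beta, lam in zip(alphas, betas, lambdas):
        # <phi(x_test) - mu_k, v_kq>_H for all test points and all q
        proj = G_test_cols.T @ alpha - (beta @ G @ alpha)[None, :]
        dists.append(np.sum(proj**2 / lam[None, :], axis=1))
    return np.min(np.stack(dists, axis=1), axis=1)

# A point is declared abnormal when its minimum distance exceeds tau, with tau chosen
# so that 98.5% of the component's probability mass lies inside the boundary B:
#   is_abnormal = min_mahalanobis(...) > tau
```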

3 Results and Discussion

We evaluate our method for abnormality detection on simulated data and 4 large publicly available medical image datasets. Indeed, the training, i.e., model learning, for abnormality detection methods relies solely on data from the normal class, which includes outliers and data incorrectly labeled as belonging to the normal class. We compare our KGGMM method with 7 other methods: (i) KGG, which is a special case of KGGMM with K = 1, (ii) standard KPCA [14], which is a special case of KGG when ρ = 2, (iii) Huang et al.'s RKPCA [5], (iv) one-class regularized kernel SVM [13], (v) regularized kernel SVDD [15], (vi) 2-class regularized kernel SVM, and (vii) CHLOE, a software tool tuned for outlier detection in images [9]. We use cross validation to tune free parameters underlying all methods, i.e., concerning the kernel, ρ (for KGGMM), and SVM regularization.


Results on Simulated Data. We simulate data in 2D Euclidean space to mimic what a real-world dataset would lead to in RKHS (after kernel-based mapping). We simulate data (Fig. 1) from a Gaussian mixture having K = 2 components (normal class): mean (0, 5) and (5, 0), modes of variation as the cardinal axes, and standard deviations along the modes of variation as (0.25, 1.4) and (1.4, 0.25). We then contaminate the data with outliers of 2 kinds: (i) spread uniformly over the domain; (ii) clustered at a location far away. For training, the normal-class sample size is 5000 contaminated with 1000 outliers. For testing, the normal-class sample size is 5000 and abnormal-class sample size is 3000. The kernel is the Euclidean inner-product. Our KGGMM learning (with K = 2, ρ = 0.6) is far more robust to outliers, with a classification accuracy of 93%, outperforming (i) KGG (K = 1, ρ = 0.6; accuracy 77%), (ii) KPCA (accuracy 54%), (iii) SVDD (accuracy 70%), and (iv) 2-class SVM (accuracy 38%). Results on Real-World Medical Image Data. We use 4 large publicly available image datasets. We use 2 retinopathy datasets: Messidor (www.adcis.net/en/Download-Third-Party/Messidor.html; Fig. 2) and Kaggle (www.kaggle.com/c/diabetic-retinopathy-detection; Fig. 3). We use 2 endoscopy datasets for cancer detection: chromoendoscopy in Gastric cancer (aidasub-chromogastro.grand-challenge.org; Fig. 5) and confocal laser endomicroscopy in Barrett’s esophagus (aidasub-clebarrett.grand-challenge.org; Fig. 6) comprising normal images including intestinal metaplasia and 2 kinds of abnormal images including dysplasia (potentially leading to cancer) and neoplastic mucosa (advanced stage cancer). The figures show that all 4 datasets, even though carefully constructed, already have outliers in the normal class. We use

Fig. 1. Results on simulated data. Data from a 2D Gaussian mixture model (2 components, shown in blue and black) contaminated with outliers (red and green).


Fig. 2. Retinopathy data: Messidor. (a)–(b) Normal images. (c)–(d) Images labeled normal, but are outliers. (e)–(f ) Abnormal images.


Fig. 3. Retinopathy data: Kaggle. (a)–(b) Normal images. (c)–(d) Images labeled normal, but are outliers. (e)–(f ) Abnormal images.

[Fig. 4 plots: classification accuracy; panel (a) KGGMM only, recognition accuracy vs ρ; panel (b) Messidor data, all methods; panel (c) Kaggle data, all methods. Methods shown: KGGMM, KGG, KPCA, Huang, SVM BIN, SVM ONE, SVDD, CHLOE.]

Fig. 4. Results on retinopathy data. Classification accuracy after learning on training sets contaminated with outliers, for: (a) KGGMM, Messidor, varying ρ, (ρ = 2 is Gaussian); (b) all methods, Messidor; (c) all methods, Kaggle. The box plots show variability in accuracy with resampling (uniform random) training data (20 repeats).


Fig. 5. Chromoendoscopy data: gastric cancer. (a)–(b) Normal images. (c)– (d) Images labeled normal, but are outliers. (e)–(f ) Abnormal images.

the texton-based histogram feature, using patches (9 × 9) to compute textons, to classify regions (50 × 50) as normal or abnormal. We use the intersection kernel [1]. From each dataset, we select training sets with 12000 normal image regions and, to mimic a clinical scenario, contaminate them by adding another 5–10% of abnormal image regions mislabeled as normal. The test set has 8000 normal and 5000 abnormal images. KGGMM performs best when ρ < 1 in the retinopathy (Fig. 4(a)) and endoscopy datasets (Fig. 7(a)). The abnormality-detection accuracy of KGGMM is significantly higher than that of all other methods for retinopathy (Fig. 4(b)–(c)) and endoscopy data (Fig. 7(b)–(c)). In almost all cases, KGGMM (we use K = 2 for model simplicity) performs better than KGG.

Conclusion. We have proposed a novel method for robust kernel-based statistical learning that relies on the generalization of the multivariate generalized Gaussian



Fig. 6. Confocal endoscopy data: Barrett's esophageal cancer. (a)–(b) Normal images. (c)–(d) Images labeled normal, but are outliers. (e)–(f) Abnormal images.

[Fig. 7 plots: classification accuracy; panel (a) KGGMM only, recognition accuracy vs ρ; panel (b) Gastric Cancer, all methods; panel (c) Esophageal Cancer, all methods. Methods shown: KGGMM, KGG, KPCA, Huang, SVM BIN, SVM ONE, SVDD, CHLOE.]

Fig. 7. Results on endoscopy data. Classification accuracy after learning on training sets contaminated with outliers, for: (a) KGGMM, gastric cancer, varying ρ, (ρ = 2 is Gaussian); (b) all methods, gastric cancer; (c) all methods, esophageal cancer. Box plots show accuracies with resampling (uniform random) training data (20 repeats).

to RKHS for mixture modeling. We fit our KGGMM using EM, relying solely on the Gram matrix. We exploit the KGGMM, including covariance operators, for abnormality detection in medical applications where a (small) fraction of training data is inevitably contaminated because of outliers and mislabeling. The results on 4 large datasets, in retinopathy and cancer, show that KGGMM outperforms one-class classification methods (KPCA, one-class kernel SVM, kernel SVDD), 2-class kernel SVM, and software tuned for outlier detection in images [9].

References 1. Barla, A., Odone, F., Verri, A.: Histogram intersection kernel for image classification. In: IEEE International Conference on Image Processing, vol. 3, pp. 513–516 (2003) 2. Debruyne, M., Verdonck, T.: Robust kernel principal component analysis and classification. Adv. Data Anal. Classif. 4(2), 15167 (2010) 3. Fritsch, V., Varoquaux, G., Thyreau, B., Poline, J.B., Thirion, B.: Detecting outliers in high-dimensional neuroimaging datasets with robust covariance estimators. Med. Imag. Anal. 16(7), 1359–1370 (2012) 4. Hoffmann, H.: Kernel PCA for novelty detection. Pattern Recog. 40(3), 863 (2007) 5. Huang, H., Yeh, Y.: An iterative algorithm for robust kernel principal component analysis. Neurocomputing 74(18), 3921–3930 (2011) 6. Kwak, N.: Principal component analysis by Lp-norm maximization. IEEE Trans. Cybern. 44(5), 594–609 (2014)


7. Li, Y.: On incremental and robust subspace learning. Pattern Recog. 37(7), 1509–1518 (2004) 8. Lu, C., Zhang, T., Zhang, R., Zhang, C.: Adaptive robust kernel PCA algorithm. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 6, pp. 621–624 (2003) 9. Manning, S., Shamir, L.: CHLOE: a software tool for automatic novelty detection in microscopy image datasets. J. Open Res. Soft. 2(1), 1–10 (2014) 10. Mourao-Miranda, J., Hardoon, D., Hahn, T., Williams, S., Shawe-Taylor, J., Brammer, M.: Patient classification as an outlier detection problem: an application of the one-class support vector machine. Neuroimage 58(3), 793–804 (2011) 11. Norousi, R., Wickles, S., Leidig, C., Becker, T., Schmid, V., Beckmann, R., Tresch, A.: Automatic post-picking using MAPPOS improves particle image detection from cryo-EM micrographs. J. Struct. Biol. 182(2), 59–66 (2013) 12. Novey, M., Adali, T., Roy, A.: A complex generalized Gaussian distribution: characterization, generation, and estimation. IEEE Trans. Sig. Proc. 58(3), 1427–1433 (2010) 13. Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R.: Estimating the support of a high-dimensional distribution. Neural Comp. 13(7), 1443 (2001) 14. Scholkopf, B., Smola, A., Muller, K.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comp. 10(5), 1299–1319 (1998) 15. Tax, D., Duin, R.: Support vector data description. Mach. Learn. 54(1), 45–66 (2004) 16. Varma, M., Zisserman, A.: A statistical approach to material classification using image patch exemplars. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2032–2047 (2009)

Latent Processes Governing Neuroanatomical Change in Aging and Dementia

Christian Wachinger1(B), Anna Rieckmann2, and Martin Reuter3,4

1 Artificial Intelligence in Medical Imaging (AI-Med), KJP, LMU München, Munich, Germany
[email protected]
2 Department of Radiation Sciences, Umeå University, Umeå, Sweden
3 DZNE, Bonn, Germany
4 Department of Radiology, Harvard Medical School, Boston, USA

Abstract. Clinically normal aging and pathological processes cause structural changes in the brain. These changes likely occur in overlapping regions that accommodate neural systems with high susceptibility to deleterious factors. Due to the overlap, the separation between aging and pathological processes is challenging when analyzing brain structures independently. We propose to identify multivariate latent processes that govern cross-sectional and longitudinal neuroanatomical changes across the brain in aging and dementia. A discriminative representation of neuroanatomy is obtained from spectral shape descriptors in the BrainPrint. We identify latent factors by maximizing the covariance between morphological change and response variables of age and a proxy for dementia. Our results reveal cross-sectional and longitudinal patterns of change in neuroanatomy that distinguish aging processes from disease processes. Finally, latent processes not only yield a parsimonious model but also significantly improve prediction accuracy.

1 Introduction

We view aging as the passage of time that is characterized by a multifaceted set of neurobiological cascades that occur at different rates in different people, together with complex and often interdependent effects on cognitive decline [1]. The distinction between disease-related processes and “normal” aging is important for etiology and diagnostics, however, the boundaries of aging and neurodegenerative diseases remain difficult to separate [7]. Some of the aging-related neurobiological changes may be the result of developing pathology, such as preclinical Alzheimer’s disease (AD) or incipient cerebrovascular disease. Other changes may have similarities to certain diseases, while arising from a different etiology than the pathological pathway linked to disease, such as dopamine loss in Parkinson’s disease. In this article, we investigate if it is possible to use magnetic resonance imaging to differentiate disease-related changes in brain morphology from those associated with normal aging in the same set of individuals. Important for the distinction between what is normal and what may be an indicator of disease is to consider that brain structural changes are not uniform


across the brain. Regions that accommodate neural systems with high susceptibility to deleterious factors are likely affected by disease as well as aging-related changes. At the same time, there are other structures, which are known to show effects of aging but are relatively spared in a neurodegenerative disease, e.g., the striatum in Alzheimer’s disease [7]. Capitalizing on the regional heterogeneity in aging and disease-related effects, joint modeling of changes across multiple structures rather than focusing on single structures in isolation is a promising avenue to identifying general patterns that are best at distinguishing aging from disease. We assume that there are a number of underlying processes that cause these changes, which we model as latent factors. It is further known that aging and disease can cause different changes in subregions of anatomical structures. For instance, recent analysis on high-field MRI suggests that hippocampal subfields subiculum and CA1 are associated with AD and CA3/DG with aging [7]. While volume measurements of the entire hippocampus do not permit to distinguish between such variations in subregions, they cause shape changes in the structure that can potentially be identified by shape descriptors. To obtain a discriminative characterization of neuroanatomy, we work with BrainPrint [11]; a composition of spectral shape descriptors from the LaplaceBeltrami operator. The change in shape is studied in cross-sectional and longitudinal designs. We hypothesize that observed, high-dimensional shape changes are governed by a few underlying, latent processes. We identify neuroanatomical processes that are best associated to aging and disease by maximizing the covariance between morphology and response variables, yielding the projection of the data to latent structures. An alternative to the latent factor model would be to directly estimate changes in aging from clinically normal subjects, however, it is presumed that preclinical forms of disease are likely to be present in normals so that a pure aging process cannot be measured [7]. In this work, we focus on Alzheimer’s disease but the developed technology is of general nature. Related Work: A model for healthy aging based on image voxels and relevance vector regression was used for the prediction of Alzheimer’s disease in [6]. The longitudinal progression of AD-like patterns in brain atrophy in normal aging subjects and, furthermore, an accelerated AD-like atrophy in subjects with mild cognitive impairment (MCI) was reported in [2]. A framework for the spatiotemporal statistical analysis of longitudinal shape data based on diffeomorphic deformation fields was presented in [4].

2 Method

We are given structural magnetic resonance (MR) images from N subjects, I1 , . . . , IN , with corresponding response variables, R1 , . . . , RN , including age and scores from neuropsychological tests. MR scans for each subject n are available for m time points, In1 , . . . , Inm . Our objective is to find patterns of neuroanatomical change associated to aging and dementia.

2.1 Longitudinal Change in Brain Morphology

We use BrainPrint [11] as representation of brain morphology based on the automated segmentation of anatomical brain structures with FreeSurfer [5]. BrainPrint uses the spectral shape descriptor shapeDNA [10] to capture shape information from cortical and subcortical structures. ShapeDNA is computed from the intrinsic geometry of an object by calculating the Laplace-Beltrami spectrum

Δf = −λf.    (1)

The solution consists of eigenvalue λ_i ∈ R and eigenfunction f_i pairs, sorted by eigenvalues, 0 ≤ λ_1 ≤ λ_2 ≤ . . . The first l non-zero eigenvalues, computed with the finite element method, form the ShapeDNA: λ = (λ_1, . . . , λ_l). We normalize the eigenvalues to make the representation independent of the objects' scale and therefore focus on the shape information, λ' = vol^{2/d} λ, where vol is the Riemannian volume of the d-dimensional manifold (i.e., the area for 2D surfaces) [10]. We further linearly re-weight the eigenvalues, λ̂_i = λ'_i / i, to balance the impact of higher eigenvalues that show higher variance [11]. The morphology of each scan I is described by the concatenation of the spectra of η brain structures, Λ = (λ_1, . . . , λ_η), yielding a D = l · η dimensional representation.

In addition to the cross-sectional BrainPrint Λ_n for subject n, we also compute the longitudinal change in morphology. Given the BrainPrints Λ_n^1, . . . , Λ_n^m for m time points, we use linear least-squares fitting to estimate the slope s_n. The slope has the same dimensionality D as the original shape characterization and captures longitudinal morphological change within a subject. We process follow-up scans with the longitudinal processing stream in FreeSurfer [9], which avoids processing bias in the surface reconstruction and segmentation by an unbiased, robust, within-subject template creation.
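A minimal sketch of this feature construction, assuming the per-structure eigenvalues and Riemannian volumes have already been computed (e.g., with a shapeDNA/FreeSurfer pipeline); the function names and input format are illustrative assumptions, not the released BrainPrint code.

```python
import numpy as np

def shape_dna_features(eigvals, vol, d=2):
    """Scale-normalize and linearly re-weight the first l Laplace-Beltrami eigenvalues
    of one structure: lambda' = vol**(2/d) * lambda, lambda_hat_i = lambda'_i / i."""
    lam = np.asarray(eigvals, dtype=float)
    lam_norm = vol ** (2.0 / d) * lam
    return lam_norm / np.arange(1, len(lam) + 1)

def brainprint(per_structure):
    """Concatenate the spectra of eta structures into one D = l * eta vector.
    per_structure: iterable of (eigvals, vol) pairs (illustrative input format)."""
    return np.concatenate([shape_dna_features(ev, v) for ev, v in per_structure])

def longitudinal_slope(times, brainprints):
    """Least-squares slope of each BrainPrint feature over m time points.
    times: (m,) acquisition times; brainprints: (m, D) stacked BrainPrints."""
    A = np.vstack([np.asarray(times, float), np.ones(len(times))]).T
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(brainprints, float), rcond=None)
    return coeffs[0]  # one slope per feature, the subject-level change vector s_n
```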

2.2 Latent Factor Model

We consider observed neuroanatomical changes as the result of a combination of a few underlying processes related to aging and disease that are shared across the population. The response variables are chronological age and performance of the mini-mental state examination (MMSE), a clinical screening instrument for loss of memory and intellectual abilities (from hereon simply referred to as age and dementia, respectively). Our objective is to extract latent factors that account for much of the manifest factor variation. Latent variable models such as factor analysis, principal component analysis (PCA), or independent component analysis are a natural choice for this task. However, these models only focus on describing the data matrix and do not take the response variables into account. The extracted components may therefore well explain the variation in the data but may not be associated to specific variations in aging or dementia. To address this issue, we use projections to latent structures (PLS) [12], also known as partial least squares, which combines information about the variation of both the predictors and the responses, and the correlations among them.


The rows of the data matrix X ∈ R^{N×D} are the baseline BrainPrint Λ_n or the slopes s_n, depending on a cross-sectional or longitudinal analysis. The matrix Y ∈ R^{N×M} gathers associated responses, with M the dimensionality of the responses. PLS regression searches for a set of components or latent vectors that performs a simultaneous decomposition of the data matrix X and the response matrix Y with the constraint that these components explain as much as possible of the covariance between X and Y. The underlying assumption of PLS is that the observed data is generated by a system or process, which is driven by a small number of latent variables. For K latent factors and mean-centered matrices X and Y, the PLS equation model is

x_n = Σ_{k=1}^K t_{n,k} p_k + e_n,    X = T P^⊤ + E,    (2)
y_n = Σ_{k=1}^K u_{n,k} q_k + f_n,    Y = U Q^⊤ + F,    (3)

where we show next to the matrix notation also the vector notation to highlight the notion of neuroanatomy x_n being explained by a linear combination of a few processes p_k. The loadings matrix P ∈ R^{D×K} contains all the K processes. The scores matrix T ∈ R^{N×K} presents a lower, K-dimensional embedding of a subject. The matrices U ∈ R^{N×K} and Q ∈ R^{M×K} contain scores and loadings with respect to the response variable. E and F are matrices of residuals. The relation between the scores T and the original variables X is expressed as T = XW with the weight matrix W. The weights provide an interpretation of the scores and are essential for understanding which of the original variables are important. The PLS method is an iterative procedure that finds, in a first step, the latent score vectors t_1 and u_1 by maximizing the sample covariance among predictors and responses

[r̂, ŝ] = arg max_{‖r‖=‖s‖=1} [cov(Xr, Y s)]² = arg max_{‖r‖=‖s‖=1} [r^⊤ X^⊤ Y s / N]²,    (4)

where t_1 = X r̂ and u_1 = Y ŝ. From Eq. (4), we see that r̂ and ŝ correspond to the first pair of left and right singular vectors, which permits an efficient computation. After the first score vectors were obtained, the matrices are deflated by subtracting their rank-one approximations based on t_1 and u_1. Several algorithms exist that vary in the details of the iterative scheme; we use SIMPLS [3], which directly deflates the cross-product X^⊤ Y and therefore makes the factors easier to interpret as linear combinations of the original variables. PLS is related to PCA, which maximizes max_{‖r‖=1} var(Xr). PCA finds principal components that explain the data well, but does not account for corresponding response variables, which makes the association of components to specific processes difficult and limits the predictive power. Canonical correlation analysis (CCA) takes the response variable into account and maximizes the correlation, max_{‖r‖=‖s‖=1} cor(Xr, Y s)². The maximization of the covariance in PLS and the


correlation in CCA are similar, since cov(Xr, Y s)² = var(Xr) · cor(Xr, Y s)² · var(Y s), so that PLS also requires the components to explain the variances.
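For illustration, a latent factor analysis of this kind can be run with scikit-learn's PLSRegression; note that this implementation uses NIPALS-style deflation rather than SIMPLS, so it is only an approximate stand-in for the model described here, and the data below are random placeholders shaped like the paper's setup (393 subjects, 23 structures × 5 eigenvalues, responses age and MMSE).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(393, 115))   # placeholder for the BrainPrint (or slope) matrix
Y_raw = rng.normal(size=(393, 2))     # placeholder for the responses [age, MMSE]

X = StandardScaler().fit_transform(X_raw)   # center and scale predictors to unit variance
Y = Y_raw - Y_raw.mean(axis=0)              # mean-center the responses

pls = PLSRegression(n_components=4, scale=False).fit(X, Y)
T = pls.transform(X)        # (N, K) latent scores: the low-dimensional subject embedding
P = pls.x_loadings_         # (D, K) loadings, analogous to the processes p_k
W = pls.x_weights_          # (D, K) weights; large entries mark influential variables

# Correlate each latent score with the responses, in the spirit of Table 1
for k in range(T.shape[1]):
    corr = [np.corrcoef(T[:, k], Y[:, j])[0, 1] for j in range(Y.shape[1])]
    print(f"component {k + 1}: correlations with responses = {corr}")
```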

3 Results

We perform experiments on data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [8]. We select all subjects from the ADNI1 dataset with baseline scans and follow-up scans after 6, 12, and 24 months, resulting in N = 393 subjects and 1,572 MRI scans. The average age is 75.4 years (SD = 6.31), with 221 male and 172 female subjects, and diagnosis HC = 133, MCI = 175, AD = 85. We use meshes from η = 23 brain structures in the multivariate analysis. Per mesh, we compute l = 5 eigenvalues. To a priori give equal importance to all variables in X, we center them and scale them to unit variance. We jointly model age and MMSE as response variables in Y. We set the number of PLS components K = 4 according to results from explained variance in X and Y, together with an evaluation of the model complexity by computing the mean squared error with 5-fold cross-validation and 20 Monte-Carlo repetitions. To understand the identified components, we investigate the scores in matrix T, which form a four-dimensional embedding of each subject. Table 1 reports the correlation of the low-dimensional embedding with respect to age and MMSE. The first cross-sectional process is significantly associated to MMSE and age. The correlations vary in the sign, which is explained by the general decrease in cognitive ability with increasing age. The second and fourth cross-sectional processes only show significant correlations with MMSE, while the third one shows a significant correlation with age. For the longitudinal processes, each one only shows significant correlations with one of the response variables, yielding two dementia (first and third) and two aging (second and fourth) processes. Overall, the correlation for MMSE is higher than for age. We will focus on the longitudinal progression in the following analysis because it more accurately reflects true aging-related brain changes, while cross-sectional estimates can be subject to selective drop out and biased sampling. To gain insights about the neuroanatomical change evoked by the latent factors, we study the weight matrix W, where numerically large weights indicate the importance of X variables [12]. Figures 1 and 2 illustrate a lateral and medial view of the four processes by coloring the brain structures according to their weights, summed up

Table 1. Pearson’s correlation of the four cross-sectional and longitudinal processes with MMSE and age. Significant correlations (p < 0.01) are in bold face.

        Cross-sectional              Longitudinal
        1      2      3      4       1      2      3      4
MMSE   −0.46  −0.37   0.06  −0.20   −0.41  −0.16   0.29  −0.03
Age     0.36  −0.10   0.36   0.11   −0.17   0.30   0.04   0.20


Fig. 1. Lateral view on longitudinal processes on the left hemisphere. Colors are summed up weights across eigenvalues and signify importance.

across eigenvalues. The first and third processes relate to progression of dementia (Table 1). The first process describes opposing effects on hippocampus and amygdala on the one hand, and the lateral and third ventricle on the other hand. This pattern likely reflects the typical brain changes associated with dementia: shrinkage of the hippocampus and amygdala together with an expansion of the ventricular spaces. When comparing the first and third process, we have to consider that one is positively and one is negatively correlated with MMSE, i.e., the colors are inverted. This suggests the existence of two separable dementia-related processes with inverse effects on the amygdala and accumbens. For the aging processes (second and fourth), the weights of the hippocampus and amygdala are notably lower in comparison to the dementia processes. Aging processes mainly evoke shape changes in the ventricular system, where process two exhibits higher weights for lateral ventricles and process four for the third ventricle. Finally, we evaluate the predictive performance of the latent factor model with cross-validation and compare it to traditional multiple linear regression (MLR) on BrainPrint. We further compare to the prediction with volume measurements instead of shape in the PLS model. Figure 3 shows the mean absolute prediction error for age and MMSE on cross-sectional data. The prediction with PLS BrainPrint yields significant improvements over PLS volume and MLR BrainPrint, highlighting the advantages of neuroanatomical shape characterization and the latent factor model.


Fig. 2. Medial view on longitudinal processes on the left hemisphere. Colors are summed up weights across eigenvalues and signify importance.

Fig. 3. Prediction results from cross-validation for MMSE (left) and age (right). Bars show mean absolute error and lines show standard error. * and *** indicate significance levels at 0.05 and 0.001.

4 Conclusion

We presented a method for identifying latent processes that cause structural changes associated with aging and dementia in cross-sectional and longitudinal designs. Neuroanatomical changes were computed with the BrainPrint, and subsequently projected to latent structures by accounting for the response variables. Taken together, the results reveal the existence of four latent processes


that separate the progression of dementia from normal aging and that can be clearly distinguished neuroanatomically. The large majority of previous work has used univariate analyses to relate morphology of individual structures or voxels to aging and dementia. Our work demonstrates the importance of multivariate analysis of multiple brain structures to accurately capture related, but spatially distributed, morphological changes of brain shapes. Finally, the latent factor model with BrainPrint yielded significantly better prediction results. Future work may now investigate possible neuropathological correlates of the morphological processes identified in this paper (i.e., tau pathology versus accumulation of amyloid).

Acknowledgement. This work was supported in part by the Faculty of Medicine at LMU (FöFoLe) and the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B).

References 1. Buckner, R.L.: Memory and executive function in aging and AD. Neuron 44(1), 195–208 (2004) 2. Davatzikos, C., Xu, F., An, Y., Fan, Y., Resnick, S.M.: Longitudinal progression of Alzheimer’s-like patterns of atrophy in normal older adults: the SPARE-AD index. Brain 132(8), 2026–2035 (2009) 3. De Jong, S.: SIMPLS: an alternative approach to partial least squares regression. Chemometr. Intell. Lab. Syst. 18(3), 251–263 (1993) 4. Durrleman, S., Pennec, X., Trouv´e, A., Braga, J., Gerig, G., Ayache, N.: Toward a comprehensive framework for the spatiotemporal statistical analysis of longitudinal shape data. Int. J. Comput. Vis. 103(1), 22–59 (2013) 5. Fischl, B., Salat, D.H., et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33(3), 341–355 (2002) 6. Gaser, C., Franke, K., Kl¨ oppel, S., Koutsouleris, N., Sauer, H.: Brainage in mild cognitive impaired patients: predicting the conversion to Alzheimer’s disease. PloS one 8(6), e67346 (2013) 7. Jagust, W.: Vulnerable neural systems and the borderland of brain aging and neurodegeneration. Neuron 77(2), 219–234 (2013) 8. Mueller, S.G., Weiner, M.W., Thal, L.J., et al.: The Alzheimer’s disease neuroimaging initiative. Neuroimaging Clin. North Am. 15(4), 869–877 (2005) 9. Reuter, M., Schmansky, N.J., Rosas, H.D., Fischl, B.: Within-subject template estimation for unbiased longitudinal image analysis. Neuroimage 61(4), 1402–1418 (2012) 10. Reuter, M., Wolter, F.E., Peinecke, N.: Laplace-beltrami spectra as “shape-DNA” of surfaces and solids. Comput. Aided Des. 38(4), 342–366 (2006) 11. Wachinger, C., Golland, P., Kremen, W., Fischl, B., Reuter, M.: Brainprint: a discriminative characterization of brain morphology. Neuroimage 109, 232–248 (2015) 12. Wold, S., Sj¨ ostr¨ om, M., Eriksson, L.: PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. Syst. 58(2), 109–130 (2001)

A Multi-armed Bandit to Smartly Select a Training Set from Big Medical Data

Benjamín Gutiérrez1,2(B), Loïc Peter2,3, Tassilo Klein4, and Christian Wachinger1

1 Artificial Intelligence in Medical Imaging (AI-Med), KJP, LMU München, Munich, Germany
[email protected]
2 CAMP, Technische Universität München, Munich, Germany
3 Translational Imaging Group, University College London, London, UK
4 SAP SE Berlin, Berlin, Germany

Abstract. With the availability of big medical image data, the selection of an adequate training set is becoming more important to address the heterogeneity of different datasets. Simply including all the data does not only incur high processing costs but can even harm the prediction. We formulate the smart and efficient selection of a training dataset from big medical image data as a multi-armed bandit problem, solved by Thompson sampling. Our method assumes that image features are not available at the time of the selection of the samples, and therefore relies only on meta information associated with the images. Our strategy simultaneously exploits data sources with high chances of yielding useful samples and explores new data regions. For our evaluation, we focus on the application of estimating the age from a brain MRI. Our results on 7,250 subjects from 10 datasets show that our approach leads to higher accuracy while only requiring a fraction of the training data.

1 Introduction

Machine learning has been one of the driving forces for the huge progress in medical imaging analysis over the last years. Of key importance for learning-based techniques is the training dataset that is used for estimating the model parameters. Including all available data in a training set is becoming increasingly impractical, since processing the data to create training models can be very time consuming on huge datasets. In addition, most processing may be unnecessary because it does not help the model estimation for a given task. In this work, we propose a method to select a subset of the data for training that is most relevant for a specific task. Foreshadowing some of our results, such a guided selection of a subset for training can lead to a higher performance than using all the available data while requiring only a fraction of the processing time. The task of selecting a subset of the data for training is challenging because at the time of making the decision, we have not yet processed the data and therefore do not know how the inclusion of the sample would affect the prediction. On the other hand, in many scenarios each image is assigned metadata about


the subject (sex, diagnosis, age, etc.) or the image acquisition (dataset of origin, location, imaging device, etc.). We hypothesize that some of this information can be useful to guide the selection of samples, but it is a priori not clear which information is most relevant and how it should be used to guide the selection process. To address this, we formulate the selection of the samples to be included in a training set as a reinforcement learning problem, where a trade-off must be reached between the exploration of new sources of data and the exploitation of sources that have been shown to lead to informative data points in the past. More specifically, we model this as a multi-armed bandit problem solved with Thompson sampling, where each arm of the bandit corresponds to a cluster of samples generated using meta information. In this paper, we apply our sample selection method to brain age estimation [7] from MRI T1 images. The estimated age serves as a proxy for biological age, whose difference to the chronological age can be used as an indicator of disease [6]. Age estimation is a well-suited application for testing our algorithm as it allows us to work with a large number of datasets, since the subject's age is one of the few variables that is included in every neuroimaging dataset.

1.1 Related Work

Our work is mostly related to active learning approaches, whose aim is to select samples to be labeled out of a pool of unlabeled data. Examples of active learning approaches applied to medical imaging tasks include the work by Hoi et al. [9], where a batch mode active learning approach was presented for selecting medical images for manually labeling the image category. Another active learning approach was proposed for the selection of histopathological slices for manual annotation in [21]. The problem was formulated as constrained submodular optimization problem and solved with a greedy algorithm. To select a diverse set of slices, the patient identity was used as meta information. From a methodological point of view, our work relates to the work of Bouneffouf et al. [1], where an active learning strategy based on contextual multi-armed bandits is proposed. The main difference between all these active learning approaches and our method is that image features are not available a priori in our application, and therefore can not be used in the sample selection process. Our work also relates to domain adaptation [15,20]. In instance weighting, the training samples are assigned weights according to the distribution of the labels (class imbalance) [10] and the distribution of the observations (covariate shift) [16]. Again these methods are not directly applicable in our scenario because the distribution of the metadata is not always defined on the target dataset.

2 Method

2.1 Incremental Sample Selection

In supervised learning, we model a predictive function f : (x, p) → y depending on a parameter vector p, relating an observation x to its label y. In our application, x ∈ Rm is a vector with m quantitative brain measurements from the


image and y ∈ R is the age of the subject. The parameters p are estimated by using a training set S^T = {s_1, s_2, . . . , s_{N_train}}, where each sample s = (x, y) is a pair of a feature vector and its associated true label. Once the parameters are estimated, we can predict the label ỹ for a new observation x̃ with ỹ = f(x̃, p*), where the prediction depends on the estimated parameters and therefore the training dataset. In our scenario, the samples to be included in the training set S^T are selected from a large source set S = {h_1, h_2, . . . , h_{N_total}} containing hidden samples of the form h = {x̂, ŷ, m}. Each h contains hidden features x̂ and label ŷ that can only be revealed after processing the sample. In addition, each hidden sample possesses a d-dimensional vector of metadata m ∈ Z^d that encodes characteristics of the patient or the image such as sex, diagnosis, and dataset of origin. In contrast to x̂ and ŷ, m is known a priori and can be observed at no cost. To include a sample h from set S into S^T, first its features and labels have to be revealed, which comes at a high cost. Consequently, we would like to find a sampling strategy that minimizes the cost by selecting only the most relevant samples according to the metadata m.

2.2 Multiple Partitions of the Source Data

In order to guide our sample selection algorithm, we create multiple partitions of the source dataset, where each one considers different information from the metadata m. Considering the j-th meta information (1 ≤ j ≤ d), we create the j-th partition S = ∪_{i=1}^{η_j} C_i^j, with η_j a predefined number of bins for m[j]. As a concrete example, sex could be used for partitioning the data, so S = C_female^sex ∪ C_male^sex and η_sex = 2. In the case of continuous variables such as age, partitions can be done by quantizing the variable into bins. All the clusters generated using different meta information are merged into a set of clusters C = {C_i^j}. Since partitions can be done using different elements of m, a sample can be assigned to more than one cluster. We hypothesize that, given this partitioning, there exist clusters C_i ∈ C that contain more relevant samples than others for a specific task. Intuitively, we would like to draw samples h from clusters with a higher probability of returning a relevant sample. However, since the relationship between the metadata and the task is uncertain, the utility of each cluster for a specific task is unknown beforehand. We will now describe a strategy that simultaneously explores the clusters to find out which ones contain more relevant information and exploits them by extracting as many samples from relevant clusters as possible.
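A small sketch of how such a cluster set C could be assembled from a metadata table; the column names, the pandas-based representation, and the binning choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import pandas as pd

def build_clusters(meta, n_bins=4):
    """Build the set of candidate clusters C from a metadata table.

    meta: DataFrame with one row per hidden sample and columns such as
    'sex', 'diagnosis', 'dataset', 'age' (column names are illustrative).
    Returns a dict mapping (column, value) -> list of sample indices;
    a sample may appear in several clusters, one per metadata column.
    """
    clusters = {}
    for col in meta.columns:
        values = meta[col]
        if np.issubdtype(values.dtype, np.number):      # continuous -> quantize into bins
            values = pd.cut(values, bins=n_bins, labels=False)
        for val in pd.unique(values):
            clusters[(col, val)] = list(meta.index[values == val])
    return clusters
```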

2.3 Sample Selection as a Multi-armed Bandit Problem

We model the task of sequential sample selection as a multi-armed bandit problem. At each iteration t, a new sample is added to the training dataset S T . For adding a sample, the algorithm decides which cluster Ci ∈ C to exploit and randomly draws a training sample st from cluster Ci . The corresponding feature vector xt and label yt are revealed and the usefulness of the sample st for the given task is evaluated, yielding a reward rt ∈ {−1, 1}. A reward rt = 1 is given


if adding the sample improves the prediction accuracy of the model and r_t = −1 otherwise. At t = 0, we do not possess knowledge about the utility of any cluster. This knowledge is incrementally built as more and more samples are drawn and their rewards are revealed. To this end, each cluster is assigned a distribution of rewards Π_i. With every sample the distribution better approximates the true expected reward of the cluster, but every new sample also incurs a cost. Therefore, a strategy needs to be designed that explores the distribution for each of the clusters, while at the same time exploiting as often as possible the most rewarding sources. To solve the problem of selecting from which C_i to sample at every iteration t, we follow a strategy based on Thompson sampling [17] with binary rewards. In this setting, the expected rewards are modeled using a probability P_i following a Bernoulli distribution with parameter π_i ∈ [0, 1]. We maintain an estimate of the likelihood of each π_i given the number of successes α_i and failures β_i observed for the cluster C_i so far. Successes (r = 1) and failures (r = −1) are defined based on the reward of the current iteration. It can be shown that this likelihood follows the conjugate distribution of a Bernoulli law, i.e., a Beta distribution Beta(α_i, β_i), so that

P(π_i | α_i, β_i) = [Γ(α_i + β_i) / (Γ(α_i) Γ(β_i))] · π_i^{α_i − 1} (1 − π_i)^{β_i − 1},    (1)

with the gamma function Γ. At each iteration, π̂_i is drawn from each cluster distribution P_i, and the cluster with the maximum π̂_i is chosen. The procedure is summarized in Algorithm 1.

Algorithm 1. Thompson Sampling for Sample Selection
1: α_i = 1, β_i = 1, ∀i ∈ {1, . . . , N}
2: for t = 1, 2, . . . do
3:   for i = 1, . . . , N do
4:     Draw π̂_i from Beta(α_i, β_i)
5:   Reveal sample h_t = {x_t, y_t, m_t} from cluster C_j where j := arg max_i π̂_i
6:   Add sample h_t to S^T and remove it from all clusters
7:   Obtain new model parameters p* from updated training set S^T
8:   Compute reward r_t based on new prediction ỹ = f(x, p*)
9:   if r_t == 1 then α_j = α_j + 1
10:  else β_j = β_j + 1
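A compact Python sketch of Algorithm 1; the cluster representation and the reward callback (e.g., +1 when the validation r² improves after retraining the regressor with the new sample, as described in the experiments) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def thompson_select(clusters, reward_fn, n_iter, seed=0):
    """Sequentially pick training samples with Thompson sampling (Algorithm 1).

    clusters  : dict mapping cluster id -> mutable list of available sample indices.
    reward_fn : callable(sample_index) -> +1 or -1, e.g. +1 if validation accuracy improves.
    """
    rng = np.random.default_rng(seed)
    ids = list(clusters)
    alpha = {c: 1.0 for c in ids}   # Beta prior: one pseudo-success
    beta = {c: 1.0 for c in ids}    # and one pseudo-failure per cluster
    selected = []
    for _ in range(n_iter):
        candidates = [c for c in ids if clusters[c]]           # skip exhausted clusters
        if not candidates:
            break
        draws = {c: rng.beta(alpha[c], beta[c]) for c in candidates}
        best = max(draws, key=draws.get)                       # arm with largest sampled pi
        sample = clusters[best][rng.integers(len(clusters[best]))]
        for c in ids:                                          # a sample may sit in several clusters
            if sample in clusters[c]:
                clusters[c].remove(sample)
        selected.append(sample)
        if reward_fn(sample) == 1:
            alpha[best] += 1.0
        else:
            beta[best] += 1.0
    return selected
```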

3 Results

In order to showcase the advantages of the multi-armed bandit sampling algorithm (MABS), we evaluate our method in estimating the biological age of a subject given a set of volume and thickness features of the brain. We choose this task because of the big number of available brain scans in public databases and


the relevance of age estimation as a diagnostic tool for neurodegenerative diseases [18]. For predicting the age, we reconstruct brain scans with FreeSurfer [5] and extract volume and thickness measurements to create feature vectors x. Based on these features, we train a regression model for predicting the age of previously unseen subjects.

3.1 Data

We work on MRI T1 brain scans from 10 large-scale public datasets: ABIDE [3], ADHD200 [14], AIBL [4], COBRE [13], IXI1, GSP [2], HCP [19], MCIC [8], PPMI [12] and OASIS [11]. From all of these datasets, we obtain a total number of 7,250 images, which is to the best of our knowledge the largest dataset ever used for brain age prediction. Since each one of these datasets is targeted towards different applications, the selected population is heterogeneous in terms of age, sex, and health status. For the extraction of thickness and volume measurements, we process the images with FreeSurfer. Even though this is a fully automatic tool, the feature extraction is a computationally intensive task, which is by far the bottleneck of our age prediction regression model.

3.2 Age Estimation

We perform age estimation on two different testing scenarios. In the first, we create a testing dataset by randomly selecting subsets from all the datasets. The aim of this experiment is to show that our method is capable of selecting samples that will create a model that can generalize well to a heterogeneous population. In the second scenario, the testing dataset corresponds to a single dataset. In this scenario, we show that the sample selection permits tailoring the training dataset to a specific target dataset. Experiment 1. For the first experiment we take all the images in the dataset and we divide them randomly into three sets: (1) a small validation set of 2% of all samples to compute the rewards given to MABS, (2) a large testing set of 48% to measure the performance of our age regression task, and (3) a large hidden training set of 50%, from which samples are taken sequentially using MABS. We perform the sequential sample selection described in Algorithm 1 using the following metadata to construct the clusters C: age, dataset, diagnosis, and sex. We experiment with considering all of the metadata separately, to investigate the importance of each one, and the joint modeling considering all partitions at once. We opted to use ridge regression as our learning algorithm because of its fast training and good performance for our task, but other regression models can be easily plugged into our method. Rewards r are given to each bandit by estimating and observing if the r2 score of the prediction in the validation set increases. It is important to emphasize that the testing set is not observed by the bandits in the process of giving rewards. Every experiment is repeated 20 times using different 1

http://brain-development.org/ixi-dataset/.


random splits and the mean results are shown. We compare with two baselines: the first one (RANDOM) consists of obtaining samples at random from the hidden set and adding them sequentially to the training set. As a second baseline (AGE PRIOR), we add samples sequentially by following the age distribution of the testing set. The results of this first experiment are shown in Fig. 1 (top left). In almost all of the cases, using MABS as a selection strategy performed better than the baselines. Notably, an increase in performance is obtained not only when the relationships between the metadata and the task are direct, like in the case of the clusters constructed by age, but also when this relationship is not clear, like in the case of clustering the images using only dataset or diagnostic information. Another important aspect is that even when the meta information is not informative, like in the case of the clusters generated by sex, the prediction using MABS is not affected.

Fig. 1. Results of our age prediction experiments in terms of r2 score. A comparison is made between MABS using different strategies to build the clusters C, a random selection of samples, and a random selection based on the age distribution of the test data. To improve the presentation of the results, we limit the plot to 4,000 samples.

Experiment 2. For our second experiment, we perform age estimation with the test data being a specific dataset. This experiment follows the same methodology as the previous one with the important difference of how the datasets are split. This time the split is done by choosing: (1) a small validation set, taken only from the target dataset, (2) a testing set, which corresponds to the remaining samples in the target dataset not included in the validation set, and (3) a hidden dataset containing all the samples from the remaining datasets. The goal of this experiment is to show that our approach can be applied to selecting samples


according to a specific population and prediction task. Figure 1 shows the results for three different target datasets. We observe that bandits operating on single metadata like diagnosis or dataset can perform very well for the sample selection. However, the best metadata is different for each of the presented datasets. We also observe that MABS using all available metadata extracts informative samples more efficiently than the baselines and always close to the best performing single metadata MABS. This strengthens our hypothesis that it is difficult to define an a priori relationship between the metadata and the task. Consequently, it is a better strategy to pass all the metadata from multiple sources to MABS and let it select the most relevant information.

4 Conclusion

We have proposed a method for efficiently and intelligently sampling a training dataset from a large pool of data. The problem was formulated as a reinforcement learning problem, where the training dataset was sequentially built after evaluating a reward function at every step. Concretely, we used a multi-armed bandit model that was solved with Thompson sampling. The intelligent selection considered metadata of the scan to construct a distribution over the expected reward of a training sample. Our results showed that the selective sampling approach leads to higher accuracy than using all the data, while requiring less time for processing the data. We demonstrated that our technique can either be used to build a general model or to adapt to a specific target dataset, depending on the composition of the test dataset. Since our method does not require observing the information contained in the images, it could also be applied to predict useful samples even before the images are acquired, guiding the recruitment of subjects.

Acknowledgements. This work was supported in part by SAP SE, the Faculty of Medicine at LMU (FöFoLe), and the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B).

References 1. Bouneffouf, D., Laroche, R., Urvoy, T., Feraud, R., Allesiardo, R.: Contextual bandit for active learning: active thompson sampling. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds.) ICONIP 2014. LNCS, vol. 8834, pp. 405– 412. Springer, Cham (2014). doi:10.1007/978-3-319-12637-1 51 2. Buckner, R., Hollinshead, M., Holmes, A., Brohawn, D., Fagerness, J., O’Keefe, T., Roffman, J.: The brain genomics superstruct project. Harvard Dataverse Network (2012) 3. Di Martino, A., Yan, C., et al.: The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol. Psychiatry 19(6), 659–667 (2014) 4. Ellis, K., Bush, A., Darby, D., et al.: The Australian imaging, biomarkers and lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of alzheimer’s disease. Int. Psychogeriatr. 21(04), 672–687 (2009)


5. Fischl, B., Salat, D., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., Dale, A.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33(3), 341–355 (2002) 6. Franke, K., Luders, E., May, A., Wilke, M., Gaser, C.: Brain maturation: predicting individual brainage in children and adolescents using structural mri. Neuroimage 63(3), 1305–1312 (2012) 7. Franke, K., Ziegler, G., Kl¨ oppel, S., Gaser, C., Alzheimer’s Disease Neuroimaging Initiative: Estimating the age of healthy subjects from t 1-weighted mri scans using kernel methods: Exploring the influence of various parameters. Neuroimage 50(3), 883–892 (2010) 8. Gollub, R.L., Shoemaker, J., King, M., White, T., Ehrlich, S., Sponheim, S., Clark, V., Turner, J., Mueller, B., Magnotta, V., et al.: The mcic collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11(3), 367–388 (2013) 9. Hoi, S., Jin, R., Zhu, J., Lyu, M.: Batch mode active learning and its application to medical image classification. In: ICML, pp. 417–424. ACM (2006) 10. Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intel. Data Anal. 6(5), 429–449 (2002) 11. Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. J. Cognitive Neurosci. 19(9), 1498–1507 (2007) 12. Marek, K., Jennings, D., Lasch, S., Siderowf, A., Tanner, C., Simuni, T., Coffey, C., Kieburtz, K., Flagg, E., Chowdhury, S., et al.: The parkinson progression marker initiative (PPMI). Prog. Neurobiol. 95(4), 629–635 (2011) 13. Mayer, A., Ruhl, D., Merideth, F., Ling, J., Hanlon, F., Bustillo, J., Ca˜ nive, J.: Functional imaging of the hemodynamic sensory gating response in schizophrenia. Hum. Brain Mapp. 34(9), 2302–2312 (2013) 14. Milham, M.P., Fair, D., Mennes, M., Mostofsky, S.H., et al.: The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers Syst. Neurosci. 6, 62 (2012) 15. Pan, S., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010) 16. Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 90(2), 227–244 (2000) 17. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294 (1933) 18. Valizadeh, S., H¨ anggi, J., M´erillat, S., J¨ ancke, L.: Age prediction on the basis of brain anatomical measures. Hum. Brain Mapp. 38(2), 997–1008 (2017) 19. Van Essen, D.C., Smith, S.M., Barch, D.M., Behrens, T., Yacoub, E., Ugurbil, K., WU-Minn HCP Consortium, et al: The WU-Minn human connectome project: an overview. Neuroimage 80, 62–79 (2013) 20. Wachinger, C., Reuter, M.: Domain adaptation for alzheimer’s disease diagnostics. Neuroimage 139, 470–479 (2016) 21. Zhu, Y., Zhang, S., Liu, W., Metaxas, D.N.: Scalable histopathological image analysis via active learning. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8675, pp. 369–376. Springer, Cham (2014). doi:10.1007/978-3-319-10443-0 47

Multi-level Multi-task Structured Sparse Learning for Diagnosis of Schizophrenia Disease

Mingliang Wang1, Xiaoke Hao1, Jiashuang Huang1, Kangcheng Wang2, Xijia Xu3, and Daoqiang Zhang1(B)

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
[email protected]
2 Department of Psychology, Southwest University, Chongqing 400715, China
3 Department of Psychiatry, Nanjing Brain Hospital of Nanjing Medical University, Nanjing University, Nanjing 210029, China

Abstract. In recent studies, multi-frequency band analysis has attracted increasing attention for the diagnosis of schizophrenia (SZ). However, most existing feature selection methods designed for multi-frequency band analysis do not take into account the inherent structure (i.e., both frequency specificity and complementary information) of multi-frequency bands in the model, and are thus limited to identifying the discriminative feature subset in a single step. To address this problem, we propose a multi-level multi-task structured sparse learning (MLMT-TS) framework to explicitly consider the common features with a hierarchical structure. Specifically, we introduce two regularization terms in the hierarchical framework to enforce the common features across different bands and the specificity of individual bands. Then, the selected features are used to construct multiple support vector machine (SVM) classifiers. Finally, we adopt an ensemble strategy to combine the outputs of all SVM classifiers to achieve the final decision. Our method has been evaluated on 46 subjects, and the superior classification results demonstrate the effectiveness of our proposed method as compared to other methods.

1 Introduction

Schizophrenia is one of the most common chronic and devastating mental disorders, affecting about 1% of the population worldwide [1]. Until now, the pathological mechanism of schizophrenia has remained unclear and there is no definitive standard for its diagnosis. However, it has been reported that there are significant changes in the structure, function and metabolism of the brain in schizophrenia patients [2]. Moreover, some related studies have suggested that resting-state


functional analyses are beneficial for obtaining more complete information about functional connectivity. In fact, most resting-state functional magnetic resonance imaging (RS-fMRI) studies have examined spontaneous low-frequency oscillations (LFO) in a specific frequency band of 0.01–0.1 Hz [3]. Therefore, we can use the RS-fMRI data for whole-brain analysis in this study. In addition, owing to the complexity of schizophrenia itself, some studies have reported mixed results or even opposite conclusions [3], which may be due to the different frequency bands used in these studies. Hence, it has recently been recognized that disorder-specific neural changes could be restricted to specific frequency bands. Moreover, more and more studies have indicated that taking the higher-frequency bands into consideration is helpful for measuring the intrinsic brain activity of schizophrenia. Therefore, in this study we decompose the RS-fMRI LFO into four distinct frequency bands based on prior work: slow-5 (0.01–0.027 Hz), slow-4 (0.027–0.073 Hz), slow-3 (0.073–0.198 Hz), and slow-2 (0.198–0.25 Hz) [4].

To exploit potential information sharing among different frequency bands, instead of treating each frequency band as a single-task classification problem, the multi-task learning (MTL) paradigm learns several related tasks simultaneously to improve performance [5]. However, the main limitation of existing multi-task works is that all tasks are considered at a single level, which may miss some relevant features shared by a smaller group of tasks. Meanwhile, a single-level method may not be sufficient to model such complex structure information in schizophrenia studies. Accordingly, in this paper, we propose a multi-level multi-task structured sparse (MLMT-TS) learning method to explicitly model the structure information of multi-frequency data for the diagnosis of schizophrenia. In our hierarchical framework, an l1,1-norm is used to induce sparsity and select band-specific features, while a new regularization term is introduced to capture the common features across different bands. Hence, contrary to the single-level manner, the hierarchical framework gradually enforces different levels of feature sharing to model the complex structure information efficiently.

This work was supported in part by the National Natural Science Foundation of China (61422204; 61473149) and the NUAA Fundamental Research Funds (No. NE2013105).
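As a rough illustration of the band decomposition described above, the sketch below band-pass filters a single voxel time series into the four slow bands with a Butterworth filter; the repetition time (TR = 2 s), the filter order, and the scipy-based implementation are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

TR = 2.0              # hypothetical repetition time (s)
FS = 1.0 / TR         # sampling frequency (Hz)

# The four frequency bands discussed above (Hz).
BANDS = {
    "slow-5": (0.01, 0.027),
    "slow-4": (0.027, 0.073),
    "slow-3": (0.073, 0.198),
    "slow-2": (0.198, 0.25),
}

def bandpass(ts, low, high, fs=FS, order=3):
    """Zero-phase Butterworth band-pass filter of a 1-D voxel time series."""
    nyq = 0.5 * fs
    high = min(high, nyq * 0.99)          # keep the upper edge strictly below Nyquist
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, ts)

def decompose_bands(ts):
    """Return one filtered copy of the series per slow band."""
    return {name: bandpass(ts, lo, hi) for name, (lo, hi) in BANDS.items()}

# Example with a synthetic time series of 200 volumes.
ts = np.random.randn(200)
band_signals = decompose_bands(ts)
```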

2 Method

Data and Pre-processing: In this study, we use 46 subjects in total from the Department of Psychiatry, Affiliated Nanjing Brain Hospital of Nanjing Medical University. Among them, 23 are schizophrenia patients and the remaining 23 are normal controls (NC). All subjects' RS-fMRI images were processed as described in [3], and after preprocessing the fractional amplitude of LFO was calculated using the REST software (http://www.restfmri.net). Because the RS-fMRI image size is 61 × 73 × 61, voxel-based features are too high-dimensional and noisy to be used directly for disease diagnosis. Thus, we adopt a simple and effective way to extract more relevant and discriminative features for neuroimage analysis and classification. We first utilize a patch-based method with patch size 3 × 3 × 3 voxels to divide the whole brain into candidate patches. Then, we perform the t-test on the

(Fig. 1 components: functional data decomposed into four bands (slow-5, slow-4, slow-3, slow-2); multi-frequency band feature extraction; multi-level (1st, 2nd, ..., H-th level) multi-task structured sparse feature selection; SVM classifiers 1-4; classifier ensemble.)

Fig. 1. The framework of the proposed classification algorithm.

candidate patches and select the significant patches with p-values smaller than 0.05. Finally, we calculate the mean of each selected patch and treat it as the feature of that patch.

Multi-level Multi-task Structured Sparse Learning Model: The framework of the proposed method is illustrated in Fig. 1. After the patch-based multi-frequency band features are extracted from the RS-fMRI data, our multi-level multi-task structured sparse (MLMT-TS) method is used to select the most relevant and discriminative features. The features are selected in an iterative fashion: the method starts at a low level, where feature sharing among all tasks is induced, and gradually enhances the incentive to share in successive levels; the band-specific features, in contrast, are induced starting from a high level, with the incentive gradually decreased. In addition, the learned coefficient matrix corresponding to each level is forwarded to the next level for further learning in the same manner. In such a hierarchical manner, we gradually select the most discriminative features in order to sufficiently utilize both the complementary and the band-specific information of the multi-frequency bands. The selected features are then used to train SVM classifiers for schizophrenia classification. Finally, we adopt an ensemble classification strategy, a simple and effective classifier fusion method, to combine the outputs of the different SVM classifiers into a final decision.

In the following, we explain in detail how the MLMT-TS feature selection method works. Assume that there are M supervised learning tasks, corresponding to the number of frequency bands, and that the training data matrix of the m-th task from N


training subjects is denoted by $X_m = [x_{m,1}, x_{m,2}, \ldots, x_{m,N}]^T \in \mathbb{R}^{N \times d}$, and $Y = [y_1, y_2, \ldots, y_N]^T \in \mathbb{R}^{N}$ is the response vector from these training subjects, where $x_{m,n}$ is the feature vector of the $n$-th subject and $y_n$ is the corresponding class label. Denote the coefficient matrix as $W = [w_1, \ldots, w_M] \in \mathbb{R}^{d \times M}$, where $w_m \in \mathbb{R}^d$ is a linear discriminant function for task $m$; we assume the bias term $b$ is absorbed into $W$. As mentioned above, the coefficient matrix is forwarded to subsequent learning to share more structure information. Hence, we decompose the coefficient matrix into $H$ components, where each hierarchy can capture the level-specific task group structure. Specifically, the coefficient matrix $W$ is defined as

$$W = \sum_{h=1}^{H} W_h \qquad (1)$$

where $W_h = [w_{h,1}, \ldots, w_{h,M}] \in \mathbb{R}^{d \times M}$ is the coefficient matrix corresponding to the $h$-th level, and $w_{h,m}$ is the $m$-th column of $W_h$. The objective function of the MLMT-TS feature selection method can then be written as

$$\min_{W} \; \frac{1}{2}\sum_{m=1}^{M}\Big\| Y - X_m \sum_{h=1}^{H} w_{h,m} \Big\|_2^2 + R_t(W) + R_s(W) \qquad (2)$$

where $R_t(W)$ and $R_s(W)$ are the structure regularization term and the $\ell_{1,1}$-norm regularization term, respectively, defined as

$$R_t(W) = \sum_{h=1}^{H} \lambda_h \sum_{p<q}^{M} \| w_{h,p} - w_{h,q} \|_2, \qquad R_s(W) = \sum_{h=1}^{H} \beta_h \| W_h \|_{1,1} \qquad (3)$$

where $\lambda_h$ and $\beta_h$ are nonnegative regularization parameters controlling, respectively, the strength of feature sharing across bands and the sparsity at level $h$. It is worth noting that, when $\beta_1 = 0$, our method reduces to the multi-level task grouping method (MLMT-T) [6]. Also, when $\lambda_1 = 0$, our method reduces to a multi-level Lasso method (MLMT-S). Below, we develop a new method for optimizing the objective function in Eq. (2).
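To make the roles of the two regularizers concrete, the following minimal sketch evaluates the objective of Eq. (2) for a given set of level-wise coefficient matrices; the array layout, the function name, and the plain-Python loops are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mlmt_ts_objective(Y, X_list, W_levels, lam, beta):
    """Evaluate the MLMT-TS objective of Eq. (2) for given level-wise coefficients.

    Y        : (N,) response vector
    X_list   : list of M arrays, each (N, d), one per frequency band (task)
    W_levels : array (H, d, M), level-wise coefficient matrices W_h
    lam, beta: length-H arrays of regularization parameters (lambda_h, beta_h)
    """
    H, d, M = W_levels.shape
    W = W_levels.sum(axis=0)                       # W = sum_h W_h  (Eq. 1)

    # Data-fitting term: 0.5 * sum_m ||Y - X_m w_m||_2^2
    fit = 0.5 * sum(np.sum((Y - X_list[m] @ W[:, m]) ** 2) for m in range(M))

    # R_t: pairwise fusion penalty encouraging cross-band sharing at each level
    r_t = 0.0
    for h in range(H):
        for p in range(M):
            for q in range(p + 1, M):
                r_t += lam[h] * np.linalg.norm(W_levels[h, :, p] - W_levels[h, :, q])

    # R_s: level-wise l_{1,1} sparsity penalty
    r_s = sum(beta[h] * np.abs(W_levels[h]).sum() for h in range(H))
    return fit + r_t + r_s
```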


Optimization Algorithm: To optimize the objective function in Eq. (2), we propose a top-down iterative scheme, in which problem (2) is decomposed into several sub-problems, one per level:

$$\min_{W_h} \; \frac{1}{2}\sum_{m=1}^{M}\| Y - X_m w_{h,m} \|_2^2 + \lambda_h \sum_{p<q}^{M}\| w_{h,p} - w_{h,q}\|_2 + \beta_h \|W_h\|_{1,1} \qquad (4)$$

$$\rho'_\lambda(t) = \lambda\Big\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \Big\} \qquad (5)$$

with ρλ(0) = 0 and (z)+ = max(z, 0), where I is the indicator function and λ and a are model parameters. As usual, a = 3.7 is used [8]. Figure 1(a) shows the SCAD penalty (blue) and the l1 penalty (red). To better understand the behavior of the SCAD penalty, consider the penalized least-squares problem min_β (z − β)^2 + P(β), where P(β) is chosen as the LASSO or the SCAD penalty. The solution is unique, β̂ = S_λ(z), where S_λ is a thresholding function. Figure 1 displays the thresholding functions for LASSO (b) and SCAD (c) with λ = 2. We notice that the SCAD penalty shrinks small coefficients to zero while keeping large coefficients intact, whereas the l1 penalty tends to shrink


Fig. 1. Illustration of SCAD penalty: (a) SCAD (blue) and l1 (red) penalty functions; thresholding function with l1 (b) and SCAD (c) penalty and λ = 2.


all coefficients. This unbiasedness of the SCAD penalty comes from the fact that ρ'λ(t) = 0 when t is large enough. Extending the SCAD definition to vector data and to discrete gradients of the coefficients, we define the combined SCAD and SCADTV penalties as:

$$P_{SCAD} = \sum_{l=1}^{m} \rho_\lambda(|\beta_l|) \qquad (6)$$

$$P_{SCADTV} = \sum_{l=1}^{m} \big[\rho_\lambda(|\nabla_i \beta_l|) + \rho_\lambda(|\nabla_j \beta_l|) + \rho_\lambda(|\nabla_k \beta_l|)\big] \qquad (7)$$

where (i, j, k) denotes the three orthogonal dimensions of the image data, as in the definition of the TV norm. Similar to SCAD, SCADTV shrinks small gradients, encouraging neighboring coefficients to take the same values, but leaves large gradients unchanged. We propose three types of penalty functions that are compared with the classic P_{l1+TV} model in the context of logistic regression classification: P_{SCAD+SCADTV}, P_{l1+SCADTV} and P_{SCAD+TV}.
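The following sketch implements the standard SCAD penalty and its thresholding operator in the Fan-Li form, together with the l1 (soft) thresholding shown for comparison; it is a hedged illustration of the behaviour discussed above (the thresholding is written for the ½(z − b)² formulation), not the authors' code.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty rho_lambda(|t|), applied element-wise."""
    t = np.abs(np.asarray(t, dtype=float))
    quad = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))   # lam < t <= a*lam
    flat = lam ** 2 * (a + 1) / 2                                  # t > a*lam (constant)
    return np.where(t <= lam, lam * t, np.where(t <= a * lam, quad, flat))

def scad_threshold(z, lam, a=3.7):
    """Minimizer of 0.5*(z - b)^2 + rho_lambda(|b|): the SCAD thresholding operator."""
    z = np.asarray(z, dtype=float)
    az, sgn = np.abs(z), np.sign(z)
    small = sgn * np.maximum(az - lam, 0.0)                 # soft-threshold region
    mid = ((a - 1) * z - sgn * a * lam) / (a - 2)           # partial shrinkage region
    return np.where(az <= 2 * lam, small, np.where(az <= a * lam, mid, z))

def soft_threshold(z, lam):
    """l1 (LASSO) thresholding, shown for comparison: shrinks every coefficient."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```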

2.4 Optimization and Parameter Tuning

Note that the SCAD penalty, unlike l1 and TV, is not convex. We solve this problem using ADMM [4], which has been applied successfully to convex problems; it was recently shown [18] that several nonconvex ADMM formulations, including SCAD, are guaranteed to converge. The tuning parameters λ and γ are chosen by the generalized information criterion (GIC).
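A minimal ADMM skeleton for SCAD-penalized logistic regression is sketched below, assuming the scad_threshold helper from the previous sketch is in scope; the splitting, the fixed ADMM penalty parameter of 1 (so that the z-update is exactly the SCAD thresholding), and the inner gradient steps are illustrative choices, not the exact solver of [4,18].

```python
import numpy as np

def logistic_grad(beta, X, y):
    """Gradient of sum_i log(1 + exp(-y_i * x_i^T beta)), with y in {-1, +1}."""
    margins = y * (X @ beta)
    return -(X.T @ (y / (1.0 + np.exp(margins))))

def admm_scad_logistic(X, y, lam, n_iter=100, inner_steps=20, lr=1e-3):
    """ADMM sketch: minimize f(beta) + rho_lam(z) subject to beta = z."""
    n, p = X.shape
    beta, z, u = np.zeros(p), np.zeros(p), np.zeros(p)   # u is the scaled dual variable
    for _ in range(n_iter):
        # beta-update: a few gradient steps on f(beta) + 0.5*||beta - z + u||^2
        for _ in range(inner_steps):
            g = logistic_grad(beta, X, y) + (beta - z + u)
            beta -= lr * g
        # z-update: element-wise SCAD proximal step (exact because the penalty
        # parameter of the augmented Lagrangian is fixed to 1 in this sketch)
        z = scad_threshold(beta + u, lam)
        # dual update
        u += beta - z
    return z
```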

3 Experimental Results

3.1 Synthetic Data

Medical imaging data has no available ground truth on the significant anatomical regions discriminating two populations. We therefore generated synthetic data x_i of size 32 × 32 × 8 containing four 8 × 8 × 4 foreground blocks with high spatial coherence (see Fig. 2). Background values are generated from a normal distribution N(0, 1), while the correlated values inside the four blocks are drawn from a multinormal distribution N(0, Σ_r), with r ∈ {0.25, 0.5}. The coefficient vector β has fixed values of 0 outside the four blocks and piecewise smooth values inside, with increasing signal strength in the following order: top-left, top-right, bottom-right, bottom-left. Binary labels y_i are then assigned based on the logistic probability following a Bernoulli distribution. Figure 2 (top-left) presents a 2D slice of the synthetic data and (bottom-left) presents a 3D view of the nonzero coefficients. Each dataset contains n = 300 subjects, making the data matrix X of size n × 8192. For each coherence value r we repeated the test 96 times.
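A toy generator in the spirit of this simulation is sketched below; the block positions, the exchangeable form of Σ_r, and the unit coefficients inside the blocks are assumptions made for illustration (the paper uses piecewise smooth coefficients of increasing strength).

```python
import numpy as np

def make_synthetic(n=300, shape=(32, 32, 8), block=(8, 8, 4), r=0.25, seed=0):
    """Toy version of the simulated data: correlated foreground blocks + logistic labels."""
    rng = np.random.default_rng(seed)
    p = int(np.prod(shape))
    mask = np.zeros(shape, dtype=bool)
    corners = [(0, 0, 0), (0, 24, 0), (24, 24, 4), (24, 0, 4)]      # hypothetical layout
    for cx, cy, cz in corners:
        mask[cx:cx + block[0], cy:cy + block[1], cz:cz + block[2]] = True

    X = rng.standard_normal((n, p))                                 # N(0,1) background
    k = int(mask.sum())
    Sigma = np.full((k, k), r) + (1 - r) * np.eye(k)                # exchangeable correlation
    X[:, mask.ravel()] = rng.multivariate_normal(np.zeros(k), Sigma, size=n)

    beta = np.zeros(p)
    beta[mask.ravel()] = 1.0                          # stand-in for piecewise smooth coefficients
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))          # logistic probability
    y = rng.binomial(1, prob) * 2 - 1                 # Bernoulli labels in {-1, +1}
    return X, y, beta
```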


(Fig. 2 panels: 2D view ground truth, 3D view ground truth, SCAD+SCADTV, SCAD+TV, L1+SCADTV, L1+TV.)

Fig. 2. (top-left) shows a 2D slice of the ground truth coefficients for simulated data. (bottom-left) shows a 3D view of the ground truth nonzero coefficients. The following figures show significant regions on synthetic data detected by the 4 methods. Shaded gray regions correspond to the true nonzero coefficients and the red regions are calculated from the estimated nonzero coefficients averaged over 96 trials.

3.2 Neuroimaging Data

Our neuroimaging data belongs to an in-house multiple sclerosis (MS) study. Following recent research that suggests a possible pivotal role for iron in MS [15], we are investigating whether iron in deep gray matter is a potential biomarker of disability in MS. High-field (4.7T) quantitative transverse relaxation rate (R2*) images are used, as they have been shown to be highly influenced by non-heme iron [14]. Sample R2* slices can be viewed in Fig. 4 (top). The focus is on subcortical deep gray matter structures: caudate, putamen, thalamus and globus pallidus. Forty subjects with relapsing-remitting MS (RRMS) and 40 age- and gender-matched controls were recruited. Ethical approval and informed consent were obtained. Prior to analysis, the MRI data is pre-processed and aligned with an in-house unbiased template using ANTs [1]. The multimodal template is built from 10 healthy controls using both T1w and R2*. Pre-processing involves intra-subject alignment of R2* with T1w and bias field intensity normalization for T1w [17]. Nonlinear registration in the template space is done using SyN [3]. Aligned R2* values are used as iron-related measurements. The measurement row vectors x_i of size 158865 are formed by selecting only voxels inside a deep gray matter mask manually traced on the atlas.

3.3 Evaluation Methodology

We compare the performance of the four penalized logistic regression models described in Sect. 2: SCAD+SCADTV, SCAD+TV, l1+SCADTV and


Fig. 3. Results for synthetic experiments (a) Classification scores for noise level r = 0.25. (b) Dice scores between ground truth and estimated nonzero coeff. (c) Sum of Absolute Error (SAE) between ground truth and estimated coeff. for r = 0.25, 0.5.

l1+TV. Training and test data are selected for each of the 96 synthetic datasets (200 training and 100 test samples) and for the real data (5-fold cross-validation). Results are reported on the test data using the β coefficients computed on the training data. The sparse regions are selected from all nonzero coefficients. Classification results are evaluated using accuracy (proportion of correctly classified samples), sensitivity (true positive rate), specificity (true negative rate) and the area under the receiver operating characteristic (ROC) curve (AUC). Variable selection accuracy compared to ground truth for the synthetic data is evaluated using a dice score. We also compute the mean absolute error of the recovered vs. ground truth coefficients. For the real data, we measured the stability of the detected regions using a dice score between the estimated regions in each of the 5 folds (dice folds).
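For completeness, the dice score used for variable selection accuracy can be computed as in the short sketch below (a generic implementation, not tied to the authors' code).

```python
import numpy as np

def dice_score(support_a, support_b):
    """Dice overlap between two binary masks of selected (nonzero) coefficients."""
    a = np.asarray(support_a, dtype=bool)
    b = np.asarray(support_b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

# Example: selection accuracy of an estimate against the ground truth support.
# dice_score(beta_hat != 0, beta_true != 0)
```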

3.4 Results

Comparative results on synthetic data with two levels of coherence r ∈ {0.25, 0.5} for the multinormal distribution are reported in the bar graphs in Fig. 3(a), (b) and (c). When evaluating the classification accuracy in plot (a), results are comparable for all four methods, with a mean of about 94% for SCAD+SCADTV, SCAD+TV and L1+TV and slightly lower for L1+SCADTV. But, when looking at the accuracy of variable selection using the dice score (b) as well as the accuracy of the recovered sparse coefficients (c), we see that the SCAD+SCADTV penalizer is superior to the others. It achieves the highest dice score and the lowest SAE of the recovered coefficients. To visualize the results of the 96 trials, we average the estimated nonzero coefficients, as binary masks, and threshold at 0.2. Results as illustrated in Fig. 2 confirm the numerical evaluation, showing that the SCAD+SCADTV penalty gives the cleanest and closest-to-ground-truth variable selection results, while the l1+TV penalty achieves the worst performance.


Table 1. Results for real MRI data. Means on the 5 folds are reported. Class. rate = classification rate, Sens. = sensitivity, Spec. = specificity, AUC; Dice Folds = Dice score between detected sparse regions. Bold highlights best results among methods.

Method          Class. rate   Sens.   Spec.   AUC    Dice Folds
SCAD + SCADTV   0.75          0.75    0.75    0.75   0.79
SCAD + TV       0.71          0.71    0.75    0.75   0.67
L1 + SCADTV     0.76          0.76    0.77    0.75   0.68
L1 + TV         0.74          0.74    0.72    0.71   0.61

(Fig. 4 panels: SCAD + SCADTV, SCAD + TV, L1+SCADTV, L1+TV.)

Fig. 4. Illustration of the significant anatomy detected by the 4 methods using MRI data. Top: 2D axial slices with the R2* data as background; Bottom: a 3D view of the result. The deep gray matter mask used for selecting the voxels included in the observation vectors xi is contoured in white and the selected significant regions in red.

Comparative classification results on real neuroimaging MRI data are reported in Table 1. As ground truth on the selected sparse regions is not available for real data, we estimated the quality of the detected sparse regions using their stability over folds, measured with between-fold dice scores (Dice Folds). We report the average over the 10 distinct fold combinations. While classification results are comparable among the proposed penalizers, the results on stability of the detected regions clearly show that the new SCAD and SCADTV penalties achieve superior results. To visualize the results, Fig. 4 displays sample axial slices and a 3D view of the regions recovered by the four methods. The regions were calculated from all data with optimal parameters for each method. Most methods recover compact regions in very similar brain locations.

4 Discussion

We introduced a new penalty based on SCAD for variable selection in the context of sparse classification in high-dimensional neuroimaging data. While the SCAD penalty was proposed in the statistical literature to overcome the inherent bias of the l1 and TV penalties, it had not yet been used in medical imaging population studies. We experimentally showed on simulated and real MRI data that the proposed SCAD-based models are better at selecting the true nonzero coefficients and achieve higher accuracy. As part of our future work, we are looking at deriving theoretical results on coefficient bounds and the accuracy of variable selection for the SCAD-based models. Extending our work, similar penalizers could be used for regression or data representation (e.g., PCA, CCA).

References 1. ANTS (2011). http://www.picsl.upenn.edu/ants/ 2. Ashburner, J., Friston, K.: Voxel-based morphometry - the methods. NeuroImage 11(6), 805–821 (2000) 3. Avants, B., Epstein, C., Grossman, M., Gee, J.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12(1), 26–41 (2008) 4. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011) 5. Chopra, A., Lian, H.: Total variation, adaptive total variation and nonconvex smoothly clipped absolute deviation penalty for denoising blocky images. Pattern Recogn. 43(8), 2609–2619 (2010) 6. Davatzikos, C.: Why voxel-based morphometric analysis should be used with great caution when characterizing group differences. Neuroimage 23(1), 17–20 (2004) 7. Eickenberg, M., Dohmatob, E., Thirion, B., Varoquaux, G.: Grouping total variation and sparsity: statistical learning with segmenting penalties. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 685–693. Springer, Cham (2015). doi:10.1007/978-3-319-24553-9 84 8. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its Oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001) 9. Gramfort, A., Thirion, B., Varoquaux, G.: Identifying predictive regions from fMRI with TV-L1 prior. In: International Workshop on PRNI, pp. 17–20 (2013) 10. Grosenick, L., Klingenberg, B., Katovich, K., Knutson, B., Taylor, J.: Interpretable whole-brain prediction analysis with graphnet. Neuroimage 72, 304–21 (2013) 11. Kandel, B., Avants, B., Gee, J., Wolk, D.: Predicting cognitive data from medical images using sparse linear regression. In: IPMI, pp. 86–97 (2013) 12. Krishnapuram, B., Carin, L., Figueiredo, M., Hartemink, A.: Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 957–68 (2005) 13. Mehranian, A., Rad, H.S., Rahmim, A., Ay, M.R., Zaidi, H.: Smoothly clipped absolute deviation (SCAD) regularization for compressed sensing MRI using an augmented lagrangian scheme. Magn. Reson. Imaging 31(8), 1399–1411 (2013) 14. Schenck, J., Zimmerman, E.: High-field MRI of brain iron: birth of a biomarker? NMR Biomed. 17, 433–45 (2004)

An Unbiased Penalty for Sparse Classification

63

15. Stephenson, E., Nathoo, N., Mahjoub, Y., et al.: Iron in multiple sclerosis: roles in neurodegeneration and repair. Nat. Rev. Neurol. 10(8), 459–68 (2014) 16. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. Royal Stat. Soc. 58(1), 267–288 (1996) 17. Tustison, N., Avants, B., Cook, P., et al.: N4ITK: improved N3 bias correction. IEEE Trans. Med. Imaging 29(6), 1310–20 (2010) 18. Wang, Y., Yin, W., Zeng, J.: Global convergence for ADMM in nonconvex nonsmooth optimization, arXiv:1511.06324

Unsupervised Feature Learning for Endomicroscopy Image Retrieval

Yun Gu1,2,3, Khushi Vyas3, Jie Yang1,2(B), and Guang-Zhong Yang3

1 School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
2 Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
[email protected]
3 Hamlyn Centre for Robotic Surgery, Imperial College London, London, UK

Abstract. Learning the visual representation for medical images is a critical task in computer-aided diagnosis. In this paper, we propose Unsupervised Multimodal Graph Mining (UMGM) to learn the discriminative features for probe-based confocal laser endomicroscopy (pCLE) mosaics of breast tissue. We build a multiscale multimodal graph based on both pCLE mosaics and histology images. The positive pairs are mined via cycle consistency and the negative pairs are extracted based on geodesic distance. Given the positive and negative pairs, the latent feature space is discovered by reconstructing the similarity between pCLE and histology images. Experiments on a database with 700 pCLE mosaics demonstrate that the proposed method outperforms previous works on pCLE feature learning. Specifically, the top-1 accuracy in an eight-class retrieval task is 0.659, which is a 10% improvement over the state-of-the-art method. Keywords: Probe-based laser endomicroscopy · Histology · Graph mining · Feature learning

1 Introduction

Probe-based confocal laser endomicroscopy (pCLE) is a popular optical biopsy technique capable of in situ and in vivo imaging of tissue architecture with microscopic resolution. Using a flexible fiber bundle and miniaturized optics, it provides clinicians with real-time access to histological information during surgical procedures and has demonstrated promising sensitivities and specificities in various preclinical and clinical studies, including in the gastro-intestinal tract, urinary tract, breast and respiratory system. In surgical procedures, histopathological examination of biopsy samples by trained pathologists is the gold standard for disease diagnosis, grading and classification. Although pCLE enables the acquisition of in vivo microscopic images that closely resemble histology images, in vivo diagnosis is still challenging for many clinicians who have little histopathology expertise and training. Further, the high


variability in the appearance of pCLE images and the presence of atypical conditions make it difficult to provide an accurate diagnosis by manual identification. In the past few years, content-based medical image retrieval (CBMIR) techniques have been increasingly applied for accurate clinical diagnosis and assessment of medical images [1,2]. Given a query image, CBMIR systems return the most related examples from large databases, which can provide informative diagnosis/decision support to clinicians. The existing literature on CBMIR for pCLE images mainly focuses on learning discriminative visual features to improve retrieval accuracy, and can be divided into two categories: unimodal methods [3,4] and multimodal methods [5]. For unimodal methods, the discriminative features are learned based on pCLE images only. Examples include densely sampled SIFT [4] and metric learning based on SIFT [3] for CBIR tasks with pCLE videos. Although previous works demonstrate promising performance, pCLE systems have a limited field of view, particularly when compared to histology slides, which means that only a small number of morphological features can be visualized in each image. As a result, discriminating between benign, atypical and neoplastic lesions solely based on pCLE data can be difficult. On the other hand, a multimodal framework which includes both pCLE and histology images can bridge this gap and enhance the discriminative features of pCLE for accurate decision support. Recently, Gu et al. [5] demonstrated that a multimodal framework contributes to promising accuracy in pCLE mosaic classification. Although the multimodal method in [5] helps to enhance the discriminative features by using pCLE and histology images, a one-to-one correspondence is required to achieve the desired accuracies. As shown in Fig. 1, due to the huge differences in field of view, a pCLE mosaic only corresponds to a small region in the histology image. Although we can get an approximate location during scanning, it is still time-consuming to find the corresponding region in the histology images for a specific mosaic. Particularly for freehand pCLE imaging, manually finding a large number of pCLE-histology pairs to learn the discriminative features can be challenging. In this paper, we propose to overcome this challenge by developing an Unsupervised Multimodal Graph Mining (UMGM) framework to learn discriminative features for pCLE mosaics. Our approach is inspired by recent advances

Fig. 1. The region correspondence between histology images and pCLE mosaics. (a) is the original histology image with the high resolution 86 k × 68 k; (b) is a subregion of (a) with 40x zoom in; (c) is a subregion of (b) within the field-of-view (FoV) of endomicroscopy and the corresponded pCLE mosaic.

66

Y. Gu et al.

in metric learning based on graph analysis [6,7]. The first step of our work is to extract similar and dissimilar histology patches for a specific pCLE mosaic without supervised information. (In this paper, we use 'similar/dissimilar' and 'positive/negative' interchangeably.) In order to extract sufficient data pairs, we aim to discover the latent similarity between pCLE mosaics and histology patches within the field-of-view (FoV) as well as in the regions outside the FoV. Although a pCLE mosaic corresponds to a small FoV in the complete histology image, we observe that there is a large number of regions outside the FoV with tissue morphology similar to the specific region scanned by pCLE. These regions cover more variation in cell structure and tend to provide extra information for feature learning. Inspired by this observation, a multiscale multimodal graph is built based on pCLE mosaics and histology patches. The latent similarity between pCLE and histology patches is mined by leveraging graph-based analysis over a large collection of histology patches. Specifically, we propose to use a cycle consistency criterion for mining positive pairs and the geodesic distance to find hard negative pairs. The discriminative feature of pCLE is learned by minimizing the distance between positive pairs and maximizing the distance between negative pairs. We validate the performance of the representation on a dataset of 45 cases, which consists of 700 pCLE mosaics and their corresponding histology slides. The experiments demonstrate that the proposed method outperforms the previous unimodal approach [4] and the supervised multimodal method [5].

2 Methodology

2.1 Building and Mining the Multimodal Graph

In this paper, pCLE mosaics are denoted by {X_i^P}, i = 1, ..., n_P, where n_P is the number of pCLE mosaics. Each mosaic is matched with a small FoV in the whole histology image, denoted by {Y_i^P}, i = 1, ..., n_P. We randomly select patches Y_i^{R,s} of size s from the histology images, where s ∈ {128, 256, 512, 1024}. For both pCLE mosaics and histology patches, dense Scale Invariant Feature Transformation (dense-SIFT) [4] is adopted as the visual representation. In order to learn the discriminative features, we build the multimodal graph G = {V, E}. The nodes V in the graph are composed of patch nodes V^p and anchor nodes V^a. The patch nodes are histology patches randomly extracted from the whole histology images. The anchor nodes V^a are doublets of pCLE mosaics and their corresponding histology patches, V^a = {(X_i^P, Y_i^P), i = 1, ..., n_P}. The directed edges E = {e_{i,j}} indicate the connections between nodes, where e_{i,j} = 1 if V_j belongs to the k-nearest neighbors of V_i. The edge weight w_{i,j} between nodes V_i and V_j is defined as follows:

$$w_{i,j} = \begin{cases} \|Y_i^R - Y_j^R\|_2, & \text{if } V_i \in V^p \wedge V_j \in V^p \\ \|Y_i^R - Y_j^P\|_2, & \text{if } V_i \in V^p \wedge V_j \in V^a \\ \|Y_i^P - Y_j^R\|_2, & \text{if } V_i \in V^a \wedge V_j \in V^p \end{cases}$$


Fig. 2. The visualization of graph based on tSNE [8]. (a) is the overview of the multimodal graph; (b) is an example of anchor node and its nearest neighbours

where ‖·‖_2 is the L2 norm. The distance between patch nodes is measured by the Euclidean distance between their visual features. Since the similarity between pCLE mosaics and histology patches cannot be calculated directly yet, the distance between anchor nodes and patch nodes is determined by the distance between the histology images only. A visualization of the graph via tSNE [8] is shown in Fig. 2. The anchor node and its nearest neighbours in Fig. 2(b) represent a dense scatter of cell nuclei. Based on the graph, we then extract positive pairs and negative pairs for feature learning. Starting from an anchor node V_i^a, we define a patch node V_j^p to be an n-order k-nearest neighbor of V_i^a if there exists a directed path of length n from V_i^a to V_j^p. As defined in [6], if V_i belongs to its own n-order k-nearest neighbors, we obtain a directed cycle:

$$V_i \in N_k^{(n)}(V_i), \quad n \ge 2 \qquad (1)$$

where N_k^{(n)}(V_i) is the set of n-order k-nearest neighbors of V_i. As shown in Fig. 3(a), for each anchor node in the multimodal graph, we search its n-order k-nearest neighbours and detect its n-order cycle. An n-order cycle contains the histology image of the anchor node and n − 1 histology patches, which finally generates n different pCLE-histology matching pairs. We combine these positive samples with the pCLE mosaic in the anchor node, denoted by {X_i^(pos), Y_i^(pos)}, as similar pairs used for feature learning. Compared with conventional nearest-neighbour schemes for positive sample mining, the nodes in a cycle indicate a consistent relationship and robustness to outliers. Besides the positive pairs, we also extract negative pairs, i.e., dissimilar samples, to enhance the visual representation. In this paper, we use the geodesic distance between anchor nodes and patch nodes in the multimodal

Fig. 3. Unsupervised graph mining for positive/negative pair extraction.


graph. As shown in Fig. 3(b), we first use the Floyd-Warshall algorithm [9] to find the shortest paths between all nodes in the graph. The geodesic distance g_{ij} is the accumulated edge weight along the shortest path from anchor node V_i^a to patch node V_j^p. We then randomly select, among the image pairs with geodesic distance larger than a threshold d_m, the negative samples, which are denoted by {X_i^(neg), Y_i^(neg)}.
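A compact sketch of the two mining steps, cycle-consistency positives and geodesic-distance negatives, is given below; the brute-force k-NN graph, the simple path enumeration (which stops at the first cycle found), and the vectorized Floyd-Warshall loop are illustrative simplifications of the procedure described above, not the authors' implementation.

```python
import numpy as np

def knn_digraph(feats, k=5):
    """Directed k-NN graph: edge i -> j if j is among the k nearest neighbours of i."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    w = np.full_like(d, np.inf)
    for i, js in enumerate(nbrs):
        w[i, js] = d[i, js]
    return w                                   # weighted adjacency, inf = no edge

def cycle_members(w, anchor, max_len=5):
    """Nodes on a directed path of length <= max_len that returns to `anchor`
    (the cycle-consistency criterion used for positive-pair mining)."""
    adj = np.isfinite(w)
    paths = [[anchor]]
    for _ in range(max_len):
        new_paths = []
        for path in paths:
            for j in np.flatnonzero(adj[path[-1]]):
                if j == anchor and len(path) >= 2:
                    return set(path[1:])       # members of the first cycle found
                if j not in path:
                    new_paths.append(path + [j])
        paths = new_paths
    return set()

def geodesic_negatives(w, anchor, d_min):
    """Floyd-Warshall shortest paths; nodes farther than d_min are negative candidates."""
    g = w.copy()
    np.fill_diagonal(g, 0.0)
    for k_ in range(g.shape[0]):
        g = np.minimum(g, g[:, k_:k_ + 1] + g[k_:k_ + 1, :])
    return np.flatnonzero(g[anchor] > d_min)
```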

2.2 Discriminative Feature Learning

Based on the positive pairs {X_i^(pos), Y_i^(pos)} and negative pairs {X_i^(neg), Y_i^(neg)}, we learn two transformations f_X and f_Y which independently map the pCLE mosaics and the histology images into a latent feature space. These transformations should ensure that the positive pairs are similar in the latent space while the negative pairs are not. Therefore, we build a loss function penalizing distances of positive pairs greater than a threshold and distances of negative pairs smaller than the same threshold. The objective function is:

$$\min_{f_X, f_Y} \sum_{(X_i, Y_i) \in \{X_i^{(pos)}, Y_i^{(pos)}\}} l\big(\alpha_i^{(pos)} (\|f_X(X_i) - f_Y(Y_i)\|_2 - 1)\big) \; - \; \lambda \sum_{(X_i, Y_i) \in \{X_i^{(neg)}, Y_i^{(neg)}\}} l\big(\alpha_i^{(neg)} (\|f_X(X_i) - f_Y(Y_i)\|_2 - 1)\big) \qquad (2)$$

where l(·) is the generalized logistic loss function l(x) = log(1 + e^{βx})/β, and α_i^(pos) and α_i^(neg) are sample weights for the positive and negative pairs, respectively. The sample weights are determined by the accumulated distance from anchor nodes to patch nodes. Although many different kinds of functions can be used to define f_X and f_Y, we adopt the commonly used linear form f_X(x) = W_X^T x and f_Y(y) = W_Y^T y. The problem in Eq. (2) can be solved via simple gradient descent approaches as used in [10]. For pCLE mosaics, the discriminative representation can finally be obtained by multiplying with the transformation matrix, i.e., W_X^T X.
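The sketch below evaluates the objective of Eq. (2) for given linear maps; the matrix shapes, the numerically stable form of the generalized logistic loss, and the absence of the gradient-descent update are assumptions and simplifications for illustration.

```python
import numpy as np

def generalized_logistic(x, beta=5.0):
    """l(x) = log(1 + exp(beta * x)) / beta, computed in a numerically stable way."""
    return np.logaddexp(0.0, beta * x) / beta

def pair_loss(Wx, Wy, X_pos, Y_pos, X_neg, Y_neg, a_pos, a_neg, lam=1.0):
    """Objective of Eq. (2): pull positive pCLE/histology pairs inside a unit margin
    in the shared space and push negative pairs outside it.

    Wx, Wy        : linear maps for pCLE (d_x, d_e) and histology (d_y, d_e) features
    X_pos, Y_pos  : row-aligned matrices of positive pairs; X_neg, Y_neg likewise
    a_pos, a_neg  : per-pair sample weights
    """
    d_pos = np.linalg.norm(X_pos @ Wx - Y_pos @ Wy, axis=1)
    d_neg = np.linalg.norm(X_neg @ Wx - Y_neg @ Wy, axis=1)
    pos_term = generalized_logistic(a_pos * (d_pos - 1.0)).sum()
    neg_term = generalized_logistic(a_neg * (d_neg - 1.0)).sum()
    return pos_term - lam * neg_term
```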

3 Experiments

3.1 Dataset and Experimental Settings

The dataset is collected by a pre-clinical pCLE system (Cellvizio, Mauna Kea Technologies, Paris, France) as described in [11]. Breast tissue samples are obtained from 45 patients that are diagnosed with three main classes including normal, benign and neoplastic. Eight sub-classes are defined based on tissue and diagnosis information. For normal cases, the mosaics are further classified into adipose tissue, elastic fibres, collagen fibres and normal breast lobule. For benign cases, the subclasses contain Dilated Breast Ducts and Fibroadenoma. For neoplastic cases, diagnosis result supports the existence of specific lesion

Unsupervised Feature Learning for Endomicroscopy Image Retrieval

69

including ductal carcinoma in situ (DCIS) and invasive cancers. After completion of pCLE imaging, each sample underwent routine histopathology processing to generate the histology images. We finally obtained 700 pCLE mosaics, 144 of which are matched with histology patches. In the training procedure, we use the 144 pairs of pCLE mosaics and histology images as anchor nodes, while the rest of the data are used in the testing phase. The complete histology images from all 45 patients are scanned with multi-scale windows, and 8,000 patches are randomly selected to build the multimodal graph. For the visual representation, we generate a 500-D BoW SIFT feature for mosaics and a 200-D BoW SIFT feature for histology images. The dimension of the latent space learned by the proposed method is set to 64. In order to evaluate the proposed method, we conduct retrieval tasks based on both main classes and sub-classes. The top-1 and top-5 accuracies are both reported according to the retrieved results. Several baselines are implemented in this paper for comparison, including dense-SIFT [4] and MVMME [5]. The proposed method with only positive pair mining is also set as a baseline for comparison. All experiments are performed with 10-fold cross-validation. The hardware platform for evaluation is a PC with an Intel 2.4 GHz CPU and 16 GB RAM. Methods are implemented in MATLAB. In the experiments, we set the number of nearest neighbours to k = 5. The length of the cycle in Eq. (1) is set to 5. After the positive and negative mining, we finally obtain 720 pairs of similar examples and 5000 dissimilar samples for feature learning.

3.2 Results

In this section, we present the numerical results of the retrieval tasks on the pCLE dataset. Table 1 shows the retrieval performance of multiple baseline approaches and the proposed UMGM. The following observations can be derived:
– Compared with the approaches based on pCLE mosaics only, the accuracy of the retrieval system is effectively improved by multimodal embedding strategies. The histology images corresponding to pCLE mosaics provide informative features to distinguish different types of tissue.
– In the retrieval tasks for the three main classes, we achieve only a slight improvement, with 2% higher accuracy than the unimodal approaches. Since the main-class task combines all fine-grained tissues into the same class, the differences among sub-classes are ignored, and even the unimodal approaches perform well.
– In the retrieval tasks for the eight sub-classes, the improvement gained from the proposed method is significant. Compared with previous works, the top-1 retrieval accuracy is 0.659, while DenseSIFT and MVMME are below 0.6. To illustrate this in detail, we take DCIS and invasive carcinoma as examples. The characteristic of DCIS is the thickened ductal epithelium. However, in some cases the duct structure is not completely scanned, while the remaining tissue within the FoV consists of scattered nuclei. Scattered nuclei can also appear in invasive carcinomas, which leads to incorrect retrieval results. In our approach,


Table 1. Performance of CBMIR tasks. UMGM-Pos indicates the proposed method with only positive examples

Method      Main-class top1   Main-class top5   Sub-class top1   Sub-class top5
DenseSIFT   0.875             0.937             0.562            0.818
MVMME       0.886             0.943             0.568            0.829
UMGM-Pos    0.891             0.937             0.619            0.835
UMGM        0.892             0.962             0.659            0.864

Method      DCIS Top1   DCIS Top5   Invasive Top1   Invasive Top5
DenseSIFT   0.244       0.511       0.648           0.956
MVMME       0.267       0.467       0.713           0.978
UMGM-Pos    0.333       0.533       0.692           0.989
UMGM        0.378       0.578       0.747           0.989

Fig. 4. Examples of retrieval tasks.

Unsupervised Feature Learning for Endomicroscopy Image Retrieval

4

71

Conclusion

In this paper, we propose an Unsupervised Multimodal Graph Mining (UMGM) framework to learn discriminative features for endomicroscopy retrieval. A multiscale multimodal graph is built based on pCLE mosaics and histology patches. The latent similarity among pCLE and histology patches is mined by leveraging graph-based analysis over a large collection of histology patches. The discriminative feature of pCLE is learned via minimizing the distance between positive pairs and maximizing the distance between negative pairs. Compared with the previous works, the embedding of multimodal images contributes to higher accuracy on retrieval tasks without supervised information. Acknowledgement. This work is partially supported by NSFC, China (No: 61572315,6151101179) and 973 Plan, China (No. 2015CB856004). Yun Gu is also supported by Chinese Scholarship Council (CSC). The tissue specimens were obtained from consented patients using the Imperial tissue bank ethical protocol following the R-12047 project.

References 1. Jiang, M., Zhang, S., Huang, J., Yang, L., Metaxas, D.N.: Scalable histopathological image analysis via supervised hashing with multiple features. Med. Image Anal. 34, 3–12 (2016) 2. Zhang, X., Su, H., Yang, L., Zhang, S.: Fine-grained histopathological image analysis via robust segmentation and large-scale retrieval. In: CVPR, pp. 5361–5368 (2015) 3. Andr´e, B., Vercauteren, T., Buchner, A.M., Wallace, M.B., Ayache, N.: A smart atlas for endomicroscopy using automated video retrieval. Med. Image Anal. 15(4), 460–476 (2011) 4. Andr´e, B., Vercauteren, T., Perchant, A., Buchner, A.M., Wallace, M.B., Ayache, N.: Endomicroscopic image retrieval and classification using invariant visual features. In: IEEE ISBI 2009, pp. 346–349. IEEE (2009) 5. Gu, Y., Yang, J., Yang, G.Z.: Multi-view multi-modal feature embedding for endomicroscopy mosaic classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19 (2016) 6. Li, D., Hung, W.-C., Huang, J.-B., Wang, S., Ahuja, N., Yang, M.-H.: Unsupervised visual representation learning by graph-based consistent constraints. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 678–694. Springer, Cham (2016). doi:10.1007/978-3-319-46493-0 41 7. Zhai, X., Peng, Y., Xiao, J.: Heterogeneous metric learning with joint graph regularization for cross-media retrieval. In: AAAI 2013, pp. 1198–1204 (2013) 8. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. JMLR 9, 2579–2605 (2008) 9. Floyd, R.W.: Algorithm 97: shortest path. Commun. ACM 5(6), 345 (1962) 10. Mignon, A., Jurie, F.: Cmml: a new metric learning approach for cross modal matching. In: Asian Conference on Computer Vision, p. 14 (2012) 11. Chang, T.P., Leff, D.R., Shousha, S., Hadjiminas, D.J., Ramakrishnan, R., Hughes, M.R., Yang, G.Z., Darzi, A.: Imaging breast cancer morphology using probe-based confocal laser endomicroscopy: towards a real-time intraoperative imaging tool for cavity scanning. Breast Cancer Res. Treat. 153(2), 299–310 (2015)

Maximum Mean Discrepancy Based Multiple Kernel Learning for Incomplete Multimodality Neuroimaging Data Xiaofeng Zhu, Kim-Han Thung, Ehsan Adeli, Yu Zhang, and Dinggang Shen(B) Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, USA [email protected]

Abstract. It is challenging to use incomplete multimodality data for Alzheimer’s Disease (AD) diagnosis. The current methods to address this challenge, such as low-rank matrix completion (i.e., imputing the missing values and unknown labels simultaneously) and multi-task learning (i.e., defining one regression task for each combination of modalities and then learning them jointly), are unable to model the complex datato-label relationship in AD diagnosis and also ignore the heterogeneity among the modalities. In light of this, we propose a new Maximum Mean Discrepancy (MMD) based Multiple Kernel Learning (MKL) method for AD diagnosis using incomplete multimodality data. Specifically, we map all the samples from different modalities into a Reproducing Kernel Hilbert Space (RKHS), by devising a new MMD algorithm. The proposed MMD method incorporates data distribution matching, pair-wise sample matching and feature selection in an unified formulation, thus alleviating the modality heterogeneity issue and making all the samples comparable to share a common classifier in the RKHS. The resulting classifier obviously captures the nonlinear data-to-label relationship. We have tested our method using MRI and PET data from Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset for AD diagnosis. The experimental results show that our method outperforms other methods.

1

Introduction

Alzheimer’s Disease Neuroimaging Initiative (ADNI) has collected data from various modalities, such as Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), biospecimen, and many others, aiming to use these data to better understand the pathological progression of Alzheimer’s Disease (AD) and to develop accurate AD biomarkers. However, due to budget limitation This work was supported in part by NIH grants (EB006733, EB008374, EB009634, AG041721, and AG042599). X. Zhu was supported in part by the National Natural Science Foundation of China under grant 61573270. c Springer International Publishing AG 2017  M. Descoteaux et al. (Eds.): MICCAI 2017, Part III, LNCS 10435, pp. 72–80, 2017. DOI: 10.1007/978-3-319-66179-7 9

Maximum Mean Discrepancy Based Multiple Kernel Learning

73

and other constraints, not all the modality data were collected for each subject in the study. For example, at baseline, while all the subjects underwent MRI scans, only half of them had PET scans. Assuming that MRI or PET data of one subject can be represented as a row vector, the ADNI neuroimaging multimodality data (e.g., MRI and PET) is block-wise missing, as shown in Fig. 1(a). It is challenging to maximally utilize this kind of incomplete multimodality data (i.e., some modalities are not available for certain subjects) for AD diagnosis. Many AD studies using multimodality data simply dispose the subjects with incomplete data and conduct AD study using only the subjects with complete data [1,5,8,12,14], as shown in Fig. 1(b). This “disposal” method not only significantly reduces the number of the subjects for AD analysis, but also wastes a lot of information in the incomplete subjects, e.g., the red box in Fig. 1(c).

MRI

PET

Y

Available data

Missing data

Imputed data MMD-based RKHS

Train

Test (a)

(b)

(c)

(d)

(e)

Fig. 1. (a) Block-wise incomplete multimodality data, (b) Disposal method, (c) Imputation method, (d) Multi-task learning method, and (e) Proposed method, which nonlinearly maps heterogeneous MRI and PET data into a common RKHS so that they are comparable, and thus allowing to learn a common MKL-based classifier.

Unlike the disposal method, the imputation method and multi-task learning method are designed to utilize all the samples in incomplete multimodality data for AD study. The imputation method imputes the missing data, as shown in Fig. 1(c), so that any machine learning method can be employed subsequently. Unfortunately, current imputation methods, such as expectation maximization and low-rank matrix completion, are only effective when the data are uniformly missing, and become less effective while the data is block-wise missing, as in our case [9,17]. Without the need of imputation, multi-task learning methods [6,15,16], as shown in Fig. 1(d), first divide the incomplete multimodality data into several subsets of complete data, and then jointly learn a classifier for each subset to conduct AD diagnosis. The main drawback for the imputation and the multi-task learning methods is their underlying assumption of linear datato-label relationship, which is insufficient to model the complexity of AD progression. Moreover, the data heterogeneity across the modalities (modality heterogeneity for short) is also ignored in their formulations. On the other hand, though the advanced machine learning method such as Multiple Kernel Learning (MKL), is able to model the complex data-to-label relationship of heterogeneous multimodality data [2,7,13], it is currently only applicable to the set of complete data.

74

X. Zhu et al.

In this paper, we propose a new Maximum Mean Discrepancy (MMD) based MKL, so that we can use MKL to conduct AD diagnosis when the data are block-wise missing, as in our case. To do this, we design a new MMD mapping criterion to map the data from different modalities into a common Reproducing Kernel Hilbert Space (RKHS), as shown in Fig. 1(e). The traditional MMD only considers to minimize the data distribution difference (i.e., a type of high order data relationship) among the modalities [3], while our proposed MMD additionally enforces multiple kernel learning, feature selection and pair-wise sample mismatch minimization. Through the MMD non-linear mapping, the complex data-to-label relationship is captured, the modality heterogeneity is alleviated, and all the data in different modalities become comparable in the RHKS, where a common MKL-based classifier (for these data) is constructed for AD diagnosis.

2

Method

In this paper, we denote X = [xT1 , · · · , xTm ]T ∈ Rm×p and U = [uT1 , · · · , uTn ]T ∈ Rn×q as the Region of Interest (ROI)-based MRI and PET data, respectively, where m and n are the numbers of the samples of the MRI data and the PET data, respectively, p and q indicate the numbers of features in MRI and PET data, respectively, and the superscript T of a matrix indicates its transpose. In addition, y ∈ {−1, 1}m and v ∈ {−1, 1}n denote the diagnostic labels of the MRI and PET data, respectively. 2.1

Maximum Mean Discrepancy Based MKL

Many studies minimize the heterogeneity among the modalities by using Canonical Correlation Analysis (CCA), which maps all the modalities into a common space [7] via pair-wise distance minimization of all the samples. Since CCA uses pair-wise distances, it is unable to deal with the multimodality data with different numbers of samples, as in our case. Thus, in this study, we design a new MMD criterion to relief the modality heterogeneity between MRI and PET. Traditional MMD criterion [4] uses the data distribution mismatch minimization to make the data from different modalities have similar data distribution in the common RKHS, which does not require equal number of the samples from each modality. The empirical estimation of MMD between X and U can be defined as the minimization of the following formulation: 

1 n 1 m φ(xi ) − φ(ui )H , i=1 i=1 m n

(1)

where H is a universal RKHS and φ is a nonlinear feature mapping of an universal kernel. Recall from kernel methods, the inner product between φ(xi ) and φ(xj ) is equivalent to a kernel function, i.e., k(xi , xj ) = φ(xi )T φ(xj ).

Maximum Mean Discrepancy Based Multiple Kernel Learning

75

In MMD, the empirical estimation distance in the RKHS is regarded as the distance between two different data distributions, as in Eq. (1). Actually, Eq. (1) captures high order statistics of multimodality data (i.e., high order moments of probability distribution) [4], so that the multimodality data are effectively transformed into a high-dimensional or even infinite dimensional space through the nonlinear feature mapping φ, where their distributions will be close so that the heterogeneous data are comparable. When the value of Eq. (1) is close to zero, the high order moments of the multimodality data (i.e., their distributions) become matched. Mathematically, the minimization of Eq. (1) can be reduced to the minimization of the following term: 1 n 1 m φ(xi ) − φ(ui )H ⇔ tr(KS), (2) i=1 i=1 m n  (1,1) (1,2)  K K where K = ∈ R(m+n)×(m+n) is a composite kernel matrix with K(2,1) K(2,2) 

T

{K(1,1) = [k(xi , xj )] ∈ Rm×m , K(1,2) = [k(xi , ug )] ∈ Rm×n , K(2,1) = K(1,2) , K(2,2) = [k(ug , ul )] ∈ Rn×n , i, j = 1, ..., m, and g, l = 1, ..., n}, and S = s × sT (where s = [1/m, ..., 1/m, −1/n, ..., −1/n]T ∈ R(m+n) ), and tr(·) is the trace       m

n

operator of a matrix. Equation (2) uses all the ROI-based features of MRI and PET data to build the kernel matrix K. However, not all ROIs are related the AD [9,11], so the resulting K could be noisy. To address this, we design a feature-level version of Eq. (2) to select a subset of MRI features and PET features for AD diagnosis, via first building a kernel for each feature separately and then combining them through their summation. Specifically, we first extend X ∈ Rm×p and U ∈ Rn×q , ˜ = [0n×p , U] ∈ Rn×(p+q) , ˜ = [X, 0m×q ] ∈ Rm×(p+q) and U respectively, to X ˜ and U ˜ into the where 0 is a matrix with all zero elements. We then map X RKHS by assigning a kernel function to each feature: (p+q) ˜ i S), αi K min tr(

(3)

i=1

α

˜ i ∈ R(m+n)×(m+n) (corresponding where αi is the weight of each kernel matrix K ˜ to each feature) and the kernel matrix Ki has four components as the kernel matrix K in Eq. (2). In addition, we also prefer to construct MKL for each feature, rather than fixing a single type of kernel for them, to more flexibly capture nonlinear data-to-label relationships, which leads to (p+q) M ˆ i,j S) ⇔ min β T aaT β, βi,j K min tr( β

i=1

j=1

β

(4)

ˆ i,j ∈ R(m+n)×(m+n) is where M is the number of kernel types and K the kernel matrix of the i-th feature and j-th kernel type, β = [β1,1 , ..., β1,M , ...., β(p+q),M ]T ∈ R((p+q)×M )×1 , a = [a1,1 , ..., a1,M , ...., a(p+q),M ]T ∈ ˆ i,j S). By comparing R((p+q)×M )×1 with its element given as ai,j = tr(K

76

X. Zhu et al.

Eqs. (2) and (3) with Eq. (4), we can see that the original MMD in Eq. (2) [4] has been extended to feature selection based MMD in the MKL framework (i.e., in (p+q) ˜ i , and then with (p+q) M βi,j K ˆ i,j . Eq. (4)), by replacing K with i=1 αi K i=1 j=1 In this way, the problem of minimizing distribution mismatch via MMD is converted to the issue of a MKL with the optimal coefficient vector β in Eq. (4), which is called MMD based MKL in this paper and can be achieved using MKL algorithm [7]. Hence, MMD is embedded into the framework of MKL to capture the nonlinear data-to-label relationship among incomplete multimodality data, where the modalities have different numbers of samples. 2.2

Subject Consistency

In Sect. 2.1, we design a new MMD criterion in the MKL framework to map the available MRI and the PET data (i.e., the left box in Fig. 1(e)) to the RKHS. The pair-wise information between the MRI and PET data of the same subject is not considered yet. As MRI and PET data are mapped into a common RKHS so that they are comparable, we would also like to include the subject consistency in our formulation, i.e., samples from the same subject (but different modalities) should be close to each other in the RKHS. To do this, we constrain that the corresponding MRI and PET data to be consistent in the RKHS. Specifically, we consider the pair-wise sample mismatch minimizations (i.e., minimizing the element-wise similarity for each of pair-wise samples) in the RKHS to conduct subject consistency, i.e., (5) min β T ddT β β

where d = [d1,1 , ..., d1,M , ...., d(p+q),M ]T ∈ R((p+q)×M )×1 , di,j = (kˆi,j − kˆi+m,j ) (where kˆi,j and kˆ(i+m),j , respectively, are the kernel values of the MRI data and their corresponding PET data, i = 1, ..., (p + q), and j = 1, ..., M ). 2.3

Joint Feature Selection and Classification

We use MKL-based max-margin classifier (i.e., SVM) to conduct joint feature selection and classification under two constraints that have been described in the previous sections, i.e., distribution mismatch minimization (Sect. 2.1) and subject consistency (Sect. 2.2). Thus the final objective function of our proposed method is defined as follows: (m+n) yi , f (ˆ xi )) + λ1 β T aaT β + λ2 β T ddT β + λ3 β1 , min 21 f 2H + C i=1 L(ˆ f,β (6) s.t., βi ≥ 0, i = 1, ..., (m + n). ˆ = [y; v] ∈ R(m+n) where C > 0 and λj (j = 1, 2, 3) are the tuning parameters, y ˆ = [X; ˜ U] ˜ = is a vector of diagnostic labels for MRI and PET samples, X ˆT(m+n) ] ∈ R(m+n)×(p+q) is the concatenation of extended MRI and PET [ˆ xT1 ; ...; x feature matrix (Sect. 2.1), f is the prediction function associated with a RKHS H (i.e., f ∈ H and f (ˆ x) = wT φ(ˆ x) + b), and L is the hinge loss function.

Maximum Mean Discrepancy Based Multiple Kernel Learning

3

77

Experiments

We used the ADNI dataset (‘www.adni-info.org’) to conduct experiments and compare with various previous works. The used dataset includes 412 MRI subjects (i.e., 186 ADs and 226 Healthy Controls (HCs)) and 194 PET subjects (i.e., 93 ADs and 101 HCs). More specifically, PET subjects have 218 missing subjects (i.e., 93 ADs and 125 HCs), compared to 412 MRI subjects. In this paper, we use ROI-based features from both MRI and PET images. The MRI data were sequentially preprocessed by anterior commissure and posterior commissure correction, skull-stripping, cerebellum removal, intensity inhomogeneity correction, segmentation, and registration. Subsequently, we dissected a cerebrum into 90 regions by the AAL template, followed by computing the gray matter tissue volume of each region to yield 90 features for an MRI image. We linearly aligned each PET image to its corresponding MRI image, and then used the mean intensity value of each ROI as PET feature. Finally, we used 90-dimension ROI-based features to represent, MRI and PET data, respectively. 3.1

3.1 Experiment Setting

We tested our model by conducting two kinds of binary classification experiments, i.e., AD diagnosis using the incomplete MRI and PET data (the incomplete data experiment) and AD diagnosis using the PET data with the help of the MRI data (the transfer learning experiment). We employed classification accuracy, sensitivity, specificity, and Area Under Curve (AUC) as performance metrics to compare our proposed method with the other methods. The comparison methods for the two experiments include Baseline (i.e., SVM classification using the MRI data for the incomplete data experiment, and using the PET data for the transfer learning experiment), Lasso [10] (i.e., similar to the Baseline except that it performs Lasso feature selection prior to classification), and a multi-task learning method (i.e., the regression-based incomplete Multi-Source Feature learning (iMSF) [11]). In addition, we also included an imputation method (i.e., Low-Rank Matrix Completion with sparse feature selection (LRMC) [9]) for the incomplete data experiment and a popular multiple kernel learning method (i.e., SimpleMKL [7]) for the transfer learning experiment.

3.2 Experimental Results

We present the results of all the methods for the two classification experiments in Fig. 2. The results of the incomplete data experiment indicate that the proposed method performs consistently better than all the comparison methods in terms of all four evaluation metrics. For example, in terms of accuracy, our method (i.e., 90.9%) on average outperforms the Baseline, Lasso, iMSF, and LRMC methods by 9.1%, 5.9%, 4.7%, and 3.9%, respectively. The superiority of our proposed method is probably due to the nonlinear data-to-label mapping, modality heterogeneity alleviation, and joint feature selection and classification in the RKHS of our proposed model.

Fig. 2. Comparisons between the proposed method and the comparison methods in two classification experiments, i.e., the incomplete data experiment (Upper row) and the transfer learning experiment (Bottom row). Error bars: standard deviations; *: statistically significant.

We also observe that all the methods with feature selection (i.e., our proposed method, LRMC, iMSF, and Lasso) outperform the Baseline method, which did not conduct any feature selection. This shows that feature selection is necessary for AD study, which is consistent with the findings in [9,11]. In the transfer learning experiment, we use the MRI data to assist AD diagnosis on PET data, so LRMC cannot be used for this experiment; instead we use SimpleMKL, which conducts MKL for AD diagnosis using all the PET data and their corresponding MRI data, as one of the comparison methods. According to the experimental results, our proposed method still outperforms all the comparison methods. For example, averaged over the four evaluation metrics, the proposed method improves by 8.9% and 3.9% compared with Baseline and SimpleMKL (which achieves the best performance of all the comparison methods), respectively. Comparing the nonlinear feature selection methods (i.e., our proposed method and SimpleMKL) with the linear feature selection methods (i.e., iMSF and Lasso), the nonlinear methods are better than the linear methods in our experiments. This is probably due to the fact that there is a nonlinear relationship between the data features and the labels. In addition, we also performed paired t-tests between our results and the results of the other methods as a significance test. We report the outcomes of the paired t-tests in Fig. 2, marking statistically significant differences (between our method and each comparison method at the 95% confidence level) with asterisks (*). The results show that most of the improvements of the proposed method are statistically significant in our experiments.
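The significance testing mentioned above can be illustrated with a short sketch; the per-fold accuracies below are made-up placeholders, and the standard scipy paired t-test stands in for the authors' exact script.

# Sketch of the paired t-test used to mark significant improvements (asterisks in Fig. 2).
import numpy as np
from scipy import stats

acc_proposed = np.array([0.92, 0.90, 0.91, 0.90])   # per-fold accuracies (placeholders)
acc_baseline = np.array([0.83, 0.81, 0.82, 0.81])

t_stat, p_value = stats.ttest_rel(acc_proposed, acc_baseline)   # paired t-test across folds
significant = p_value < 0.05                                    # 95% confidence level
print(t_stat, p_value, significant)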

4 Conclusion

In this paper, we proposed an MMD-based MKL method for AD diagnosis using incomplete multimodality neuroimaging data, which is able to capture the nonlinear data-to-label relationship, relieve modality heterogeneity, and utilize all the available samples from different modalities to learn a classifier. To do so, we incorporate feature selection, data distribution and pair-wise sample mismatch minimizations, and classifier learning into an MKL formulation, to concurrently map all the multimodality data into a common RKHS and learn a common classifier for all the modalities. The experimental results also confirmed the superiority of our proposed method compared with the other methods.

References
1. Adeli, E., et al.: Joint feature-sample selection and robust diagnosis of Parkinson's disease from MRI data. NeuroImage 141, 206–219 (2016)
2. Adeli, E., et al.: Kernel-based joint feature selection and max-margin classification for early diagnosis of Parkinson's disease. Sci. Reports 7 (2017)
3. Bach, F.R., et al.: Multiple kernel learning, conic duality, and the SMO algorithm. In: ICML, p. 6 (2004)
4. Borgwardt, K.M., et al.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14), e49–e57 (2006)
5. Hor, S., Moradi, M.: Learning in data-limited multimodal scenarios: scandent decision forests and tree-based features. Med. Image Anal. 34, 30–41 (2016)
6. Hu, R., et al.: Graph self-representation method for unsupervised feature selection. Neurocomputing 220, 130–137 (2017)
7. Rakotomamonjy, A., et al.: SimpleMKL. J. Mach. Learn. Res. 9, 2491–2521 (2008)
8. Thung, K., et al.: Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion. NeuroImage 91, 386–400 (2014)
9. Thung, K.-H., Adeli, E., Yap, P.-T., Shen, D.: Stability-weighted matrix completion of incomplete multi-modal data for disease diagnosis. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 88–96. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_11
10. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)
11. Yuan, L., et al.: Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. NeuroImage 61(3), 622–632 (2012)
12. Zhang, S., et al.: Learning k for kNN classification. ACM TIST 8(3), 43:1–43:19 (2017)
13. Zhu, X., et al.: Subspace regularized sparse multitask learning for multiclass neurodegenerative disease identification. IEEE Trans. Biomed. Eng. 63(3), 607–618 (2016)
14. Zhu, X., et al.: A novel relational regularization feature selection method for joint regression and classification in AD diagnosis. Med. Image Anal. 38, 205–214 (2017)
15. Zhu, X., et al.: Robust joint graph sparse coding for unsupervised spectral feature selection. IEEE Trans. Neural Netw. Learning Syst. 28(6), 1263–1275 (2017)

16. Zhu, Y., et al.: Early diagnosis of Alzheimer disease by joint feature selection and classification on temporally structured support vector machine. In: MICCAI, pp. 264–272 (2016)
17. Zhu, Y., Zhu, X., Zhang, H., Gao, W., Shen, D., Wu, G.: Reveal consistent spatial-temporal patterns from dynamic functional connectivity for autism spectrum disorder identification. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 106–114. Springer, Cham (2016). doi:10.1007/978-3-319-46720-7_13

Liver Tissue Classification in Patients with Hepatocellular Carcinoma by Fusing Structured and Rotationally Invariant Context Representation

John Treilhard1(B), Susanne Smolka3,4, Lawrence Staib1,2,3, Julius Chapiro3, MingDe Lin3,5, Georgy Shakirin6, and James S. Duncan1,2,3

1 Department of Biomedical Engineering, Yale University, New Haven, CT 06520, USA
[email protected]
2 Department of Electrical Engineering, Yale University, New Haven, CT 06520, USA
3 Department of Radiology and Biomedical Imaging, Yale University, New Haven, CT 06520, USA
4 Charité University Hospital, 10117 Berlin, Germany
5 Philips Research North America, Cambridge, MA 02141, USA
6 Philips Research Aachen, 52074 Aachen, Germany

Abstract. This work addresses multi-class liver tissue classification from multi-parameter MRI in patients with hepatocellular carcinoma (HCC), and is among the first to do so. We propose a structured prediction framework to simultaneously classify parenchyma, blood vessels, viable tumor tissue, and necrosis, which overcomes limitations related to classifying these tissue classes individually and consecutively. A novel classification framework is introduced, based on the integration of multiscale shape and appearance features to initiate the classification, which is iteratively refined by augmenting the feature space with both structured and rotationally invariant label context features. We study further the topic of rotationally invariant label context feature representations, and introduce a method for this purpose based on computing the energies of the spherical harmonic decompositions computed at different frequencies and radii. We test our method on full 3D multi-parameter MRI volumes from 47 patients with HCC and achieve promising results. Keywords: Classification · Structured prediction · Rotationally invariant context features · Spherical harmonics · HCC · MRI

1 Introduction

Hepatocellular carcinoma (HCC) is the most common primary cancer of the liver, its worldwide incidence is increasing, and it is the second most common cause of cancer-related death [4]. Multi-parameter MRI is extremely useful for detecting and surveilling HCC, and in this work, we consider the problem of

Fig. 1. Left: DCE-MRI sequence, clockwise from top left: pre-contrast phase, arterial phase, portal venous phase, delayed phase. Right: 3D view of liver tissue classes from gold standard radiologist segmentation: green - whole liver, red - viable tumor, blue - necrosis, yellow - vasculature.

automated classification of pathological and functional liver tissue from multiparameter MRI in patients with HCC. We present a structured prediction framework for liver tissue classification. The contributions of our work are: (1) to our knowledge, this is among the first works on fully automatic multi-class liver tissue classification from multi-parameter MRI in patients with HCC, despite this modality being a gold-standard for HCC diagnosis and surveillance, (2) the method is based on a novel integration of multi-scale shape (Frangi vesselness) and appearance (mean, median, standard deviation, gradient) features to implement a fine-grained, clinically relevant, and simultaneous delineation of four tissue classes: viable tumor tissue, necrotic tissue, vasculature, and parenchyma, (3) the introduction of a novel method for rotationally-invariant context feature representation via the energies of the spherical harmonic context representation computed at different frequencies and radii, and (4) the segmentation is refined by a novel iterative classification strategy which fuses structured and rotationally-invariant semantic context representation. Our method is applied to a clinical dataset consisting of full 3D multi-parameter MRI volumes from 47 patients with HCC (Fig. 1).

2 Methods

2.1 Shape and Appearance Features

Variations in patient physiology and HCC's variable radiologic appearance necessitate the integration of multi-scale shape and appearance features in order to discriminate between tissue classes. Multi-scale shape and appearance features were extracted by computing Haralick textural features (contrast, energy, homogeneity, and correlation) and applying multi-scale median, mean, and standard deviation filters. The distance to the surface of the liver was associated as a feature with each voxel. Frangi filtering was applied to each temporal kinetic image to generate an image in which noise and background have been suppressed, and

tubular structures (namely vessels) are enhanced. To account for variable vessel width, the filter response is computed at multiple scales and the maximum vesselness found at any scale is taken as the final result.
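A hedged sketch of this multi-scale feature stack is given below; the scale choices and helper names are assumptions, and standard scipy/scikit-image filters stand in for the authors' implementation.

# Sketch of per-voxel multi-scale appearance features plus max-over-scales Frangi vesselness.
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import frangi

def voxelwise_features(volume, scales=(1, 2, 4)):
    """Stack multi-scale mean/median/std maps with a Frangi vesselness map for a 3D volume."""
    feats = []
    for s in scales:
        size = 2 * s + 1
        feats.append(ndi.uniform_filter(volume, size))               # multi-scale mean
        feats.append(ndi.median_filter(volume, size))                # multi-scale median
        mean = ndi.uniform_filter(volume, size)
        sq_mean = ndi.uniform_filter(volume ** 2, size)
        feats.append(np.sqrt(np.maximum(sq_mean - mean ** 2, 0)))    # multi-scale standard deviation
    # Frangi vesselness: responses computed over several sigmas, maximum kept per voxel
    feats.append(frangi(volume, sigmas=scales, black_ridges=False))
    return np.stack(feats, axis=-1)   # per-voxel feature vectors, shape (x, y, z, n_features)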

2.2 Label Context Features

Rotation-invariant context features. Our method incorporates the “auto-context” structured prediction framework [6], which takes the approach of training a cascade of classifiers, and using the probabilistic label predictions output by the (n − 1)st classifier as additional label context feature inputs for the nth classifier. Adopting a structured classification framework can yield significantly improved performance; however, one critical question concerns the nature of the representation of label context features. One typical approach is to sparsely extract label context probabilities for each tissue class from a neighborhood of each voxel, and concatenate these orderly into a 1D vector [6]. This approach has a key advantage – namely, that it preserves the structural and semantic relationships between the extracted pixels: the process of orderly concatenation ensures that any two given positions in the label context feature representation share a consistent spatial orientation relationship and represent probabilities of consistent tissue classes. For these reasons, we call context features represented in this way structured context features. On the other hand, structured context features are highly dependent on the arbitrary orientation of the label context patch from which they are drawn. However, in the present setting, patches within the liver do not have a canonical frame of reference, and therefore, alternative rotationally invariant label context feature representations potentially have some advantages over structured context feature representations in terms of improving the generalization performance of a structured classifier. More specifically, an orientation-invariant representation is a mapping $T : \mathbb{R}^{d^3} \to \mathbb{R}^{d'}$ which maps a $d \times d \times d$ patch into a 1D representation, such that

$T(\mathbf{P}) = T(R(\mathbf{P})), \qquad (1)$

where $R : \mathbb{R}^{d^3} \to \mathbb{R}^{d^3}$ denotes the operator rotating the patch in 3D. The patch P in this setting refers to a patch of label probabilities corresponding to a particular tissue class. It is clear that the mapping T which simply reorders the elements of a 3D patch into a 1D vector does not satisfy Eq. (1). Previous work on orientation invariant representation of contextual features [2] involves the decomposition of the 3D image patch into concentric spherical shells, the computation of the distribution (via a histogram) of the label context features on each of these shells, and finally the concatenation of these histograms¹. More precisely, define the spherical shells S about voxel i as:

$S_r(i) = \{k \in \Omega : \|k - i\| = r\}. \qquad (2)$

¹ [2] actually proposes the use of “spin-context”, which involves computing “soft” histograms, but the principle is the same.

Furthermore, denote the histogram operator with m bins by $H_m : \mathbb{R}^* \to \mathbb{N}^m$ (where $\mathbb{R}^*$ denotes a list of numbers of arbitrary length). It is clear that the mapping

$T(\mathbf{P}) = [H_m(\mathbf{P}|_{S_r})]_{r=1}^{R} \qquad (3)$

satisfies the rotation invariance property given in (1), where $\mathbf{P}|_{S_r}$ denotes the restriction of P to $S_r$. We note however that, despite their orientation invariance, using the distributions of contextual features on concentric spherical shells involves the complete loss of structural information about the label context. Therefore, inspired by work in the field of shape descriptors [1], we propose an alternative representation of label context features by decomposing the label context features on spherical shells into their spherical harmonic representation, and using the norm of the energy of each frequency as features. We emphasize that the application of spherical harmonic decomposition for structure-preserving 3D rotation invariant label context representation has not appeared before in the literature. Following the notation in [1], let $f(\theta, \phi)$ denote a function defined on the surface of a sphere; then $f(\theta, \phi)$ admits a rotation invariant representation as

$SH(f) = \{\|f_l(\theta, \phi)\|_2\}_{l=0}^{\infty}, \qquad f_l(\theta, \phi) = \sum_{m=-l}^{l} a_{lm} Y_l^m(\theta, \phi), \qquad (4)$

where $\{Y_l^m(\theta, \phi)\}$ are a set of orthogonal basis functions for the space of $L^2$ functions defined on the surface of a sphere, called spherical harmonics. In practice we limit the bandwidth of the representation to some small L > 0. Returning to our original motivation, we define the operator T acting on the label context patch P by:

$T(\mathbf{P}) = [SH(\mathbf{P}|_{S_r})]_{r=1}^{R}. \qquad (5)$

We note that this alternative representation has some advantages relative to the representation in (3), particularly the preservation of structural properties regarding the distribution of context features, but their relative performance will ultimately be determined by the nature of the specific problem, data, and classification method under consideration. Note also that our notation suppresses the fact that these rotationally invariant representations are computed from the probability maps generated for each tissue class separately, and subsequently concatenated, i.e. if $\mathbf{P}^u$ denotes the patch of label probabilities corresponding to tissue class u (where there are U tissue classes in total), then

$T(\mathbf{P}) = [SH(\mathbf{P}^u|_{S_r})]_{r=1..R,\; u=1..U}. \qquad (6)$
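As a rough illustration of the two rotation-invariant encodings in Eqs. (3) and (5), the sketch below computes shell histograms and per-degree spherical-harmonic energies for a single label-probability patch. The shell extraction, bin count, radii and bandwidth L are illustrative assumptions, and the harmonic coefficients are approximated by a discrete inner product over the shell samples rather than an exact expansion.

# Sketch of shell-histogram (Eq. 3) and spherical-harmonic-energy (Eq. 5) context encodings.
import numpy as np
from scipy.special import sph_harm

def shell_coords(patch_shape, r, tol=0.5):
    """Mask and angular coordinates of voxels at distance ~r from the patch centre."""
    c = (np.array(patch_shape) - 1) / 2.0
    idx = np.indices(patch_shape).reshape(3, -1).T - c
    d = np.linalg.norm(idx, axis=1)
    on_shell = np.abs(d - r) < tol
    x, y, z = idx[on_shell].T
    polar = np.arccos(np.clip(z / np.maximum(d[on_shell], 1e-9), -1, 1))   # colatitude
    azimuth = np.mod(np.arctan2(y, x), 2 * np.pi)
    return on_shell, polar, azimuth

def shell_histogram(prob_patch, radii=(1, 2, 3), bins=8):
    """Eq. (3): concatenated histograms of label probabilities on each shell."""
    flat = prob_patch.ravel()
    feats = []
    for r in radii:
        on_shell, _, _ = shell_coords(prob_patch.shape, r)
        h, _ = np.histogram(flat[on_shell], bins=bins, range=(0.0, 1.0))
        feats.append(h)
    return np.concatenate(feats)

def sh_energy(prob_patch, radii=(1, 2, 3), L=4):
    """Eq. (5): per-degree spherical-harmonic energies of the probabilities on each shell."""
    flat = prob_patch.ravel()
    feats = []
    for r in radii:
        on_shell, polar, azimuth = shell_coords(prob_patch.shape, r)
        vals = flat[on_shell]
        for l in range(L + 1):
            # a_lm approximated by a discrete inner product over the shell samples
            coeffs = [np.mean(vals * np.conj(sph_harm(m, l, azimuth, polar)))
                      for m in range(-l, l + 1)]
            feats.append(np.linalg.norm(coeffs))   # ||f_l||_2 is invariant to 3D rotations
    return np.asarray(feats)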

Structured and rotationally-invariant context integration. We note however that both rotation-invariant representations (3) and (5) share common disadvantages relative structured context feature representation: in particular, both fail to capture information between shells, and in addition they are computed separately for each tissue class and do not represent the relationship between probabilities of specific voxels belonging to different tissue classes. Since

both structured and rotationally-invariant context representations are endowed with advantages and disadvantages, we propose to integrate them into a single representation, which could exploit their respective strengths and minimize their weaknesses. Therefore we propose a unified structured and rotationally-invariant context representation:

$T(\mathbf{P}) = [T_1(\mathbf{P}), T_2(\mathbf{P})], \qquad (7)$

where $T_1(\mathbf{P})$ is given by the orderly concatenation of the elements of P, and $T_2(\mathbf{P})$ is given by (3) or (5) (Fig. 2).

Fig. 2. Representation of the unique iterative classification method: (a) multi-parameter MRI input images, (b) multi-class classifier, in our case a random forest, (c) tissue-specific probability maps output by the classifier, (d) structured and rotationally-invariant label context representations are computed and used as input to another classifier, in addition to the original multi-parameter MRI imaging, (f) tissue-specific probability maps output by the classifier, and the entire process is iterated.

Classification. We introduce a structured classification framework where multi-scale shape and appearance features are extracted at each voxel from the multi-parameter MRI and used to train a random forest classifier with bagging and random node optimization. A cascade of random forest classifiers is trained, where the features used to train classifier n are augmented by the label context of each voxel, inferred from the output of classifier (n − 1), with the unique feature that these contextual features are represented both in structured and rotationally-invariant forms - see (7). At classification stage t, the training data is given by:

$S_t = \{(y_{ji}, (X_j(N_i), T(\mathbf{P}_j^{(t-1)}(i)))), \; j = 1..m, \; i = 1..n\}, \qquad (8)$

where m denotes the number of images in the training set, n denotes the number of voxels in each training image, $y_{ji}$ denotes the (tissue class) label of voxel i in image j, $X_j(N_i)$ denotes the concatenation of multi-scale shape and appearance features (described in Sect. 2.1) in a neighborhood $N_i$ of voxel i, the mapping T is given by (5), and $\mathbf{P}_j^{(t-1)}(i)$ denotes the patch of label probabilities surrounding voxel i in image j at classification iteration (t − 1) (Fig. 3).
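A minimal sketch of this auto-context cascade is given below, assuming pre-computed appearance features and a helper context_fn that stands in for the structured and rotation-invariant encodings of the probability maps; scikit-learn's random forest replaces the authors' specific implementation, and the number of stages is an assumption.

# Sketch of the iterative classification cascade of Eq. (8).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_cascade(appearance_feats, labels, context_fn, n_stages=3):
    """appearance_feats: (n_voxels, n_feats); context_fn maps class-probability maps to T(P) features."""
    stages, probs = [], None
    for t in range(n_stages):
        X = appearance_feats if probs is None else np.hstack(
            [appearance_feats, context_fn(probs)])      # augment with label-context features
        clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
        clf.fit(X, labels)
        probs = clf.predict_proba(X)                     # tissue-class probabilities, (n_voxels, n_classes)
        stages.append(clf)
    return stages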

Fig. 3. Left: Viable tumor tissue classification, where blue represents the gold-standard segmentation, green represents the classification based on structural context features alone, and red represents the segmentation achieved by our method integrating structured and rotationally invariant context features. Right: Dice similarity coefficient (DSC) for viable tumor tissue classification for each classification iteration ("Spher. Harm." = spherical harmonic contextual feature encoding, "Shell Hist." = histograms of contextual features on shells).

3 Experiments

3.1 Data

We analyzed T1-w dynamic contrast enhanced (DCE) and T2-w MRI data sets from 47 patients with HCC. The DCE sequences consisted of 3 timepoints: the pre-contrast phase, the arterial phase (20 s after the injection of Gadolinium contrast agent), and the portal venous phase (70 s after the injection of Gadolinium contrast agent). Parenchyma, viable tumor, necrosis, and vasculature were segmented on the arterial phase image by a medical student and confirmed by an attending radiologist.

3.2 Numerical Results and Discussion

We evaluated our method on the dataset described in Sect. 3.1. Regarding the results presented in Table 1, we observe first that classification based solely on bias field corrected pre-contrast, arterial, and portal venous phase intensities (row 1 of Table 1) provides a far inferior result to the classification computed using higher order features, for example multi-scale vessel filters and Haralick textural features. Furthermore, we observe a performance improvement by virtue of integrating structural and rotationally-invariant context feature representations; for the viable tumor tissue and necrosis tissue classes, the classification was significantly better (parametric paired two-tailed t-test, p < 0.05) compared to using multi-scale shape and appearance features alone. We conclude that although both shell histogram and spherical harmonic context features add significant value following integration with structured context features, neither significantly outperformed the other in this problem. However, given a different problem setting and data, one might potentially outperform the other, and this deserves empirical evaluation with further data.

Table 1. Dice similarity coefficient evaluation of classification results.

Features                                  Viable tumor  Necrosis  Vasculature
(1) Intensity only                        0.506         0.342     0.491
(2) Multi-scale shape + appearance        0.580         0.449     0.547
(3) = (2) + Structural Context            0.643         0.527     0.544
(4) = (2) + Shell Histogram Context       0.659         0.494     0.555
(5) = (2) + Spherical Harmonic Context    0.652         0.514     0.549
(6) = (3) + Spherical Harmonic Context    0.661         0.544     0.552
(7) = (3) + Shell Histogram Context       0.678         0.530     0.557

Regarding the overall performance, we note that the inter-reader variability for the Dice similarity coefficient for segmenting whole HCC lesions is reported as 0.7 [5]. In the context of this paper, we are assessing the accuracy of a much more difficult task, which is the segmentation of viable and necrotic tumor tissue segmentations, which could reasonably have a yet higher inter-reader variability. The inter-reader variability could be regarded as an upper-bound on how well an automated method could perform, and from this perspective, the results indicate the method is effective in the setting of the given problem. The MICCAI Multimodal Brain Tumor Image Segmentation Benchmark (BRATS - [3]) presents results related to the multi-class classification of brain tissue in patients with gliomas - they report slightly higher Dice similarity coefficient scores for the segmentations corresponding to the top performing methods, but we emphasize that we are considering different tissue classes in a different organ, and so the results are not directly comparable. Moreover, we are (to our knowledge) among the first to present multi-class tissue classification from MRI for a relatively large cohort of HCC patients.

4 Conclusion and Future Work

In this work we presented a novel method for addressing a previously unconsidered clinical problem: multi-class tissue classification from multi-parameter MRI in patients with HCC. The method integrated multi-scale shape and appearance features, along with structured and rotationally-invariant label context feature representations in a structured prediction framework. We also introduce a novel method for the rotationally-invariant encoding of context features and demonstrated that it delivered a significant performance improvement upon integration with structured context features, and our study considered full 3D clinical volumes in a substantial patient cohort. Furthermore, the methods we introduce here are independent of the specifics of our implementation, and can be translated to many other problems in the structured prediction setting. Acknowledgements. This research was supported in part by NIH grant R01CA206180.

References
1. Kazhdan, M., Funkhouser, T., Rusinkiewicz, S.: Rotation invariant spherical harmonic representation of 3D shape descriptors. In: Eurographics Symposium on Geometry Processing, pp. 156–165 (2003)
2. McKenna, S.J., Telmo, A., Akbar, S., Jordan, L., Thompson, A.: Immunohistochemical analysis of breast tissue microarray images using contextual classifiers. J. Path. Inform. 4(2), 13 (2013)
3. Menze, B.H., Jakab, A., Bauer, S., et al.: The multimodal brain tumor segmentation benchmark (BRATS). IEEE Trans. Med. Imag. 34, 1993–2024 (2015)
4. Park, J.-W., Chen, M., Colombo, M., et al.: Global patterns of hepatocellular carcinoma management from diagnosis to death: the BRIDGE Study. Liver Int. 35(9), 2155–2166 (2015)
5. Tacher, V., Lin, M., Chao, M., et al.: Semiautomatic volumetric tumor segmentation for hepatocellular carcinoma. Acad. Radiol. 20, 446–452 (2013)
6. Tu, Z., Bai, X.: Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1744–1757 (2010)

DOTE: Dual cOnvolutional filTer lEarning for Super-Resolution and Cross-Modality Synthesis in MRI

Yawen Huang1(B), Ling Shao2, and Alejandro F. Frangi1

1 Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield, UK
{yhuang36,a.frangi}@sheffield.ac.uk
2 School of Computing Sciences, University of East Anglia, Norwich, UK
[email protected]

Abstract. Cross-modal image synthesis is a topical problem in medical image computing. Existing methods for image synthesis are either tailored to a specific application, require large scale training sets, or are based on partitioning images into overlapping patches. In this paper, we propose a novel Dual cOnvolutional filTer lEarning (DOTE) approach to overcome the drawbacks of these approaches. We construct a closed loop joint filter learning strategy that generates informative feedback for model self-optimization. Our method can leverage data more efficiently thus reducing the size of the required training set. We extensively evaluate DOTE in two challenging tasks: image super-resolution and cross-modality synthesis. The experimental results demonstrate superior performance of our method over other state-of-the-art methods.

Keywords: Dual learning · Convolutional sparse coding · 3D · Multi-modal · Image synthesis · MRI

1 Introduction

In medical image analysis, it is sometimes convenient or necessary to infer an image of one modality or resolution from an image of another modality or resolution for better disease visualization, prediction and detection purposes. A major challenge of cross-modality image segmentation or registration comes from the differences in tissue appearance or spatial resolution in images arising from different physical acquisition principles or parameters, which translates into the difficulty of representing and relating these images. Some existing methods tackle this problem by learning from a large amount of registered images and constraining pairwise solutions in a common space. In general, one would desire to have high-resolution (HR) three-dimensional Magnetic Resonance Imaging (MRI) with near isotropic voxel resolution as opposed to the more common image stacks of multiple 2D slices for accurate quantitative image analysis and diagnosis. Multi-modality

imaging can generate tissue contrast arising from various anatomical or functional features that present complementary information about the underlying organ. Acquiring low-resolution (LR) single-modality images, however, is not uncommon. To solve the above problems, super-resolution (SR) [1,2] reconstruction is carried out to recover an HR image from its LR counterpart, and cross-modality synthesis (CMS) [3] is proposed for synthesizing target modality data from available source modality images. Generally, these methods have explored image priors from either internal similarities of the image itself [4] or external data support [5] to construct the relationship between two modalities. Although these methods achieve remarkable results, most of them suffer from the fundamental limitations associated with large scale pairwise training sets or the patch-based overlapping mechanism. Specifically, a large amount of multi-modal images is often required to learn sufficiently expressive dictionaries/networks. However, this is impractical since collecting medical images is very costly and limited by many factors. On the other hand, patch-based methods are subject to inconsistencies introduced during the fusion process that takes place in areas where patches overlap. To deal with the bottlenecks of training data and patch-based implementation, we develop a dual convolutional filter learning (DOTE) method with an application to neuroimaging that investigates data (in both source and target modalities from the same set of subjects) in a more effective way, and solves the image SR and CMS problems respectively. The contributions of this work are mainly in four aspects: (1) We present a unified model (DOTE) for any cross-modality image synthesis problem; (2) The proposed method can efficiently reduce the amount of training data needed by the model, by generating abundant feedback from dual mapping functions during the training process; (3) Our method integrates feature learning and mapping relations in a closed loop for self-optimization. Local neighbors are preserved intrinsically by directly working on the whole images; (4) We evaluate DOTE on two datasets in comparison with state-of-the-art methods. Experimental results demonstrate superior performance of DOTE over these approaches.

Fig. 1. Flowchart of the proposed method for MRI cross-modality synthesis.

2 Method

2.1 Background

Convolutional Sparse Coding (CSC) remedies a fundamental drawback of conventional patch-based sparse representation methods by modeling shift invariance for consistent approximation of local neighbors on whole images. Instead of decomposing the vector as the multiplication of dictionary atoms and the coded coefficients, CSC provides a more elegant way to model local interactions, that is, by representing an image as the summation of convolutions of sparsely distributed feature maps and the corresponding filters. Concretely, given an m × n image x in vector form, the problem of learning a set of vectorized filters and sparse feature maps is solved by minimizing an objective function that combines a convolutional least-squares term and an l1-norm penalty on the representations:

$\arg\min_{f,s} \; \frac{1}{2}\Big\|x - \sum_{k=1}^{K} f_k * s_k\Big\|_2^2 + \lambda \sum_{k=1}^{K} \|s_k\|_1$
$\text{s.t. } \|f_k\|_2^2 \le 1 \quad \forall k = \{1, ..., K\}, \qquad (1)$

where $f_k \in f = [f_1^T, ..., f_K^T]^T$ is the k-th d × d filter, * denotes the 2D convolution operator, $s_k \in s = [s_1^T, ..., s_K^T]^T$ refers to the sparse feature map corresponding to $f_k$, with size (m + d − 1) × (n + d − 1), that approximates x, and λ is a regularization parameter. The problem in Eq. (1) can be efficiently and explicitly solved in the Fourier domain, derived within an Alternating Direction Method of Multipliers (ADMM) framework [6].

Dual Learning (DL) [7] is a new learning paradigm that translates the input model by forming a closed loop between source and target domains to generate informative feedback. Specifically, for any dual tasks (e.g., A ↔ B), the DL strategy appoints A → B as the primary task and A ← B as the dual task, and forces them to learn from each other to produce the pseudo-input A'. It can achieve comparable performance by iteratively updating and minimizing the reconstruction error A − A', which helps maximize the use of the data and therefore makes learning-based methods less dependent on large amounts of training data.

Problem Formulation: The cross-modality image synthesis problem can be formulated as: given a 3D image X of modality M1, the task is to infer from X a target 3D image Y that approximates the ground truth of modality M2. Let X = [X_1, ..., X_C] ∈ R^{m×n×z×C} be a set of images of modality M1 in the source domain, and Y = [Y_1, ..., Y_C] ∈ R^{m×n×z×C} be a set of images of modality M2 in the target domain. m, n are the dimensions of the axial view of the image, z denotes the size of the image along the z-axis, and C is the number of elements in the training sets. Each pair {X_i, Y_i} ∀i = {1, ..., C} is registered. To bridge image appearances across different modalities while preserving the

intrinsic local interactions (i.e., intra-domain consistency), we propose a method based on CSC to jointly learn a pair of filters F^x and F^y. Moreover, inspired by the DL strategy, we form a closed loop between both domains and assume that there exists a primal mapping function F(·) from X to Y for relating and predicting from one another. We also assume there exists a dual mapping function G(·) from Y to X to generate feedback for model self-optimization. Experimentally, we investigate human brain MRI and apply our method to two cross-modality synthesis tasks, i.e., image SR and CMS. An overview of our method is depicted in Fig. 1.

Notation: Matrices and 3D images are written in bold uppercase (e.g., image X), vectors and vectorized 2D images in bold lowercase (e.g., filter f), and scalars in lowercase (e.g., element k).

2.2 Dual Convolutional Filter Learning

Inspired by CSC (cf. Sect. 2.1) and the benefits of conventional coupled sparsity, we propose a dual convolutional filter learning (DOTE) model, which extends the original CSC formulation with a DL strategy and a joint representation in a unified framework. More specifically, given X together with the corresponding Y for training, in order to facilitate a joint mapping, we associate the sparse feature maps of each registered data pair {X_i, Y_i}, i = 1, ..., C, by constructing a forward mapping function F : X → Y with Y = F(X). Since such a cross-modality synthesis problem satisfies a dual-learning mechanism, we further leverage the duality of the bidirectional transformation between the two domains, i.e., by establishing a dual mapping function G : Y → X with X = G(Y). Incorporating the feature map representations and the above closed-loop mapping functions, we can thus derive the following objective function:

$\arg\min_{F^x, F^y, S^x, S^y, W} \; \frac{1}{2}\Big\|X - \sum_{k=1}^{K} F_k^x * S_k^x\Big\|_F^2 + \frac{1}{2}\Big\|Y - \sum_{k=1}^{K} F_k^y * S_k^y\Big\|_F^2 + \gamma \sum_{k=1}^{K} \|W_k\|_F^2$
$\quad + \lambda \Big(\sum_{k=1}^{K} \|S_k^x\|_1 + \sum_{k=1}^{K} \|S_k^y\|_1\Big) + \beta \sum_{k=1}^{K} \Big(\|S_k^y - W_k S_k^x\|_F^2 + \|S_k^x - W_k^{-1} S_k^y\|_F^2\Big)$
$\text{s.t. } \|f_k^x\|_2^2 \le 1, \; \|f_k^y\|_2^2 \le 1 \quad \forall k = \{1, ..., K\}, \qquad (2)$

where S^x_k and S^y_k are the k-th sparse feature maps that approximate the data X and Y when convolved with the k-th filters F^x_k and F^y_k of a fixed spatial support, k = 1, ..., K. ||·||_F is the Frobenius norm, chosen to induce the convolutional least-squares approximation, * denotes the 3D convolution operator, and λ, β, γ are the regularization parameters. In particular, the dual mapping functions F(S^x_k, W_k) = W_k S^x_k and G(S^y_k, W_k^{-1}) = W_k^{-1} S^y_k are used to relate the sparse feature maps of X and Y over F^x and F^y. This is done by solving the two sets of least-squares terms (i.e., $\sum_{k=1}^{K} (\|S_k^y - W_k S_k^x\|_F^2 + \|S_k^x - W_k^{-1} S_k^y\|_F^2)$) with respect to the linear projections.

k F

DOTE: Dual cOnvolutional filTer lEarning

2.3

93

Optimization

Similar to classical dictionary learning methods, the objective function in Eq. (2) is not simultaneously convex with respect to the learned filter pairs, the sparse feature maps and the mapping. Instead, we divide the proposed method into three sub-problems: learning Sx , Sy , training Fx , Fy , and updating W. Computing sparse feature maps: We first initialize the filters Fx , Fy as two random matrices and the mapping W as an identity matrix, then fix them for calculating the solutions of sparse feature maps Sx , Sy . As a result, the problem of Eq. (2) can be converted into two optimization sub-problems. Unfortunately, this cannot be solved under l1 penalty without breaking rotation invariance. The resulting alternating algorithms [6] by introducing two auxiliary variables U and V enforce the constraint inherent in the splitting. In this paper, we follow [6] and solve the convolution subproblems in the Fourier domain within an ADMM optimization strategy: 2  K K K   2   1  ˆ  ˆ x ˆ x ˆy x ˆ x X − S F S S + λ U  + β − W min    k k k k 1 k k Sx 2   F k=1

2 Vkx 2

F 1, Vkx 2

k=1

k=1

T

ˆ x , Ux = Sx ∀k = {1, ..., K} , = VΦ F k k k

s.t. ≤  K K K   2   1  ˆ  ˆ y ˆy ˆx y −1 ˆ y  Y − S F S min S + λ U  + β − W     k k k k k k 1 Sy 2   F k=1

s.t.

2 Vky 2

k=1

F

≤ 1,

Vky

(3)

k=1

ˆ y , Uy = Sy ∀k = {1, ..., K} , = VΦ F k k k T

where ˆ applied to any symbol denotes its frequency representation (i.e., the Discrete Fourier Transform (DFT)); for instance, X̂ ← f(X), where f(·) is the Fourier transform operator. ⊙ represents the component-wise product, Φ^T is the inverse DFT matrix, and V projects a filter onto its small spatial support. The auxiliary variables U^x_k, U^y_k, V^x_k and V^y_k relax each of the CSC problems under the dual mapping constraint by leading to several subproblem decompositions.

Learning convolutional filters: As when solving for the sparse feature maps, the filter pairs can be learned by setting S^x, S^y and W fixed, and then learning F^x and F^y by minimizing

$\min_{F^x, F^y} \; \frac{1}{2}\Big\|\hat{X} - \sum_{k=1}^{K} \hat{F}_k^x \odot \hat{S}_k^x\Big\|_F^2 + \frac{1}{2}\Big\|\hat{Y} - \sum_{k=1}^{K} \hat{F}_k^y \odot \hat{S}_k^y\Big\|_F^2$
$\text{s.t. } \|f_k^x\|_2^2 \le 1, \; \|f_k^y\|_2^2 \le 1 \quad \forall k = \{1, ..., K\}. \qquad (4)$

Equation (4) can be solved by a one-by-one update strategy through an augmented Lagrangian method [6].

Updating mapping: With fixed F^x, F^y, S^x and S^y, we solve the following ridge regression problem for updating the mapping W:

$\min_{W} \; \sum_{k=1}^{K} \Big(\|S_k^y - W_k S_k^x\|_F^2 + \|S_k^x - W_k^{-1} S_k^y\|_F^2\Big) + \frac{\gamma}{\beta} \sum_{k=1}^{K} \|W_k\|_F^2. \qquad (5)$

In particular, the primal mapping term $\|S_k^y - W_k S_k^x\|_F^2$ constructs an intrinsic mapping, while the corresponding dual mapping term $\|S_k^x - W_k^{-1} S_k^y\|_F^2$ is utilized to give feedback and further optimize the relationship between S^x_k and S^y_k. Ideally (as the final solution), S^y_k = W_k S^x_k, such that the problem in Eq. (5) reduces to $\min_{W_k} \sum_{k=1}^{K} \|S_k^y - W_k S_k^x\|_F^2 + \frac{\gamma}{\beta} \sum_{k=1}^{K} \|W_k\|_F^2$, with the solution $W_k = S_k^y S_k^{xT} (S_k^x S_k^{xT} + \frac{\gamma}{\beta} I)^{-1}$, where I is an identity matrix. We summarize the proposed DOTE method in Algorithm 1.

Algorithm 1. DOTE algorithm
Input: Training data X and Y, parameters λ, γ, σ.
1: Initialize F^x_0, F^y_0, S^x_0, S^y_0, W_0, U^x_0, U^y_0, V^x_0, V^y_0.
2: Perform FFT: S^x_0 → Ŝ^x_0, S^y_0 → Ŝ^y_0, F^x_0 → F̂^x_0, F^y_0 → F̂^y_0, U^x_0 → Û^x_0, U^y_0 → Û^y_0, V^x_0 → V̂^x_0, V^y_0 → V̂^y_0.
3: Let Ŝ^y_0 ← W Ŝ^x_0.
4: while not converged do
5:   Solve for Ŝ^x_{k+1}, Ŝ^y_{k+1}, Û^x_{k+1} and Û^y_{k+1} using (3) with fixed filters and W_k.
6:   Train F̂^x_{k+1}, F̂^y_{k+1}, V̂^x_{k+1} and V̂^y_{k+1} by (4) with fixed feature maps and W_k.
7:   Update W_{k+1} by (5).
8:   Inverse FFT: F̂^x_{k+1} → F^x_{k+1}, F̂^y_{k+1} → F^y_{k+1}.
9: end while
Output: F^x, F^y, W.

2.4 Synthesis

Once the optimization is completed, we can obtain the learned filters F^x, F^y and the mapping W. We then apply the proposed model to synthesize images across different modalities (i.e., LR → HR and M1 → M2, respectively). Given a test image X^t, we compute the sparse feature maps S^{tx} related to F^x by solving a single CSC problem like Eq. (1):

$S^{tx} = \arg\min_{S^{tx}} \; \frac{1}{2}\Big\|X^t - \sum_{k=1}^{K} F_k^x * S_k^{tx}\Big\|_2^2 + \lambda \sum_{k=1}^{K} \|S_k^{tx}\|_1.$

After that, we can synthesize the target-modality image of X^t as the sum of the K target feature maps S^{ty}_k = W S^{tx}_k convolved with F^y_k, i.e., $Y^t = \sum_{k=1}^{K} F_k^y * S_k^{ty}$.
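The synthesis step can be sketched as follows, assuming the test feature maps S^{tx} have already been inferred by a standard CSC solver (not shown); treating each W_k as a matrix acting on the vectorised feature map is an illustrative simplification rather than the authors' implementation.

# Sketch of the final synthesis: map source feature maps with W and convolve with target filters.
import numpy as np
from scipy.signal import fftconvolve

def synthesize(feature_maps_x, filters_y, mappings_W):
    """feature_maps_x, filters_y, mappings_W: length-K lists of S^tx_k, F^y_k and W_k."""
    target = None
    for S_k, F_k, W_k in zip(feature_maps_x, filters_y, mappings_W):
        S_ty = (W_k @ S_k.reshape(-1)).reshape(S_k.shape)   # S^ty_k = W_k S^tx_k (vectorised map)
        contrib = fftconvolve(S_ty, F_k, mode='same')        # F^y_k * S^ty_k
        target = contrib if target is None else target + contrib
    return target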

3 Experimental Results

Experimental Setup: The proposed DOTE is validated on two datasets: IXI1 (including 578 256 × 256 × p MR healthy subjects) and NAMIC2 (involving 20 128 × 128 × q subjects). In our experiments, we perform 4-fold cross-validation for testing. That is, selecting 144 subjects from IXI and 5 subjects from NAMIC, respectively, as our test data. Following [1], the regularization parameters σ, λ, β, and γ are empirically set to be 1, 0.05, 0.10, 0.15, respectively. The number of filters is set as 800 according to [8]. Convergence towards primal feasible solution is proved in [6] by first converting Eq. (2) into two optimization subproblems that involve two proxies U, V and then solving them alternatively. DOTE converges after ca. 10 iterations. For the evaluation criteria, we adopt PSNR and SSIM indices to objectively assess the quality of our results. MRI Super-Resolution. As we introduced in Sect. 1, we first address image SR as one of cross-modality image synthesis. In this scenario, we investigate the T2-w images of the IXI dataset for evaluating and comparing DOTE with ScSR [1], A+ [2], NLSR [4], Zeyde [5], ANR [9], and CSC-SR [8]. Generally, LR images are generated by down-sampling HR ground-truth images using bicubic interpolation. We perform image SR with scaling factor 2, and show visual results in Fig. 2. The quantitative results are reported in Fig. 3, while the average PSNRs and SSIMs for all 144 test subjects are shown in Table 1. The proposed model achieves the best PSNRs and SSIMs. Moreover, to validate our argument that DL-based self-optimization strategy is beneficial and requires less training data, we compare DOTEnodual (removing dual mapping term) and DOTE under different training data size (i.e., 14 , 12 , 34 of the original dataset). The results are listed in Table 2. From Table 2, we see that DOTE is always better than DOTEnodual especially with few training samples.

[Fig. 2 panels: Input; Ground Truth (PSNR, SSIM); ScSR (30.71, 0.9266); Zeyde (32.52, 0.9445); NLSR (32.54, 0.9452); ANR (32.68, 0.9431); A+ (32.70, 0.9460); CSC-SR (32.76, 0.9467); DOTE-1/4 (32.92, 0.9503); DOTE-1/2 (33.66, 0.9524); DOTE (33.94, 0.9578)]

Fig. 2. Example SR results and the corresponding PSNRs and SSIMs.

Cross-Modality Synthesis. For the problem of CMS, we evaluate DOTE and the relevant algorithms on both datasets involving six groups of experiments: (1) synthesizing T2-w image from PD-w acquisition and (2) vice versa; (3) generating T1-w image from T2-w input, and (4) vice versa. We conduct (1–2) experiments on the IXI dataset, while (3–4) are explored on the NAMIC dataset. The representative and state-of-the-art CMS methods, including Vemulapalli’s 1 2

http://brain-development.org/ixi-dataset/. http://hdl.handle.net/1926/1687.

Fig. 3. Error measures of SR results on the IXI dataset.

[Fig. 4 panels: Input PD-w MRI; T2-w Ground Truth; MIMECS T2-w; DOTE T2-w]

Fig. 4. Visual comparison of synthesized results using MIMECS and DOTE.

Table 1. Quantitative evaluation: DOTE vs. other SR methods.

Avg.   ScSR     Zeyde    NLSR     ANR      A+       CSC-SR   DOTE
PSNR   29.98    33.10    33.97    35.23    35.72    36.18    37.07
SSIM   0.9265   0.9502   0.9548   0.9568   0.9600   0.9651   0.9701

Table 2. Quantitative evaluation: DOTE vs. DOTEnodual.

Avg.   DOTEnodual 1/4   DOTEnodual 1/2   DOTEnodual 3/4   DOTE 1/4   DOTE 1/2   DOTE 3/4
PSNR   31.23            33.17            36.09            36.56      36.68      37.07
SSIM   0.9354           0.9523           0.9581           0.9687     0.9690     0.9701

Fig. 5. CMS results: DOTE vs. MIMECS on the IXI dataset.

method [3] and MIMECS [10] are employed to compare with our DOTE approach. We demonstrate visual and quantitative results in Figs. 4, 5 and Table 3, respectively. Our algorithm yields the best results against MIMECS and Vemulapalli for two datasets validating our claim of being able to synthesize better results through the expanded dual optimization.

Table 3. CMS results: DOTE vs. other synthesis methods on the NAMIC dataset.

Metric (avg.)   T1 -> T2                           T2 -> T1
                MIMECS   Vemulapalli   DOTE        MIMECS   Vemulapalli   DOTE
PSNR            24.98    27.22         29.83       27.13    28.95         32.03
SSIM            0.8821   0.8981        0.9013      0.9198   0.9273        0.9301

4 Conclusion

We presented a dual convolutional filter learning (DOTE) method which directly decomposes the whole image based on CSC, such that local neighbors are preserved consistently. The proposed dual mapping functions integrated with joint learning model form a closed loop that leverages the training data more efficiently and keeps a very stable mapping between image modalities. We applied DOTE to both image SR and CMS problems. Extensive results showed that our method outperforms other state-of-the-art approaches. Future work could concentrate on extending DOTE to higher-order imaging modalities like diffusion tensor MRI and to other modalities beyond MRI. Acknowledgments. This work has been partially supported by the European Commission FP7 project VPH-DARE@IT (FP7-ICT-2011-9-601055).

References
1. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE TIP 19(11), 2861–2873 (2010)
2. Timofte, R., De Smet, V., Van Gool, L.: A+: adjusted anchored neighborhood regression for fast super-resolution. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9006, pp. 111–126. Springer, Cham (2015). doi:10.1007/978-3-319-16817-3_8
3. Vemulapalli, R., Van Nguyen, H., Zhou, S.K.: Unsupervised cross-modal synthesis of subject-specific scans. In: IEEE ICCV, pp. 630–638 (2015)
4. Rousseau, F.: Alzheimer disease neuroimaging initiative: a non-local approach for image super-resolution using intermodality priors. MIA 14(4), 594–605 (2010)
5. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse representations. In: Boissonnat, J.-D., Chenin, P., Cohen, A., Gout, C., Lyche, T., Mazure, M.-L., Schumaker, L. (eds.) Curves and Surfaces 2010. LNCS, vol. 6920, pp. 711–730. Springer, Heidelberg (2012). doi:10.1007/978-3-642-27413-8_47
6. Bristow, H., Eriksson, A., Lucey, S.: Fast convolutional sparse coding. In: IEEE CVPR, pp. 391–398 (2013)
7. He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T., Ma, W.: Dual learning for machine translation. In: NIPS, pp. 820–828 (2016)

8. Gu, S., Zuo, W., Xie, Q., Meng, D., Feng, X., Zhang, L.: Convolutional sparse coding for image super-resolution. In: IEEE ICCV, pp. 1823–1831 (2015)
9. Timofte, R., De Smet, V., Van Gool, L.: Anchored neighborhood regression for fast example-based super-resolution. In: IEEE ICCV, pp. 1920–1927 (2013)
10. Roy, S., Carass, A., Prince, J.L.: Magnetic resonance image example-based contrast synthesis. IEEE TMI 32(12), 2348–2363 (2013)

Supervised Intra-embedding of Fisher Vectors for Histopathology Image Classification

Yang Song1(B), Hang Chang2, Heng Huang3, and Weidong Cai1

1 School of Information Technologies, University of Sydney, Sydney, Australia
[email protected]
2 Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, USA
3 Computer Science and Engineering, University of Texas at Arlington, Arlington, USA

Abstract. In this paper, we present a histopathology image classification method with supervised intra-embedding of Fisher vectors. Recently in general computer vision, Fisher encoding combined with convolutional neural network (ConvNet) has become popular as a highly discriminative feature descriptor. However, Fisher vectors have two intrinsic problems that could limit their performance: high dimensionality and bursty visual elements. To address these problems, we design a novel supervised intra-embedding algorithm with a multilayer neural network model to transform the ConvNet-based Fisher vectors into a more discriminative feature representation. We apply this feature encoding method on two public datasets, including the BreaKHis image dataset of benign and malignant breast tumors, and IICBU 2008 lymphoma dataset of three malignant lymphoma subtypes. The results demonstrate that our supervised intra-embedding method helps to enhance the ConvNet-based Fisher vectors effectively, and our classification results largely outperform the state-of-the-art approaches on these datasets.

1 Introduction

Diagnosis of cancers usually relies on the visual analysis of tissue samples under the microscope. The morphological features in histopathology images provide important clues to differentiate benign and malignant tumors or to identify different cancer subtypes. To encode the morphological features for automated classification, handcrafted [2,6] and learning-based [3] feature descriptors have been designed over the years. The most recent trend in this area is the use of convolutional neural networks (ConvNets). For example, pretrained or customized ConvNet models have been applied to classify lymphoma images [5] and breast cancer images [11]. Other related applications include cell detection and segmentation [13,14]. Such ConvNet approaches often demonstrate improved performance over the more traditional techniques based on handcrafted features. Recently in the general imaging domain, a feature descriptor that integrates ConvNet with Fisher encoding has been proposed [4]. With this method, the

patch-level features of an image are first extracted using a ConvNet model, then these patch features are encoded into a Fisher vector (FV) [7] as the image-level feature descriptor. This CFV (short for ConvNet-based FV) descriptor provides significantly higher classification performance than using only the ConvNet model for texture classification and object categorization [4]. FV descriptors have also been adopted into biomedical imaging applications for histopathology image classification [3,10] and HEp-2 cell classification [15]. In these studies, the FV descriptors are generated by Fisher encoding of various types of patch-level features (other than the ConvNet model), and consistently high classification performance is reported. There are however two issues that could affect the discriminative power of FV descriptors. First, FV descriptors are high dimensional. An FV descriptor is constructed by pooling the difference vectors between the patch features and a number of Gaussian centers, and the resultant CFV descriptor can be 64K dimensional or higher. With a limited number of training images, such high dimensionality could cause overfitting and decrease the classification performance on test data. Second, FV descriptors can have bursty visual elements. This means that there can be some artificially large elements in the FV descriptor due to large repetitive patterns in the image, and such large elements would lower the contribution from the other important elements and thus affect the representativeness of the descriptor. For the first issue, dimensionality reduction with principal component analysis (PCA) and large margin distance metric learning have been explored [9,10]. To overcome the second issue, the intra-normalization technique [1] is often applied to perform L2 normalization within each block (corresponding to one Gaussian component) of the FV descriptor. In this study, we design a supervised intra-embedding method to address these two issues. We borrow the block-based normalization idea from intra-normalization, but instead of simple L2 normalization within each block, a discriminative dimension reduction algorithm based on a multilayer neural network is designed to embed each block of the CFV descriptor into a lower dimensional feature space. Also, the block-wise elements are integrated with further neural network layers and a hinge loss layer to optimize the discriminative power of the whole descriptor collectively rather than just the individual blocks. We conduct experimental evaluation on two public histopathology image datasets, including the BreaKHis dataset for classifying benign and malignant breast tumors [12], and the IICBU 2008 lymphoma dataset for classifying three malignant lymphoma subtypes [8]. We obtain improved performance over the existing approaches, and demonstrate that the proposed supervised intra-embedding method can effectively enhance the discriminative power of the CFV descriptors.

2 Methods

2.1 Fisher Vector

Fisher vector [7] is a type of feature encoding technique that aggregates the patch-level features into an image-level descriptor. With FV encoding, a

Fig. 1. Overview of the multilayer neural network design of our proposed supervised intra-embedding method. With this network, the CFV descriptor (shown as a 2K × H matrix) is transformed to a lower dimension (shown as a 2K × D2 matrix).

Gaussian mixture model (GMM) is constructed from the patch features. Then the mean first and second order difference vectors between each Gaussian center and all patch features are computed, weighted by the soft assignments. Assume that the patch-level feature is of dimension H. The first and second order difference vectors would then also each have dimension H. The final FV descriptor is a concatenation of all these difference vectors corresponding to all K Gaussian components, hence the total dimension of the FV descriptor is 2KH. The patch-level features can be extracted in different ways. In this study, we adopt the ConvNet-based method [4]. Specifically, an image is first rescaled to multiple sizes with scale factors of 2^s, s = −3, −2.5, ..., 1.5. For each rescaled image, the VGG-VD ConvNet model with 19 layers pretrained on ImageNet is applied, and the last convolutional layer produces a set of patch features with H = 512 dimensions. Then these patch features from all rescaled images are pooled together to generate the CFV descriptor of this image. Note that, from our empirical results, the effectiveness of the ImageNet-pretrained model demonstrates that although the natural images in ImageNet seem quite different from histopathology images, the intrinsic feature details could be highly similar. We generate the GMM model with K = 64 components. We find that a smaller K (e.g. 32) reduces the effectiveness of the descriptor, while a larger K (e.g. 128) increases the computational cost without notable performance improvement. The CFV descriptor is thus 2 × 64 × 512 = 65536 dimensional.
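A hedged sketch of the FV encoding summarised above is given below; it assumes a GMM with diagonal covariances fitted beforehand with scikit-learn, and it follows the standard improved Fisher vector formulas rather than the authors' exact code.

# Sketch of Fisher-vector encoding of patch features against a pre-fitted diagonal-covariance GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(patch_feats, gmm: GaussianMixture):
    """patch_feats: (n_patches, H). Returns a 2*K*H-dimensional descriptor."""
    q = gmm.predict_proba(patch_feats)                     # soft assignments, shape (n, K)
    n = patch_feats.shape[0]
    fv = []
    for k in range(gmm.n_components):
        # assumes covariance_type='diag', so covariances_[k] has shape (H,)
        diff = (patch_feats - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        w = q[:, [k]]
        first = (w * diff).sum(axis=0) / (n * np.sqrt(gmm.weights_[k]))            # 1st-order block
        second = (w * (diff ** 2 - 1)).sum(axis=0) / (n * np.sqrt(2 * gmm.weights_[k]))  # 2nd-order block
        fv.extend([first, second])
    return np.concatenate(fv)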

2.2 Supervised Intra-embedding

To reduce the dimensionality and bursty effect of CFV descriptors, we design a supervised intra-embedding method. Formally, we denote the CFV descriptor as f . The objective of the supervised intra-embedding method is to transform f to a lower dimension, and the transformed descriptor g is expected to provide good classification performance. For this, we design a multilayer neural network model, with locally connected layers for local transformation of descriptor blocks and a hinge loss layer for global optimization of the entire descriptor. The overall network structure is illustrated in Fig. 1. The input layer is the CFV descriptor of 2KH dimensions. The second layer is a locally connected

layer, which is formed by 2K filters with each filter of D1 neurons. One filter is fully connected to a descriptor block of H = 512 elements in the input layer (one descriptor block corresponds to one first or second order difference vector of 512 dimensions) and generates D1 outputs:

$f_2(i) = W_2(i) f_1(i) + b_2(i), \qquad (1)$

where W_2(i) ∈ R^{D1×H} and b_2(i) ∈ R^{D1} are the weights and bias of the i-th filter, f_1(i) ∈ R^H denotes the i-th block in the input layer, and f_2(i) ∈ R^{D1} is the i-th output at this locally connected layer. The output of this layer is a concatenation of the outputs from all filters, and is thus of 2KD1 dimensions. Note that since D1 < H, this layer reduces the dimensionality of the input descriptor. The rationale for designing a locally connected layer is that a CFV descriptor can be considered as having 2K blocks, each of H dimensions and corresponding to a first- or second-order difference vector. The localized filters can help to achieve different transformations in different blocks. This provides more locally adaptive processing than the convolutional and fully connected layers in a ConvNet, in which a filter is applied to the entire input. On the other hand, such local filters also increase the number of learning parameters significantly and could cause overfitting. To reduce the number of filter parameters, we make every four consecutive filters share the same weights and bias, so there are altogether 2K/4 unique filters.

The third layer is an intra-normalization layer, in which each output f_2(i) from the previous layer is L2 normalized. In this way, the different blocks contribute with equal weights to the transformed descriptor. Together with the locally connected layer, such local transformation provides a supervised learning-based approach to overcome the bursty visual elements. The fourth layer is a ReLU layer, and the ReLU activation is applied to the entire 2KD1-dimensional output of the third layer. Next, layers two to four are repeated as layers five to seven in the network structure, to provide another level of transformation. Assume that the individual local filters at the fifth layer have D2 neurons. The output f_7 at the seventh layer then has 2KD2 dimensions, and this is the final transformed descriptor g to be used in classification.

For the last layer, we design a hinge loss layer to impose the optimization objective of the supervised intra-embedding. Typically, FV descriptors (original or dimension-reduced) are classified using a linear-kernel support vector machine (SVM) to produce good classification results [4,10]. Therefore, to align the optimization objective in our multilayer neural network with the SVM classification objective, we choose to use an SVM formulation in the loss layer. Specifically, assume that the dataset contains L image classes. We construct a one-versus-all multi-class linear-kernel classification model to compute the hinge loss. Denote the weight vector as w_l ∈ R^{2KD2} for each class l ∈ {1, ..., L}. The overall loss value based on N input CFV descriptors (for training) is computed as:

ε = Σ_{l=1}^{L} (1/2) w_l^T w_l + C Σ_{l=1}^{L} Σ_{n=1}^{N} max(1 − w_l^T f_7^n λ_{nl}, 0)    (2)


Fig. 2. Example images of the two datasets.

where n is the index of the input descriptor, f_7^n is the corresponding output at the seventh layer, λ_{nl} = 1 if the n-th input belongs to class l and λ_{nl} = −1 otherwise, and C is the regularization parameter. Minimizing this loss value ε at the last layer mimics the margin maximization in an SVM classifier. The hinge loss layer thus effectively integrates the local transformations, so that the local filters influence each other during the optimization. During training, a dropout layer with a rate of 0.2 is added before the loss layer to provide some regularization. Also, to initialize the filter weights, we first train the filters individually, with one block of descriptors as the input layer. The transformed descriptor g at the seventh layer has 2KD2 dimensions and is classified using a linear-kernel SVM to obtain the image label. The key parameters D1 and D2 are set to 64, hence the resultant feature dimension is 2 × 64 × 64 = 8192. The regularization parameter C is set to 0.1.
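The following is a minimal PyTorch sketch of the supervised intra-embedding network and the one-versus-all hinge loss of Eq. (2), assuming 2K = 128 blocks of H = 512 dimensions and D1 = D2 = 64; the weight-sharing implementation and the training details are illustrative rather than a faithful reproduction of the original implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraEmbedding(nn.Module):
    def __init__(self, num_blocks=128, h=512, d1=64, d2=64, share=4):
        super().__init__()
        self.num_blocks, self.share = num_blocks, share
        # every `share` consecutive blocks share one filter (weights and bias)
        self.fc1 = nn.ModuleList([nn.Linear(h, d1) for _ in range(num_blocks // share)])
        self.fc2 = nn.ModuleList([nn.Linear(d1, d2) for _ in range(num_blocks // share)])

    def _local_block(self, x, filters):
        # locally connected layer + intra-normalization, then ReLU over the whole output
        outs = []
        for i in range(self.num_blocks):
            y = filters[i // self.share](x[:, i, :])    # block-wise linear transform
            outs.append(F.normalize(y, p=2, dim=1))     # L2 normalization within the block
        return F.relu(torch.stack(outs, dim=1))

    def forward(self, fv):                              # fv: (batch, 2K, H)
        g = self._local_block(fv, self.fc1)             # layers 2-4
        g = self._local_block(g, self.fc2)              # layers 5-7
        return g.flatten(1)                             # descriptor g: (batch, 2K * D2)

def one_vs_all_hinge(scores, labels, weight, C=0.1):
    # scores: (batch, L) class scores w_l^T g; labels: (batch,) integer class ids
    lam = -torch.ones_like(scores)
    lam[torch.arange(scores.size(0)), labels] = 1.0     # lambda_nl in Eq. (2)
    hinge = torch.clamp(1.0 - lam * scores, min=0.0)
    return 0.5 * (weight ** 2).sum() + C * hinge.sum()

# usage sketch: net = IntraEmbedding(); clf = nn.Linear(128 * 64, 3, bias=False)
# loss = one_vs_all_hinge(clf(net(cfv_batch)), labels, clf.weight)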

3 Results

We use two public datasets in our experimental study. (1) The BreaKHis dataset contains 7909 hematoxylin and eosin (H&E) stained microscopy images. The images are collected at four magnification factors from 82 patients with breast tumors. Each image has 700 × 460 pixels, and among the images, 2480 are benign and 5429 are malignant. The task is to classify the images into benign or malignant cases at both image level and patient level, and for the individual magnification factors. (2) The IICBU 2008 malignant lymphoma dataset contains 374 H&E stained microscopy images captured using brightfield microscopy. Each image is of 1388 × 1040 pixels and there is a large degree of staining variation among the images. The dataset contains 113 chronic lymphocytic leukemia (CLL), 139 follicular lymphoma (FL), and 122 mantle cell lymphoma (MCL) cases. The task is to classify these three subtypes of malignant lymphoma. Figure 2 shows example images from the two datasets.

For training and testing, the BreaKHis dataset releases five splits for cross validation, and each split contains 70% of the images as training data and 30% as testing data. Images of the same patient are partitioned into either the training or the testing set only. We use the same five splits in our study. For the IICBU dataset,


we perform four-fold cross validation, with 3/4 of the data for training and 1/4 for testing in each split. For the supervised intra-embedding, 50 epochs are trained on the BreaKHis dataset and 100 epochs are trained on the IICBU dataset. We compared our results with the state-of-the-art approaches reported on these datasets, including [11,12] on the BreaKHis dataset: [12] with a set of customized features, and [11] with a domain-specific ConvNet model; and [5,10] on the IICBU dataset: [10] with a SIFT-based FV descriptor and distance metric-based dimensionality reduction, and [5] with a combination of handcrafted and ConvNet features. In addition, we also experimented with the more standard way of using VGG-VD, which is to use the 4096-dimensional output from the last fully connected layer as the feature descriptor. Classifying the CFV descriptor without supervised intra-embedding is also evaluated. We also compared with the more standard techniques to address the issues with the FV descriptor: PCA for dimensionality reduction, and intra-normalization for bursty visual elements.

Table 1. The classification accuracies (%) on the BreaKHis breast tumor dataset.

                Method       40×          100×         200×         400×
Patient-level   [12]         81.6 ± 3.0   79.9 ± 5.4   85.1 ± 3.1   82.3 ± 3.8
                [11] rand    88.6 ± 5.6   84.5 ± 2.4   83.3 ± 3.4   81.7 ± 4.9
                [11] max     90.0 ± 6.7   88.4 ± 4.8   84.6 ± 4.2   86.1 ± 6.2
                VGG-VD       86.9 ± 5.2   85.4 ± 5.7   85.2 ± 4.4   85.7 ± 8.8
                CFV          90.0 ± 5.8   88.5 ± 6.1   85.4 ± 5.0   86.0 ± 8.0
                PCA          90.0 ± 5.8   88.5 ± 6.1   85.4 ± 5.0   86.2 ± 8.0
                Intra-norm   90.0 ± 5.8   87.7 ± 5.7   86.6 ± 5.8   86.2 ± 6.9
                Our method   90.2 ± 3.2   91.2 ± 4.4   87.8 ± 5.3   87.4 ± 7.2
Image-level     [11] rand    89.6 ± 6.5   85.0 ± 4.8   82.8 ± 2.1   80.2 ± 3.4
                [11] max     85.6 ± 4.8   83.5 ± 3.9   82.7 ± 1.7   80.7 ± 2.9
                VGG-VD       80.9 ± 1.6   81.1 ± 3.0   82.2 ± 1.9   80.2 ± 3.8
                CFV          86.8 ± 2.5   85.6 ± 3.8   83.8 ± 2.5   81.6 ± 4.4
                PCA          87.3 ± 2.5   86.1 ± 3.8   83.8 ± 2.5   82.0 ± 4.4
                Intra-norm   87.3 ± 2.7   86.4 ± 4.1   83.9 ± 2.6   82.3 ± 4.3
                Our method   87.7 ± 2.4   87.6 ± 3.9   86.5 ± 2.4   83.9 ± 3.6

For the BreaKHis dataset, the state-of-the-art [11] shows that random sampling of 1000 patches of 64 × 64 pixels provided the best overall result and max pooling of four different ConvNet models produced further enhancement. We thus include the results from these two techniques ([11] rand, and [11] max) in the comparison, as shown in Table 1. The patient-level classification is derived by majority voting of the image-level results. The original approach [12] reported


the patient-level results only. Our method achieved consistently better performance than [11,12] except for the image-level classification of images with 40× magnification. It can be seen that while the random sampling approach [11] produced the best image-level classification for the 40× magnification, its patient-level results for all magnification factors were lower than those of the max pooling approach [11] and our method. Overall, the results illustrate that using the CFV feature representation with supervised intra-embedding can outperform the handcrafted features [12] and ConvNet models [11] that are specifically designed for this particular histopathology dataset.

Table 2. The classification accuracies (%) on the IICBU 2008 lymphoma dataset.

[10]   [5]    VGG-VD       CFV          PCA          Intra-norm   Our method
93.3   95.5   73.0 ± 3.1   93.9 ± 3.0   94.2 ± 3.1   94.5 ± 3.2   96.5 ± 2.7

Table 2 shows the result comparison on the IICBU dataset. It can be seen that our method achieved the best result. CFV outperforms [10], indicating the advantage of using ConvNet-based patch features rather than SIFT in FV encoding, even when discriminative dimension reduction is also included in [10]. In addition, similar to our method, [5] also involves an ImageNet-pretrained ConvNet model but with additional steps for segmentation, handcrafted feature extraction, and ensemble classification. The advantage of our method over [5] illustrates the benefit of FV encoding and multilayer neural network-based dimensionality reduction. Also, compared to [5], our method is less complicated without the need to segment the cellular objects. On both datasets, it can be seen that our method outperformed the approach with CFV descriptor only, demonstrating the benefit of supervised intra-embedding. When comparing CFV with VGG-VD (the 4096-dimensional descriptor), the advantage of using FV encoding of ConvNet-based patch features is evident. In addition, it can be seen that PCA and intra-normalization generally improved the results compared to using the original CFV descriptor. This indicates that it is beneficial to reduce the feature dimension and bursty effect of the CFV descriptor, and in general, intra-normalization has a greater effect on the classification performance than PCA. Our method provided larger improvement over CFV compared to PCA and intra-normalization. This shows that our learning-based transformation could provide a more effective approach to address the two issues with CFV.

4 Conclusions

A histopathology image classification method is presented in this paper. We encode the image content with a ConvNet-based FV descriptor. To further improve the discriminative power of the descriptor, we design a supervised intra-embedding method to transform the descriptor to a lower dimension and reduce


the bursty effect. Our results on the BreaKHis breast tumor and IICBU 2008 lymphoma datasets show that our method provides consistent improvement over the state of the art on these datasets. For future study, we will investigate extending our method to conduct tile-based classification for whole-slide images.

References 1. Arandjelovic, R., Zisserman, A.: All about VLAD. In: CVPR, pp. 1578–1585 (2013) 2. Barker, J., et al.: Automated classification of brain tumor type in whole-slide digital pathology images using local representative tiles. Med. Image Anal. 30(1), 60–71 (2016) 3. BenTaieb, A., Li-Chang, H., Huntsman, D., Hamarneh, G.: Automatic diagnosis of ovarian carcinomas via sparse multiresolution tissue representation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 629–636. Springer, Cham (2015). doi:10.1007/978-3-319-24553-9 77 4. Cimpoi, M., et al.: Deep filter banks for texture recognition, description, and segmentation. Int. J. Compt. Vis. 118(1), 65–94 (2016) 5. Codella, N., et al.: Lymphoma diagnosis in histopathology using a multi-stage visual learning approach. In: SPIE, p. 97910H (2016) 6. Kandemir, M., Zhang, C., Hamprecht, F.A.: Empowering multiple instance histopathology cancer diagnosis by cell graphs. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 228–235. Springer, Cham (2014). doi:10.1007/978-3-319-10470-6 29 7. Perronnin, F., S´ anchez, J., Mensink, T.: Improving the Fisher kernel for large-scale image classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010). doi:10.1007/ 978-3-642-15561-1 11 8. Shamir, L., et al.: IICBU 2008: a proposed benchmark suite for biological image analysis. Med. Biol. Eng. Comput. 46(9), 943–947 (2008) 9. Simonyan, K., et al.: Fisher vector faces in the wild. In: BMVC, pp. 1–12 (2013) 10. Song, Y., Li, Q., Huang, H., Feng, D., Chen, M., Cai, W.: Histopathology image categorization with discriminative dimension reduction of Fisher vectors. In: Hua, G., J´egou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 306–317. Springer, Cham (2016). doi:10.1007/978-3-319-46604-0 22 11. Spanhol, F., et al.: Breast cancer histopathological image classification using convolutional neural networks. In: IJCNN, pp. 1–8 (2016) 12. Spanhol, F., et al.: A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng 63(7), 1455–1462 (2016) 13. Wang, J., MacKenzie, J.D., Ramachandran, R., Chen, D.Z.: Neutrophils identification by deep learning and voronoi diagram of clusters. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 226–233. Springer, Cham (2015). doi:10.1007/978-3-319-24574-4 27 14. Xing, F., et al.: An automatic learning-based framework for robust nucleus segmentation. IEEE Trans. Med. Imag. 35(2), 550–566 (2016) 15. Xu, X., Lin, F., Ng, C., Leong, K.P.: Adaptive co-occurrence differential texton space for HEp-2 cells classification. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 260–267. Springer, Cham (2015). doi:10.1007/978-3-319-24574-4 31

GSplit LBI: Taming the Procedural Bias in Neuroimaging for Disease Prediction Xinwei Sun1 , Lingjing Hu2(B) , Yuan Yao3,4(B) , and Yizhou Wang5 1

School of Mathematical Science, Peking University, Beijing 100871, China Yanjing Medical College, Capital Medical University, Beijing 101300, China [email protected] 3 Hong Kong University of Science and Technology, Hong Kong, Hong Kong 4 Peking University, Beijing, China [email protected] 5 National Engineering Laboratory for Video Technology, Key Laboratory of Machine Perception, School of EECS, Peking University, Beijing 100871, China

2

Abstract. In voxel-based neuroimage analysis, lesion features have been the main focus in disease prediction due to their interpretability with respect to the related diseases. However, we observe that there exists another type of feature introduced during the preprocessing steps, which we call "Procedural Bias". Moreover, such bias can be leveraged to improve classification accuracy. Nevertheless, most existing models suffer either from under-fitting by not considering procedural bias or from poor interpretability by not differentiating such bias from lesion features. In this paper, a novel dual-task algorithm namely GSplit LBI is proposed to resolve this problem. By introducing an augmented variable enforced to be structurally sparse via a variable splitting term, the estimators for prediction and for selecting lesion features can be optimized separately and mutually monitored by each other following an iterative scheme. Empirical experiments have been conducted on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The advantage of the proposed model is verified by improved stability of selected lesion features and better classification results. Keywords: Voxel-based structural magnetic resonance imaging · Procedural bias · Split Linearized Bregman Iteration · Feature selection

1 Introduction

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-66179-7_13) contains supplementary material, which is available to authorized users.

Usually, the first step of voxel-based neuroimage analysis requires preprocessing the T1-weighted image, such as segmentation and registration of grey matter (GM), white matter (WM) and cerebral spinal fluid (CSF). However, some


systematic biases due to scanner differences, different populations, etc. can be introduced in this pipeline [2]. Some of them can be helpful for discriminating subjects from normal controls (NC), but may not be directly related to the disease. For example, in structural magnetic resonance imaging (sMRI) images of subjects with Alzheimer's disease (AD), after spatial normalization during the simultaneous registration of GM, WM and CSF, the GM voxels surrounding the lateral ventricle and subarachnoid space etc. may be mistakenly enlarged, caused by the enlargement of the CSF space in those locations [2] compared to the normal template, as shown in Fig. 1. Although these voxels/features are highly correlated with the disease, they cannot be regarded as lesion features in an interpretable model. In this paper we refer to them as "Procedural Bias", which should be identified but is neglected in the literature. We observe that it can be harnessed in voxel-based image analysis to improve the prediction of disease.

Fig. 1. The overlapped voxels among top 150 negative value voxels in each fold of βpre at the time corresponding to the best average prediction result in the path of GSplit LBI using 10-fold cross-validation. For subjects with AD, they represent enlarged GM voxels surrounding lateral ventricle, subarachnoid space, edge of gyrus, etc.

Together with the procedural bias, the lesion features are vital for the prediction and lesion region analysis tasks, which are commonly solved by two types of regularization models. Specifically, one kind of model, such as general losses with an l2 penalty, elastic net [13] and graphnet [5], selects strongly correlated features to minimize the classification error. However, such models do not differentiate between features introduced by the disease and those introduced by procedural bias, and may also introduce redundant features. Hence, the interpretability of such models is poor and the models are prone to over-fitting. The other kind of model, with sparsity enforcement, such as TV-L1 (a combination of Total Variation [9] and L1) and particularly n2GFL [12], enforces a strong disease prior on the model parameters in order to capture the lesion features. Although such features are disease-relevant and the selection is stable, these models ignore the inevitable procedural bias and hence lose some prediction power. To incorporate both tasks of prediction and selection of lesion features, we propose an iterative dual-task algorithm, namely Generalized Split LBI (GSplit LBI), which can have better model selection consistency than the generalized lasso [11]. Specifically, through the introduction of a variable splitting term inspired by Split LBI [6], two estimators are introduced and split apart. One estimator is for prediction and the other is for selecting lesion features, both of which can be pursued separately with a gap control. Following an iterative scheme, they will


be mutually monitored by each other: the estimator for selecting lesion features is gradually monitored to pursue stable lesion features; on the other hand, the estimator for prediction is also monitored to exploit both the procedural bias and lesion features to improve prediction. To show the validity of the proposed method, we successfully apply our model to voxel-based sMRI analysis for AD, which is challenging and attracts increasing attention.

2 Method

2.1 GSplit LBI Algorithm

Our dataset consists of N samples {x_i, y_i}_{i=1}^N, where x_i ∈ R^p collects the i-th subject's neuroimaging data with p voxels and y_i ∈ {±1} indicates the disease status (−1 for Alzheimer's disease in this paper). X ∈ R^{N×p} and y ∈ R^N are the concatenations of {x_i}_i and {y_i}_i. Consider a general linear model to predict the disease status (with the intercept parameter β0 ∈ R),

log P(y_i = 1 | x_i) − log P(y_i = −1 | x_i) = x_i^T βpre + β0.    (2.1)

A desired estimator βpre ∈ R^p should not only fit the data by maximizing the log-likelihood in logistic regression, but also satisfy the following types of structural sparsity: (1) the number of voxels involved in the disease prediction is small, so βpre is sparse; (2) the voxel activities should be geometrically clustered or 3D-smooth, suggesting a TV-type sparsity on D_G βpre, where D_G is a graph difference operator¹; (3) the degenerate GM voxels in AD are captured by the nonnegative components of βpre. However, the existing procedural bias may violate these a priori sparsity properties, especially the third one, yet increase the prediction power. To overcome this issue, we adopt a variable splitting idea from [6] by introducing an auxiliary variable γ ∈ R^{|V|+|E|} to achieve these sparsity requirements separately, while controlling the gap from Dβpre with the penalty Sρ(βpre, γ) := ||Dβpre − γ||_2^2 = ||βpre − γ_V||_2^2 + ||ρ D_G βpre − γ_G||_2^2, where γ = [γ_V^T, γ_G^T]^T and D = [I^T, ρ D_G^T]^T. Here ρ controls the trade-off between the different types of sparsity. Our purpose is thus two-fold: (1) use βpre for prediction; (2) enforce sparsity on γ. Such a dual-task scheme is illustrated in Fig. 2. To implement it, we generalize the Split Linearized Bregman Iteration (Split LBI) algorithm in [6] to our setting with generalized linear models (GLM) and the three types of structural sparsity above, hence the name Generalized Split LBI (or GSplit LBI). Algorithm 1 describes the procedure with a new loss:

ℓ(β0, βpre, γ; {x_i, y_i}_1^N, ν) := ℓ(β0, βpre; {x_i, y_i}_1^N) + (1/(2ν)) Sρ(βpre, γ),    (2.2)

where ℓ(β0, βpre; {x_i, y_i}_1^N) is the negative log-likelihood function of the GLM and ν > 0 tunes the strength of the gap control.

¹ Here D_G : R^V → R^E denotes a graph difference operator on G = (V, E), where V is the node set of voxels and E is the edge set of voxel pairs in a neighbourhood (e.g. 3-by-3-by-3), such that D_G(β)(i, j) := β(i) − β(j).
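For illustration, a small sketch of how such a graph difference operator can be assembled as a sparse matrix for voxels on a regular 3D grid is given below; it uses simple 6-connectivity and a toy grid size, whereas the paper uses a 3-by-3-by-3 neighbourhood over 2,527 voxels.

import numpy as np
from scipy.sparse import lil_matrix

def graph_difference_operator(shape):
    # one row per edge (i, j) of a 6-connected 3D grid, with entries +1 at i and -1 at j
    idx = np.arange(np.prod(shape)).reshape(shape)
    edges = []
    for axis in range(3):                                 # pair each voxel with its next neighbour
        a = np.moveaxis(idx, axis, 0)
        edges += list(zip(a[:-1].ravel(), a[1:].ravel()))
    D = lil_matrix((len(edges), int(np.prod(shape))))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = 1.0, -1.0                      # D_G(beta)(i, j) = beta(i) - beta(j)
    return D.tocsr()

D_G = graph_difference_operator((10, 10, 10))             # toy grid; the paper uses 2,527 voxels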


Fig. 2. Illustration of GSplit LBI. The gap between βpre for fitting data and γ for sparsity is controlled by Sρ(βpre, γ). The estimate βles, as a projection of βpre onto the support set of γ, can be used for stable lesion feature analysis when ν → 0 (Sect. 3.2). When ν ≫ 0 (Sect. 3.1), with an appropriately large value, βpre can be used for prediction by capturing both lesion features and procedural bias.

Algorithm 1. GSplit LBI
1: Input: Loss function ℓ(β0, βpre, γ; {x_i, y_i}_{i=1}^N, ν), parameters ν, ρ, κ, α > 0.
2: Initialize: k = 0, t^k = 0, β0^k = 0, βpre^k = 0, βles^k = 0, γ_V^k = 0_p, γ_G^k = 0_m, z_V^k = 0_p, z_G^k = 0_m and S^k := supp(γ^k) = ∅.
3: Iteration
4: β0^{k+1} = β0^k − κα ∇_{β0} ℓ(β0^k, βpre^k, γ^k; {x_i, y_i}_1^N, ν)
5: βpre^{k+1} = βpre^k − κα ∇_{βpre} ℓ(β0^k, βpre^k, γ^k; {x_i, y_i}_1^N, ν)
6: z^{k+1} = z^k − α ∇_γ ℓ(β0^k, βpre^k, γ^k; {x_i, y_i}_1^N, ν)
7: γ_V^{k+1} = κ · S⁺(z_V^{k+1}, 1), where S⁺(x, 1) = max(x − 1, 0)
8: γ_G^{k+1} = κ · S(z_G^{k+1}, 1), where S(x, 1) = sign(x) · max(|x| − 1, 0)
9: βles^{k+1} = P_{S^{k+1}} βpre^{k+1}, where P_S = P_{ker(D_{S^c})} = I − D_{S^c}^† D_{S^c}
10: t^{k+1} = (k + 1)α
11: Output: {β0^k, βpre^k, βles^k, γ^k}, where γ^{k+1} = [γ_V^{k+1}; γ_G^{k+1}] and z^{k+1} = [z_V^{k+1}; z_G^{k+1}].

The algorithm returns a sequence of estimates {β0^k, βpre^k, γ^k, βles^k}_{k≥0} as a regularization path. In particular, γ^k exhibits a variety of sparsity levels and βpre^k is generically dense, with different prediction powers. The projection of βpre^k onto the subspace with the same support as γ^k gives the estimate βles^k, which satisfies the a priori sparsity properties (sparse, 3D-smooth, nonnegative) and is hence regarded as the interpretable lesion features for AD. The remainder of this projection is heavily influenced by procedural bias; in this paper, the non-zero elements in βpre^k which are negative (−1 denotes the disease label) with comparably large magnitude are identified as procedural bias, while others with tiny values can be treated as nuisance or weak features. In summary, βles only selects lesion features, while βpre also captures the additional procedural bias. Hence, these two kinds of features can be differentiated, as illustrated in Fig. 2.
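A compact numpy sketch of Algorithm 1 for the logistic loss is given below; the variable names mirror the paper, but the data, the dense handling of D, and the iteration count are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy.special import expit

def gsplit_lbi(X, y, D_G, nu=0.2, rho=1.0, kappa=10.0, n_iter=500):
    # X: (N, p) voxel data, y: (N,) labels in {-1, +1}, D_G: dense (m, p) graph difference matrix
    n, p = X.shape
    m = D_G.shape[0]
    D = np.vstack([np.eye(p), rho * D_G])                       # D = [I; rho * D_G]
    alpha = nu / (kappa * (1 + nu * np.linalg.norm(X, 2) ** 2
                           + np.linalg.norm(D, 2) ** 2))        # step size, cf. footnote 3
    beta0, beta_pre = 0.0, np.zeros(p)
    gamma, z = np.zeros(p + m), np.zeros(p + m)
    path = []
    for k in range(n_iter):
        # gradients of the loss (2.2): logistic negative log-likelihood + gap penalty
        w = -y * expit(-y * (X @ beta_pre + beta0))             # per-sample margin derivative
        gap = D @ beta_pre - gamma
        beta0 -= kappa * alpha * w.mean()
        beta_pre -= kappa * alpha * (X.T @ w / n + D.T @ gap / nu)
        z -= alpha * (-gap / nu)
        gamma[:p] = kappa * np.maximum(z[:p] - 1, 0)                           # S+ (nonnegative part)
        gamma[p:] = kappa * np.sign(z[p:]) * np.maximum(np.abs(z[p:]) - 1, 0)  # S (signed part)
        S = np.flatnonzero(gamma)                                # support of gamma
        D_Sc = np.delete(D, S, axis=0)
        beta_les = beta_pre - np.linalg.pinv(D_Sc) @ (D_Sc @ beta_pre)   # projection P_S beta_pre
        path.append((beta0, beta_pre.copy(), beta_les, gamma.copy()))
    return path

# toy usage, e.g. with the operator from the previous sketch converted to dense:
# D_dense = graph_difference_operator((5, 5, 5)).toarray()
# path = gsplit_lbi(np.random.randn(40, 125), np.random.choice([-1, 1], 40), D_dense, n_iter=50)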

2.2 Setting the Parameters

The stopping time t^k (line 10) serves as the regularization parameter, which can be determined via cross-validation to minimize the prediction error [7]. The parameter ρ is a trade-off between geometric clustering and voxel sparsity. The parameters κ and α are the damping factor and the step size, which should satisfy κα ≤ ν/(1 + νΛ_H + Λ_D²) to ensure the stability of the iterations. Here Λ(·) denotes the largest singular value of a matrix and H denotes the Hessian matrix of ℓ(β0, βpre; {x_i, y_i}_1^N). The parameter ν balances the prediction task and the sparsity enforcement in feature selection. In this paper, it is task-dependent, as shown in Fig. 2. For the prediction of disease, βpre with an appropriately larger value of ν may increase the prediction power by harnessing both lesion features and procedural bias. For lesion feature analysis, βles with a small value of ν helps to enhance the stability of feature selection. For details please refer to the supplementary information.

3 Experimental Results

We apply our model to AD/NC classification (namely ADNC) and MCI (Mild Cognitive Impairment)/NC (namely MCINC) classification, which are two fundamental challenges in the diagnosis of AD. The data are obtained from the ADNI² database, which is split into 1.5 T and 3.0 T (namely 15 and 30) MRI scan magnetic field strength datasets. The 15 dataset contains 64 AD, 208 MCI and 90 NC subjects, while the 30 dataset contains 66 AD and 110 NC subjects. The DARTEL VBM pipeline [1] is then implemented to preprocess the data. Finally, the input features consist of 2,527 voxels of 8 × 8 × 8 mm³ size with average values in the GM population template greater than 0.1. Experiments are designed on the 15ADNC, 30ADNC and 15MCINC tasks.

3.1 Prediction and Path Analysis

10-fold cross-validation is adopted for classification evaluation. Under exactly the same experimental setup, comparison is made between GSplit LBI and other classifiers: SVM, MLDA (univariate model via t-test + LDA) [3], Graphnet [5], Lasso [10], Elastic Net, TV+L1 and n2GFL. For each model, optimal parameters are determined by grid search. For GSplit LBI, ρ is chosen from {1, 2, ..., 10}, κ is set to 10, and α = ν/κ(1 + νΛ_X² + Λ_D²)³; specifically, ν is set to 0.2 (corresponding to ν ≫ 0 in Fig. 2)⁴. The regularization coefficient λ ranges over {0, 0.05, 0.1, ..., 0.95, 1, 10, 10²} for lasso⁵ and 2^{−20,−19,...,0,...,20} for SVM. For the other models, parameters are optimized from λ : {0.05, 0.1, ..., 0.95, 1, 10, 10²} and ρ : {0.5, 1, ..., 10} (in addition, the mixture parameter α : {0, 0.05, ..., 0.95} for Elastic Net).

² http://adni.loni.ucla.edu.
³ For the logit model, α < ν/κ(1 + νΛ_H² + νΛ_X²) since Λ_X > Λ_H.
⁴ In this experiment, comparable prediction results are obtained for ν ∈ (0.1, 10).
⁵ λ = 0 corresponds to the logistic regression model.

Table 1. Comparison of GSplit LBI with other models.

           MLDA     SVM      Lasso    Graphnet  Elastic net  TV + l1  n2GFL    GSplit LBI (βpre)
15ADNC     85.06%   83.12%   87.01%   86.36%    88.31%       83.77%   86.36%   88.96%
30ADNC     86.93%   87.50%   87.50%   88.64%    89.20%       87.50%   87.50%   90.91%
15MCINC    61.41%   70.13%   69.80%   72.15%    70.13%       73.83%   69.80%   75.17%

The best accuracy in the path of GSplit LBI and the best accuracy of each counterpart are reported. Table 1 shows that the βpre of our model outperforms all other models in all cases. Note that although our accuracies may not be superior to models using multi-modality data [8], they are the state-of-the-art results for the sMRI modality alone.

Fig. 3. Left image: Accuracy of (βpre, βles) vs. log t (t: regularization parameter). Right image: Six 2-d brain slice images of the degenerative voxels selected by βles and βpre, shown in order at {t1, ..., t6}. As t grows, βpre and βles identify similar lesion features.

The process of feature selection, combined with the prediction accuracy, can be analyzed along the path. The result of 30ADNC is used as an illustration in Fig. 3. We can see that βpre (blue curve) outperforms βles (red curve) over the whole path, owing to the additional procedural bias captured by βpre. Specifically, at βpre's highest accuracy (t5), βpre yields a more than 8% increase in prediction accuracy over βles. Early stopping regularization at t5 is desired, as βpre converges to βles in prediction accuracy and overfits when t grows further. Recall that positive (negative) features represent degenerate (enlarged) voxels. In each fold of βpre at t5, the commonly selected voxels among the top 150 negative (enlargement) voxels are identified as procedural bias, shown in Fig. 1, where most of these GM voxels are enlarged and located near the lateral ventricle or subarachnoid space etc., possibly due to the enlargement of the CSF space in those locations, and thus differ from the lesion features.

3.2 Lesion Features Analysis

To quantitatively evaluate the stability of the selected lesion features, the multi-set Dice Coefficient (mDC)⁶ [4,12] is applied as a measurement. The 30ADNC task is again used as an example: the mDC is computed for the βles that achieves the highest accuracy in 10-fold cross-validation. As shown in Table 2, when ν = 0.0002 (corresponding to ν → 0 in Fig. 2), the βles of our model obtains more stable lesion feature selection than the other models, with comparable prediction power. Besides, the average number of selected features (line 3 in Table 2) is also recorded. Note that although elastic net has slightly higher accuracy than βles, it selects many more features than necessary.

Table 2. mDC comparison between GSplit LBI and other models.

                         Lasso    Elastic Net  Graphnet  TV + l1  n2GFL    GSplit LBI (βles)
Accuracy                 87.50%   89.20%       88.64%    87.50%   87.50%   88.64%
mDC                      0.1992   0.5631       0.6005    0.5824   0.5362   0.7805
Σ_{k=1}^{10}|S(k)|/10    50.2     777.8        832.6     712.6    443.9    129.4
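For reference, the mDC of Table 2 (defined in footnote 6 below) can be computed from the per-fold support sets with a few lines of Python; `supports` is a hypothetical list of the ten selected-voxel index sets.

def multi_set_dice(supports):
    # supports: list of sets of selected voxel indices, one per cross-validation fold
    k = len(supports)
    common = set.intersection(*supports)                  # voxels selected in every fold
    return k * len(common) / sum(len(s) for s in supports)

# e.g. supports = [set(np.flatnonzero(beta_les_fold)) for beta_les_fold in per_fold_estimates]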

The selected lesion features are also meaningful: as shown in Fig. 4(a)–(c), they are located in the hippocampus, parahippocampal gyrus and medial temporal lobe etc., which are believed to be regions damaged early in AD patients.


Fig. 4. (a)–(c): Stability of selected lesion features of βles shown in 2-d 110 slice brain images when ν = 0.0002. (a)–(b): Results of fold 2 and fold 10. (c): The overlapped features in 10 folds. (d): The 2-d slice brain image of selected voxels with 2 × 2 × 2 mm3 using coarse-to-fine approach.

To further investigate the locus of the lesion features, we conduct a coarse-to-fine experiment. Specifically, we project the selected overlapped voxels of 8 × 8 × 8 mm³ size (shown in Fig. 4(c)) onto the MRI image with finer-scale voxels, i.e. of 2 × 2 × 2 mm³ size. In total, 4,895 voxels serve as input

⁶ In [12], mDC := 10 |∩_{k=1}^{10} S(k)| / Σ_{k=1}^{10} |S(k)|, where S(k) denotes the support set of βles in the k-th fold.


features after projection. Again, the GSplit LBI is implemented using 10-fold cross-validation. The prediction accuracy of βpre is 90.34% and on average 446.6 voxels are selected by βles . As desired, these voxels belong to parts of lesion regions, such as those located in hippocampal tail, as shown in Fig. 4(d).

4 Conclusions

In this paper, a novel iterative dual-task algorithm is proposed to incorporate both disease prediction and lesion feature selection in neuroimage analysis. With a variable splitting term, the estimators for prediction and for selecting lesion features can be pursued separately and mutually monitored under a gap control. The gap here is dominated by procedural bias, i.e. specific features crucial for prediction yet ignored in a priori disease knowledge. With experimental studies conducted on the 15ADNC, 30ADNC and 15MCINC tasks, we have shown that leveraging the procedural bias can lead to significant improvements in both prediction and model interpretability. In future work, we shall extend our model to other neuroimaging applications including multi-modality data. Acknowledgements. This work was supported in part by 973-2015CB351800, 2015CB85600, 2012CB825501, NSFC-61625201, 61370004, 11421110001 and the Scientific Research Common Program of Beijing Municipal Commission of Education (No. KM201610025013).

References 1. Ashburner, J.: A fast diffeomorphic image registration algorithm. Neuroimage 38(1), 95–113 (2007) 2. Ashburner, J., Friston, K.J.: Why voxel-based morphometry should be used. Neuroimage 14(6), 1238–1243 (2001) 3. Dai, Z., Yan, C., Wang, Z., Wang, J., Xia, M., Li, K., He, Y.: Discriminative analysis of early alzheimer’s disease using multi-modal imaging and multi-level characterization with multi-classifier. Neuroimage 59(3), 2187–2195 (2012) 4. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945) 5. Grosenick, L., Klingenberg, B., Katovich, K., Knutson, B., Taylor, J.E.: Interpretable whole-brain prediction analysis with graphnet. Neuroimage 72, 304–321 (2013) 6. Huang, C., Sun, X., Xiong, J., Yao, Y.: Split lbi: An iterative regularization path with structural sparsity. In: Advances In Neural Information Processing Systems, pp. 3369–3377 (2016) 7. Osher, S., Ruan, F., Xiong, J., Yao, Y., Yin, W.: Sparse recovery via differential inclusions. Appl. Comput. Harmonic Anal. 41(2), 436–469 (2016) 8. Peng, J., An, L., Zhu, X., Jin, Y., Shen, D.: Structured sparse kernel learning for imaging genetics based alzheimer’s disease diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 70–78 (2016)


9. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992) 10. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B (Methodol.) 58, 267–288 (1996) 11. Tibshirani, R.J., Taylor, J.E., Candes, E.J., Hastie, T.: The solution path of the generalized lasso. Ann. Stat. 39(3), 1335–1371 (2011) 12. Xin, B., Hu, L., Wang, Y., Gao, W.: Stable feature selection from brain smri. In: AAAI, pp. 1910–1916 (2014) 13. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B (Statistical Methodology) 67(2), 301–320 (2005)

MRI-Based Surgical Planning for Lumbar Spinal Stenosis Gabriele Abbati1 , Stefan Bauer2 , Sebastian Winklhofer3 , Peter J. Sch¨ uffler4 , 5 5 5 Ulrike Held , Jakob M. Burgstaller , Johann Steurer , and Joachim M. Buhmann2(B) 1

Department of Engineering Science, University of Oxford, Oxford, UK [email protected] 2 Department of Computer Science, ETH Z¨ urich, Z¨ urich, Switzerland {stefan.bauer,jbuhmann}@inf.ethz.ch 3 Neuroradiology, University Hospital Z¨ urich, Z¨ urich, Switzerland [email protected] 4 Computational Pathology, Memorial Sloan Kettering Cancer Center, New York, USA [email protected] 5 Horten Centre for Patient Oriented Research and Knowledge Transfer, University of Z¨ urich, Z¨ urich, Switzerland {ulrike.held,jakob.burgstaller,johann.steurer}@usz.ch

Abstract. The most common reason for spinal surgery in elderly patients is lumbar spinal stenosis (LSS). For LSS, treatment decisions based on clinical and radiological information as well as the personal experience of the surgeon show large variance. Thus a standardized support system is of high value for a more objective and reproducible decision. In this work, we develop an automated algorithm to localize the stenosis causing the symptoms of the patient in magnetic resonance imaging (MRI). With 22 MRI features for each of five spinal levels of 321 patients, we show that it is possible to predict the location of the lesion triggering the symptoms. To support this hypothesis, we conduct an automated analysis of labeled and unlabeled MRI scans extracted from 788 patients. We confirm quantitatively the importance of radiological information and provide an algorithmic pipeline for working with raw MRI scans. Both code and data are provided for further research at www.spinalstenosis.ethz.ch.

Keywords: Machine learning · Deep learning · Lumbar spinal stenosis

1 Introduction

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-66179-7_14) contains supplementary material, which is available to authorized users.

Fig. 1. Examples of T2-weighted MRI. (a) The five segments are highlighted yellow in a sagittal scan. (b) Axial scan of a patient without symptoms and without narrowing of the spinal canal (white spot in the center). (c) Example with extreme narrowing.

The lumbar spine consists of the five vertebrae (levels or segments) L1–L5. The vertebral discs connect adjacent levels and are denoted as L1/L2, L2/L3, L3/L4, L4/L5 and L5/S1, where S1 is the first vertebra of the underlying sacral region (see Fig. 1). Lumbar spinal stenosis (LSS) is the most common indication for spine surgery in patients older than 65 years [1]. The North American Spine Society defines LSS as "[...] diminished space available for the neural and vascular elements in the lumbar spine secondary to degenerative changes in the spinal canal [...]" [2]. Symptoms such as gluteal and/or lower extremity pain and/or fatigue might occur, possibly associated with back pain. Magnetic resonance imaging (MRI, illustrated in Figs. 1(b) and (c)) and the patient's clinical course contribute to diagnosis and treatment formulation. When conservative treatments such as physiotherapy or steroid injections fail, decompression surgery is frequently indicated [1]. Depending on the clinical presentation of the patient and the corresponding imaging findings, surgeons decide which segments to operate on. This decision process exhibits wide variability [3,4], while the associations between imaging and symptoms are still not entirely clear [5]. These issues motivate the search for objective methods to support surgery planning.

Since the definition of LSS implies anatomic abnormalities, MRI plays a fundamental role in diagnosis [6]. Andreisek et al. [7] identified 27 radiological criteria and parameters for LSS. However, the correlation between imaging procedures, clinical findings and symptoms is still unclear, and research efforts show contradictory results [8,9].

This paper comprehensively determines the important role of radiological parameters in LSS surgery planning, in particular by modeling surgical decision-making: to the best of our knowledge, no machine learning approach has been applied in this direction before. In Sect. 2, we automatically predict surgery locations from 22 manually scored radiological features, comparing five different classifiers. We obtain accuracies of 85.4% using random forests and show that features associated with stenosis are commonly chosen by all classifiers. In Sect. 3, the highly heterogeneous MRI dataset is preprocessed and a convolutional neural network and a convolutional autoencoder are trained to accomplish the same task as before, without any knowledge of the underlying structure of LSS. The automatic preprocessing of raw MRI scans is a key contribution of this work, and code with examples will be released in the final version. Both algorithms achieve accuracies of 69.8% and 70.6%, respectively, in mimicking surgeons' decisions, showing the


high relevance of radiological features in LSS treatment. Finally, we conclude with a discussion in Sect. 4.

2 Surgical Prediction from Numerical Dataset

The Numerical Dataset. Radiological T1-weighted and T2-weighted scans from 788 LSS patients have been collected in a multi-center study by the Horten Zentrum (Zürich, CH). For every segment and patient, radiologists manually scored 6 quantitative features (e.g. the area of the spinal canal in mm²) and 16 qualitative features (e.g. the severity grade of compromise of a given vertebral region) known to be most relevant for assessing stenosis (a subset of the ones identified in [7]), forming the "numerical" dataset. Note that only one reading per image is available. A description of the features can be found in the Supplement (A.1). 431 of the 788 patients underwent surgery. The Numeric Rating Scale (NRS) [10] for pain assessment was employed to determine whether the intervention improved a patient's condition. NRS differences larger than 2 points between before and six months after surgery were considered an improvement, and a failure otherwise. In total, 321 of the 431 patients exhibited an improvement of NRS after surgery. As there is no information gain from unsuccessful operations, the following analysis addresses the subset of the 321 improved patients, yielding a total of 1385 segments as data points.

Methods. We consider every segment independently as a data vector x consisting of its 22 feature values. The target is represented by a binary variable y (to operate / not to operate). This binary classification framework is tackled with the following algorithms: K-nearest neighbors (KNN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machine (SVM), and random forest (RF). Implementations from the scikit-learn library [11] are employed. The area under the receiver operating characteristic (ROC) curve is a natural choice for evaluating binary classifiers' performance, and it is combined with 20-fold cross validation. To evaluate the influence of individual features, forward selection and backward selection are employed to choose the best 3, 5, 8, 10, 12, 15 and 18 features: with 5 different classifiers, a single feature can be chosen up to 70 times (7 sets × 5 classifiers × 2 algorithms). Thus we can evaluate how often a feature is considered to be among the most relevant ones for surgery prediction. This procedure is again validated through 20-fold cross validation.

Results. For the parameter-optimized binary classifiers, box plots describing the area under the ROC curve (AUC) obtained with 20-fold cross validation are shown in Fig. 2(a). The best results are achieved with an optimized random forest classifier: the mean AUC returned by the cross validation is 85.4%, with a standard deviation of 3.26%. The precision obtained here is particularly significant if we consider the relatively low agreement rates between doctors in determining treatments for LSS [12,13]. Feature selection identifies


SegCentralZone (which assesses the compromise of the central zone of the vertebra), SegCSarea (the area of the section of the spinal cord in mm²) and SegFluidSign (the relation of fluid to the cauda equina) as the most important features for assessing stenosis: these are chosen in 88.57%, 87.14% and 70.00%, respectively, of the total trials with the feature selection algorithms (full ranking in Fig. 2(b)). All three features are known to be strongly related to spinal stenosis [7]. The results show that radiological data actually help in assessing LSS and planning surgical treatments.
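The segment-level experiment described above can be reproduced in outline with scikit-learn; the following sketch assumes a feature matrix X of shape (1385, 22) and binary surgery labels y (here random placeholders), and uses illustrative rather than the paper's tuned classifier settings.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
X = rng.normal(size=(1385, 22))          # placeholder for the 22 scored features per segment
y = rng.integers(0, 2, size=1385)        # placeholder binary surgery labels

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)
classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.3f} +/- {auc.std():.3f}")

# forward selection of the 5 most informative features for one classifier
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=200, random_state=0),
    n_features_to_select=5, direction="forward", cv=5, scoring="roc_auc")
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))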


Fig. 2. Summary of the classifiers for segmental surgery prediction. (a) The box plots of the 20-fold cross validation. All classifiers show a strong signal between radiological data and surgical treatments. (b) Feature ranking as described in the text. The three most important features are SegCentralZone, SegCSarea and SegFluidSign.

3 Surgical Prediction from Radiological Images

Fully automated MRI-based surgery planning would be a helpful tool, as it can substantially speed up the process by skipping manual scoring while reducing the variability of human assessment. Therefore, we aim to directly learn features from raw MRI scans.

The Image Dataset. The dataset of 788 LSS patients described above contains a great variety of T1-weighted and T2-weighted sagittal, coronal and axial series scans (see Fig. 3 for four typical examples). Since the images come from seven different institutions, the dataset is heterogeneous: not all types of MRI scans listed above are always available, and often only a small subset of the segments is accessible. Further, different machines vary in resolution (from 320 × 320 to 1024 × 1024 pixels) and scanning frequency (0.2 to 1 scan/mm). To keep the same segment-wise approach as before, we decide to employ only the T2-weighted axial scans (e.g. Fig. 3(c)), as they picture the whole lumbar spine and can easily be chopped into single-segment sub-series. T2-weighted imaging pictures the spinal canal white, in contrast to T1-weighted images, in


Fig. 3. Typical examples of the different MRI scans: (a)–(c) T2-weighted (a, sagittal; b, coronal; c, axial), (d) T1-weighted axial.

which the canal is dark and hardly visible (Fig. 3(d)). Further, T2-weighted axial scans are the most common series in the dataset. The image dataset includes the same 321 operated patients with improved NRS.

Image Preprocessing & Data Augmentation. All images are cropped and resized to 128 × 128 pixels, in order to keep the central section. Because of the various scanning frequencies, we then linearly interpolate to a desired number of equally spaced slices: to sufficiently describe the vertebral disc, yet keep the data structure simple, we use four sub-images for each segment. We employ the following data augmentation: rotation by a random angle α ∈ [−10°, 10°]; sagittal mirroring; inversion of the order of the slices (since the MRI machine can scan upwards or downwards); application of random Gaussian noise (zero mean and 5% standard deviation); and random brightness alteration (maximum alteration of 5%). Each image is augmented 20 times by this pipeline; each time, every augmentation technique is randomly applied or not applied.

Methods. Deep learning algorithms have already shown great success in a variety of image recognition problems. Convolutional neural networks (CNN, implementation details can be found in [14]) are image processing algorithms that are able to extract image features regardless of their position, which is especially useful in our case since the scans are not always optimally centered on the spine. Due to the small sample size, a simple architecture is needed to prevent overfitting. Our CNN has the following structure: a first convolutional layer (filter size 5 × 5, 128 masks), followed by a max-pooling layer; a second convolutional layer (filter size 5 × 5, 64 masks), followed by a max-pooling layer; a fully connected layer with 2048 nodes; and a further fully connected layer with 1024 nodes. Rectified linear units (ReLU) are a common choice for this kind of network. The network structure is illustrated in Fig. 4, step 3. The cost function minimized during training is the mean of the softmax cross-entropy between the output x and the actual label vector z, L = −z log σ(x) − (1 − z) log[1 − σ(x)], where σ(x) is the softmax function. The optimizer used for the minimization is AdaGrad [15]. Implementation is done in Python using TensorFlow [16].

The major inherent limitation of this approach is the need for labeled examples. We learn from 1576 labeled scanned segments from 321 successfully operated


patients. On the other hand, if we were able to include unlabeled segments in the analysis, we could take advantage of all 4031 segments from the 788 patients. Unsupervised learning methods do not need labeled examples. The autoencoder algorithm [18] is used to reduce the dimensionality of the problem: it consists of an encoder function h = f(x) and a decoder function r = g(h). The autoencoder is trained to copy the input to the output, but it is not given the resources to do so exactly (undercompleteness property). In this way an approximation of the input is returned and the model is forced to prioritize the most relevant aspects of the input. As the autoencoder does not need labels for the surgery, all 4031 segments can be used. An autoencoder sufficient for our needs can be built by mirroring the CNN and learning how to "invert" the convolutional and max-pooling layers [17] into deconvolutional layers: a first convolutional layer (filter size 5 × 5, 128 masks), followed by a max-pooling layer; a second convolutional layer (filter size 5 × 5, 64 masks), followed by a max-pooling layer; a fully connected layer with 1024 nodes; a fully connected layer with 128 nodes (the bottleneck); a fully connected layer with 1024 nodes; a first unpooling and deconvolutional layer (filter size 5 × 5, 64 masks); and a second deconvolutional layer (filter size 5 × 5, 128 masks). This autoencoder reconstructs the original 3D image, and in the middle layer (the bottleneck) we find a 128-number code that describes each image sufficiently for its reconstruction. We train the autoencoder on all unlabeled images to minimize the squared difference J = (Xorig − Xreconstr)², where Xorig is the original image and Xreconstr is its reconstruction. After training, the autoencoder is used to encode all labeled images, and their 128-number codes are used as features in the same classification experiments as in Sect. 2.
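The two architectures described above can be sketched as follows; this is a hypothetical PyTorch rendering (the original implementation used TensorFlow), treating each segment as a 4-channel 128 × 128 input, with layer sizes taken from the text but other details (padding, transposed-convolution upsampling) chosen for illustration.

import torch
import torch.nn as nn

class SegmentCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 32 * 32, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, n_classes),              # logits for the cross-entropy loss
        )

    def forward(self, x):                            # x: (batch, 4, 128, 128)
        return self.classifier(self.features(x))

class SegmentAutoencoder(nn.Module):
    def __init__(self, code_size=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 128, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 32 * 32, 1024), nn.ReLU(),
            nn.Linear(1024, code_size),              # 128-number bottleneck code
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 1024), nn.ReLU(),
            nn.Linear(1024, 64 * 32 * 32), nn.ReLU(),
            nn.Unflatten(1, (64, 32, 32)),
            nn.ConvTranspose2d(64, 128, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(128, 4, 2, stride=2),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code              # reconstruction + feature code

Training would minimize nn.CrossEntropyLoss on the CNN logits and the mean squared reconstruction error on the autoencoder output, e.g. with torch.optim.Adagrad, mirroring the AdaGrad choice described in the text.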

Fig. 4. Proposed computing pipeline from preprocessing of raw MRI pictures to learning of surgical planning

Results. The complete pipeline from the MRI preprocessing to the surgery classification is depicted in Fig. 4. For both the CNN and the autoencoder, the available image datasets are split into training and test sets with an 80/20 ratio. The training sets are augmented as previously described and the networks are trained for 100


Fig. 5. Image reconstruction examples by the autoencoder. (a), (c): 2 out of 4 slices of the original 3D image. (b), (d): Reconstructed image slices.

epochs. Learning curves are available in the Supplement (A.2). On the test set, the CNN reaches an AUC of 69.8%. This is significantly lower than the AUC obtained with the numerical dataset, but it still confirms the existence of a signal in the MRI images, and reinforces the idea that radiological data are linked to stenosis diagnosis and treatment. Considering the small size of the training data, we are confident that higher precision can be obtained if the present dataset is improved and expanded. The autoencoder learns successfully to reconstruct the images (Fig. 5). While some details are missed, it is noticeable that the dimensionality of a picture is now extremely reduced, from 128 × 128 × 4 = 65536 numbers to 128. When training and testing the binary classifiers from Sect. 2 with the codes from the labeled segments, the highest mean AUC in a 20-fold cross-validation test is given by an optimized LDA classifier, at 70.6%, with a corresponding standard deviation of 6.69%. The mild improvement can be explained by the extension of the dataset to the non-labeled segments.

4 Discussion

While the influence of MRI scans on surgical decisions for LSS was previously unclear, our results quantitatively confirm the importance of medical imaging in LSS diagnosis and treatment planning. We started by effectively modeling surgical decision-making for lumbar spinal stenosis with binary classifiers, on the sole basis of manually assessed radiological features. To reduce human bias and errors in the selection and calculation of features, we developed an automatic pipeline (Fig. 4) to work on raw MRI scans. To the best of our knowledge, these are the first steps towards benchmarking LSS. Supervised (CNN) and semi-supervised (convolutional autoencoder) deep learning algorithms were trained on the transformed images, and accuracies around 70% on surgical planning were achieved. Compared to the results with the numerical dataset, the difference in accuracy (of about 15%) can be explained by the modest number of MRI scans. We are confident that further systematic efforts aimed at enlarging the image catalog could significantly improve the classification results and thus patient outcomes.


Acknowledgments. This research was partially supported by the Max Planck ETH Center for Learning Systems, the SystemsX.ch project SignalX, the Baugarten Foundation, the Helmut Horten Foundation, the Pfizer-Foundation for geriatrics & research in geriatrics, the Symphasis Charitable Foundation, the OPO Foundation, NIH/NCI Cancer Center Support Grant P30 CA008748 and an Oxford - Google DeepMind scholarship.

References 1. Deyo, R.A.: Treatment of lumbar spinal stenosis: a balancing act. Spine J. 10, 625–627 (2010) 2. Kreiner, S., Shaffer, W.O., Baisden, J., Gilbert, T., et al.: Evidence-based clinical guidelines for multidisciplinary spine care diagnosis and treatment of degenerative lumbar spinal stenosis. North Am. Spine Soc. (2014) 3. Weinstein, J.N., Lurie, J.D., Olson, P.R., Bronner, K.K., Fisher, E.S., United States’ trends, regional variations in lumbar spine surgery: 1992–2003. Spine 31, 2707–2714 (2006) 4. Irwin, Z.N., Hilibrand, A., Gustavel, M., McLain, R., et al.: Variation in surgical decision making for degenerative spinal disorders. Part I: lumbar spine. Spine 30, 2208–2213 (2005) 5. Jensen, M.C., Brant-Zawadzki, M.N., Obuchowski, N., Modic, M.T., et al.: Magnetic resonance imaging of the lumbar spine in people without back pain. N. Engl. J. Med. 331, 69–73 (1994) 6. Steurer, J., Roner, S., Gnannt, R., Hodler, J.: Quantitative radiologic criteria for the diagnosis of lumbar spinal stenosis: a systematic literature review. BMC Musculoskelet Disord. 12, 175 (2011) 7. Andreisek, G., Deyo, R.A., Jarvik, J.G., Porchet, F., et al.: LSOS working group and others, Consensus conference on core radiological parameters to describe lumbar stenosis - an initiative for structured reporting. Eur. Radiol. 24, 3224–3232 (2014) 8. Haig, A.J., Tong, H.C., Yamakawa, K.S., et al.: Spinal stenosis, back pain, or no symptoms at all? a masked study comparing radiologic and electrodiagnostic diagnoses to the clinical impression. Arch. Phys. Med. Rehabil. 87, 897–903 (2006) 9. Ishimoto, Y., Yoshimura, N., Muraki, S., et al.: Associations between radiographic lumbar spinal stenosis and clinical symptoms in the general population: the Wakayama Spine Study. Osteoarthr. Cartil. 21, 783–788 (2013) 10. Downie, W.W., Leatham, P.A., Rhind, V.M., Wright, V., et al.: Studies with pain rating scales. Ann. Rheum. Dis. 37, 378–381 (1978) 11. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 12. Lurie, J.D., Tosteson, A.N., Tosteson, T.D., Carragee, E., et al.: Reliability of readings of magnetic resonance imaging features of lumbar spinal stenosis. Spine 33, 1605–1610 (2008) 13. Fu, M.C., Buerba, R.A., Long, W.D., Blizzard, D.J., et al.: Interrater and intrarater agreements of magnetic resonance imaging findings in the lumbar spine: significant variability across degenerative conditions. Spine J. 14, 2442–2448 (2014) 14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998) 15. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)


16. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint (2016) 17. Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks. In: IEEE Conference on CVPR, pp. 2528–2535 (2010) 18. Zemel, R.S.: Autoencoders, minimum description length and Helmholtz free energy. NIPS (1994)

Pattern Visualization and Recognition Using Tensor Factorization for Early Differential Diagnosis of Parkinsonism

Rui Li1,6, Ping Wu2, Igor Yakushev3, Jian Wang4, Sibylle I. Ziegler3, Stefan Förster3, Sung-Cheng Huang5, Markus Schwaiger3, Nassir Navab1, Chuantao Zuo2, and Kuangyu Shi3(B)

1 Department of Computer Science, Technische Universität München, Munich, Germany
2 Huashan Hospital, PET Center, Fudan University, Shanghai, China
3 Department of Nuclear Medicine, Technische Universität München, Munich, Germany
  [email protected]
4 Department of Neurology, Huashan Hospital, Fudan University, Shanghai, China
5 Department of Molecular and Medical Pharmacology, University of California, Los Angeles, USA
6 Alibaba Cloud, Hangzhou, China

Abstract. Idiopathic Parkinson's disease (PD) and atypical parkinsonian syndromes may have similar symptoms at the early disease stage. Pattern recognition on metabolic imaging has been confirmed to be of distinct value in the early differential diagnosis of parkinsonism. However, the principal component analysis (PCA) based method ends up with a unique probability score for each disease pattern. This restricts the exploration of heterogeneous characteristic features for differentiation, and there is no visualization of the underlying mechanism to assist the radiologist/neurologist either. We propose a tensor factorization based method to extract the characteristic patterns of the diseases. By decomposing the 3D data, we can capture its intrinsic characteristic patterns. In particular, the disease-related patterns can be visualized individually for inspection by physicians. A test on PET images of 206 early parkinsonian patients confirmed differential patterns on the feature images visualized with the proposed method. Computer-aided diagnosis based on a multi-class support vector machine (SVM) showed improved diagnostic accuracy for parkinsonism using the tensor-factorized feature images compared to the state-of-the-art PCA-based scores [Tang et al., Lancet Neurol. 2010].

1 Introduction

Atypical parkinsonian syndromes, including multiple system atrophy (MSA) and progressive supranuclear palsy (PSP), present very similar clinical symptoms to idiopathic Parkinson's disease (PD), especially in their early stages [10].


It has been reported that approximately 20%–30% of PD patients were misdiagnosed [11]. This diagnostic error has significant consequences for clinical patient care and causes inadequate treatment [6]. Positron emission tomography (PET) detects abnormal functional alterations [2,7,12,17] long before structural damage to the brain tissue is present [3,12,18,20]. Although 18F-FDG PET is effective in the diagnosis of parkinsonism by visualizing the brain's glucose metabolism [8], the complex spatial abnormalities make the exploration of its differential potential challenging. Sophisticated pattern recognition has been developed to boost the performance of early differential diagnosis of parkinsonism [5,16]. Specific metabolic patterns of PD, MSA and PSP were extracted using principal component analysis (PCA). For each pattern, a score is derived to describe the probability that the PET images of an individual subject present this pattern. These pattern scores have been found to be surrogates that accurately discriminate between the different types of parkinsonian syndromes at an early stage. However, the unique probability score of each disease pattern provides limited information on the heterogeneity of the underlying pathophysiological abnormalities, leading to a bottleneck for further improving the diagnostic accuracy. Furthermore, there is no visualization of the characteristic features, which restricts the possibility of diagnostic inspection and approval by radiologists/neurologists. In contrast to the vector-based dimension reduction of PCA, tensor factorization can directly represent the data by a two-dimensional matrix and provides a powerful tool for factor analysis. Various successful applications have been reported. One work [15] extracts features from a tensor by higher-order discriminant analysis: the class-discriminant information in the high-dimensional data is captured by the within-class and between-class scatter matrices, which can be seen as an extension of the well-known linear discriminant analysis (LDA). An error analysis [4] of the tensor decomposition was proposed to provide error bounds, and the experiments showed improved performance on video compression and classification. Another work [13] suggested tensor factorization for context-aware collaborative filtering. A recent study [14] applied tensor factorization to find meaningful latent variables that predict brain activity with competitive accuracy. In this paper, we propose a tensor factorization based method to extract the characteristic patterns of PD, MSA and PSP. This is achieved by decomposing the 3D data into 2D planes containing the determinant information. The pattern-related features can then be represented in the 2D visual space and thus visualized individually for inspection by physicians. The method was tested on 18F-FDG PET images of 206 patients with suspected parkinsonism. The computer-aided diagnosis on the derived 2D feature images was compared with that on the state-of-the-art PCA-based pattern scores [16].

2 Methods

2.1 Introduction to Tensor Factorization

In this study, we denote the imaging data as an order-3 tensor T ∈ R^{I×J×K}, where I, J, K are the dimensions and i, j, k index positions along I, J and K, respectively. Thus, T = [t_{ijk}], i = 1…I, j = 1…J, k = 1…K, where t_{ijk} is the entry of the tensor at position (i, j, k). The CANDECOMP/PARAFAC (CP) decomposition [9] (referred to as tensor factorization, TF, in this work) is employed to decompose the tensor in this study [19]. For an order-3 tensor T, TF factorizes T into three component bases (factor matrices A, B and C), whose columns are often constrained to unit length and associated with a weight vector λ = [λ_1, …, λ_R]. These rank-one components re-express the original tensor. Thus, the factorized form is:

T ≈ Σ_{r=1}^{R} λ_r (a_r ∘ b_r ∘ c_r),    (1)

in which the component bases are A = [a_1, a_2, …, a_R], B = [b_1, b_2, …, b_R] and C = [c_1, c_2, …, c_R], the weight vector is λ = [λ_1, λ_2, …, λ_R], ∘ denotes the outer product, and R is the rank. After the CP tensor factorization, the factorized model can be compactly expressed as T ≈ {λ, A, B, C}. By reducing the 3D tensor to a tensor with smaller dimensions (such as a 2D matrix), we can visualize a 3D image by a 2D representation. As a higher-dimensional extension of the singular value decomposition (SVD), TF can be solved by CP-ALS (alternating least squares). The idea of CP-ALS is to minimize the least-squares term

min_{A,B,C} ‖T − Σ_{r=1}^{R} λ_r (a_r ∘ b_r ∘ c_r)‖,

where ‖·‖ is the vector 2-norm. We used the Matlab tensor toolbox [1] implementation in the experiments.
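For readers who want to experiment with this factorization step, the following is a minimal sketch using the open-source tensorly library in Python (the authors used the Matlab tensor toolbox); the tensor shape and rank here are illustrative assumptions, not values from the paper.

```python
import numpy as np
from tensorly.decomposition import parafac

# Illustrative 3D "image" tensor (e.g., 95 x 79 x 69 voxels); random data here.
T = np.random.rand(95, 79, 69)

# CP/PARAFAC decomposition with rank R; normalize_factors=True returns
# unit-norm factor columns plus a separate weight vector lambda, as in Eq. (1).
R = 10
cp = parafac(T, rank=R, normalize_factors=True)
weights, factors = cp
A, B, C = factors

# T is approximated by sum_r weights[r] * outer(A[:, r], B[:, r], C[:, r]).
approx = np.einsum('r,ir,jr,kr->ijk', weights, A, B, C)
print('relative reconstruction error:',
      np.linalg.norm(T - approx) / np.linalg.norm(T))
```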

Fig. 1. (a) Illustration of vector, matrix, tensor and CP decomposition. (b) Demonstration of factor matrices and associated weights with R = 10 in Eq. 1.

In Algorithm 1, the data reduction T ⇒ T′ is possible because the model M_T can be expressed as the product of the decomposed factor matrices, which share a common dimension of size R. (The rank R was chosen by gradually reducing the rank until the minimal weight λ was not smaller than 5% of the maximum weight after decomposition.) Furthermore, we can select the top-m components out of the R components to reduce the data. Thus, reducing the data by using the top-m bases of the factor matrices is mathematically feasible.
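As a rough illustration of this rank-selection rule, the loop below decreases the rank until the smallest CP weight is at least 5% of the largest; it is an assumption-based sketch using tensorly's parafac rather than the authors' Matlab toolbox, and the starting rank is arbitrary.

```python
from tensorly.decomposition import parafac

def select_rank(T, r_max=30, ratio=0.05):
    """Decrease the CP rank until min(weights) >= ratio * max(weights)."""
    weights = None
    for rank in range(r_max, 0, -1):
        weights, _ = parafac(T, rank=rank, normalize_factors=True)
        if weights.min() >= ratio * weights.max():
            return rank, weights
    return 1, weights
```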


Algorithm 1: Learning features in high dimensional data via tensor factorization
  Inputs: tensor data T ∈ R^{I×J×K}, rank R, top-m components (m ≤ R)
  Outputs: a feature vector for each 3D image
  1. Model training: train a tensor model M_T with a given rank R for each class (MSA, PSP and PD), respectively.
  2. Deriving features: given an image, use M_T to reduce the data along the second mode J, such that T ∈ R^{I×J×K} ⇒ T′ ∈ R^{I×m×K}, where m ≤ R ≤ J.
  3. Flatten T′ into a one-dimensional vector of size I·m·K that is used as a feature vector to train a classifier such as a support vector machine (SVM).

Fig. 2. Demonstration of feature learning via tensor factorization. The 2D image shows the third (middle) slice from the reduced image in R^{95×m×69}. m is the number of selected top-m basis components from the model M_T. In this study, we set m = 1 for the purpose of visualization, although the proposed method is valid for any value of m.

In fact, we can shrink the data along any mode. In this study, the reduction was performed along the second mode J, corresponding to the sagittal plane. Figure 2 depicts the feature learning process using the proposed tensor factorization (TF) method. In our experiments, we train a TF model for each disease type (MSA, PSP and PD) using the training images, resulting in three models denoted as M_T^MSA, M_T^PSP and M_T^PD. Given a test image (a newly arriving subject), we just need to project the data onto the established factorized bases to derive the respective feature vectors (illustrated as the right-hand-side images in Fig. 2), which are then concatenated into a one-dimensional feature vector representing the test image. This is similar to the state-of-the-art method, where new data only need to be projected onto the established PCA bases. Thus, an image is characterized by the three trained models (factorized bases). Finally, a classifier can be trained on the derived feature vectors to arrive at a predictive model.

2.2 Application of Tensor Factorization to 3D Images

In Fig. 2, the model M_T is trained on the training data. M_T consists of the factor matrices (A, B, C), which are the keys to performing the data reduction. To reduce an image of size 95 × 79 × 69 along the second dimension (79), we iterate over the third dimension (69).


Fig. 3. Illustration of feature images of PD, MSA and PSP. 4 representative feature images and 1 atypical feature image for each disease are displayed.

In each iteration, a 2D slice of size 95 × 79 is selected and multiplied with the second factor matrix of size 79 × m. The resulting matrix is of size 95 × m, as shown in Fig. 2. After 69 iterations, a reduced 3D image (95 × m × 69) is generated. The reduced 3D data is concatenated to form a feature vector representing the image. One may also iterate over the first dimension to perform the reduction, as long as the matrix dimensions are compatible. Two additional points must be stated. First, it is possible to further reduce the image to an even smaller size along other dimensions. Second, it is also possible to reduce the image along any dimension rather than only the second one. In this work, we chose the second dimension (sagittal plane) because it gives the view covering the maximum information on the structures of interest after 3D data reduction by tensor factorization, with m set to one (the second dimension is reduced to one).
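A minimal sketch of this slice-wise reduction, assuming a trained second-mode factor matrix B of size 79 × R from the CP model (function and array names are illustrative, not from the paper):

```python
import numpy as np

def reduce_along_second_mode(image, B, m=1):
    """Project a 3D image (I x J x K) onto the top-m columns of the
    second-mode factor matrix B (J x R), slice by slice along mode 3,
    and flatten the result into a feature vector of size I * m * K."""
    I, J, K = image.shape
    B_m = B[:, :m]                                # J x m, top-m basis components
    reduced = np.empty((I, m, K))
    for k in range(K):                            # iterate over the third dimension
        reduced[:, :, k] = image[:, :, k] @ B_m   # (I x J) @ (J x m) -> (I x m)
    return reduced.ravel()

# Hypothetical usage: one feature vector per disease-specific model,
# concatenated as described in the text.
# features = np.concatenate([reduce_along_second_mode(img, B_msa),
#                            reduce_along_second_mode(img, B_psp),
#                            reduce_along_second_mode(img, B_pd)])
```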

3 Experiments and Results

206 patients with suspected parkinsonian clinical features underwent 18F-FDG PET imaging. After the imaging, these patients were assessed by blinded movement-disorder specialists for a mean of 2.1 years before a final clinical diagnosis of PD (n = 136), MSA (n = 40) or PSP (n = 30) was made. PET images were normalized by the global mean and then spatially normalized to Montreal Neurological Institute (MNI) space using SPM8 (Statistical Parametric Mapping, http://www.fil.ion.ucl.ac.uk/spm/software/spm8/). A Gaussian kernel (size 8 × 8 × 8 mm) was applied to smooth the PET images, and the mean of each image was subtracted from the image. A group of 20 PD, 20 MSA and 20 PSP images was randomly selected to generate the mean images for tensor factorization. The factorization algorithm generated a set of base images (factor matrices). Afterwards, the PET image of each patient was projected onto the factorized bases to generate 2D feature images. These feature images represent the characteristic patterns and were displayed for visual inspection.

For most patients, the clinical diagnosis of PD, MSA or PSP could be visually differentiated. Figure 3 shows 4 representative feature images of PD, MSA and PSP. For representative PD patterns, no difference between the frontal, parietal and occipital lobes was observed, and visible cerebellum and striatum activities could be seen. For MSA, vanishing cerebellum activity was observed, and there were also reduced activities in the striatum on the pattern images. For representative PSP, decreasing striatum activities were observed while cerebellum activities were still visible, and there were observable declining activities in the frontal lobe. However, these typical findings do not represent all the images. Overall, 13 (9.6%) PD, 5 (12.5%) MSA and 6 (20%) PSP cases were found to be ambiguous on visual inspection. Examples of atypical pattern images are illustrated in Fig. 3.


Fig. 4. Comparing the computer-aided diagnosis results on PCA-based pattern score [16] and the proposed tensor factorized feature images.

In addition to visual inspection, a multi-class SVM was applied to the feature images. A linear kernel was used, with a grid search for parameter optimization. The grid search considers only the penalty parameter C of the linear SVM, selecting the value of C yielding the best classification result on the training data; after the best value of C was found, it was applied to the test data. A 10-fold cross-validation was applied, and the tensor factorization and cross-validation were repeated 256 times. The results were compared with the conventional PCA-based pattern scores using the same cross-validation setting. Figure 4 displays the comparison of the proposed tensor factorization and the pattern scores in terms of sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). Paired t-tests over the 256 repetitions show that the visual representations of parkinsonian patterns lead to a significant improvement compared to the pattern scores (p < 1.8 × 10⁻⁵⁹ for all comparisons).
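A minimal sketch of this classification step with scikit-learn; the feature matrix, labels and the exact C grid below are placeholders and assumptions, not values reported in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# X: one concatenated TF feature vector per patient, y: diagnosis label.
X = np.random.rand(206, 500)                           # placeholder features
y = np.random.choice(['PD', 'MSA', 'PSP'], size=206)   # placeholder labels

# Linear multi-class SVM; grid search over the penalty parameter C only.
grid = GridSearchCV(SVC(kernel='linear'),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                    cv=10)
scores = cross_val_score(grid, X, y, cv=10)            # 10-fold cross-validation
print('mean accuracy: %.3f' % scores.mean())
```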

4 Discussions and Conclusion

This paper proposed a new pattern representation method using tensor factorization, which allows both visual inspection and computer-aided diagnosis. The visualization of the derived feature images demonstrates different representative patterns for PD, MSA and PSP. This provides the potential for physicians to


inspect the diagnosis in the reading room. The improved results of the computer-aided diagnosis using the tensor-factorized patterns over the PCA-based scores confirm that the new method can capture more characteristic features for differential diagnosis. In this study, only one sagittal representation image was chosen for the visualization. This is based on the consideration that it may give a view covering the maximum information on the reported characteristic anatomical structures of parkinsonism, such as the striatum, cerebellum and brainstem [16]. The generation of more anatomical planes and an increase in the number of feature images can further increase the performance of the SVM-based computer-aided diagnosis: a test including 5 feature images (m = 5) improved the specificity by 2.7% for PD and by 0.5% for MSA overall. However, the increased number of images may place an additional burden on the neurologist to resolve the critical information in the diagnosis. Further clinical tests need to be made to find the optimal number of pattern images, considering both the feasibility of visual inspection and the accuracy of computer-aided diagnosis. Furthermore, the pattern images derived after tensor factorization have limited anatomical correspondence, and the visual inspection may differ from conventional anatomy-guided diagnosis. Special training of the physicians is necessary to make it feasible in clinical practice. Nevertheless, the evolving applicability of PCA-based pattern scores after a series of international clinical trials provides a good example for the clinical translation of the proposed concepts. Considering the high challenge of early differential diagnosis of parkinsonism, the exploration of more characteristic features for both visual and computer-aided diagnosis may change the state of the art of parkinsonism management.

Acknowledgement. The methodological development is based on the funding from the German Research Foundation (DFG) Collaborative Research Centre 824 (SFB824). The international cooperation was supported by the Sino-German Institute for Brain Molecular Imaging and Clinical Translation.

References

1. Brett W.B., Tamara, G.K., et al.: Matlab tensor toolbox version 2.6, February 2015
2. Bi, L., Kim, J., Feng, D., Fulham, M.: Multi-stage thresholded region classification for whole-body PET-CT lymphoma studies. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 569–576. Springer, Cham (2014). doi:10.1007/978-3-319-10404-1_71
3. Chen, K., Langbaum, J.B., Fleisher, A.S., Ayutyanont, N., Reschke, C., Lee, W., Liu, X., Bandy, D., Alexander, G.E., Thompson, P.M., Foster, N.L., Harvey, D.J., de Leon, M.J., Koeppe, R.A., Jagust, W.J., Weiner, M.W., Reiman, E.M.: Twelve-month metabolic declines in probable Alzheimer's disease and amnestic mild cognitive impairment assessed using an empirically pre-defined statistical region-of-interest: findings from the Alzheimer's Disease Neuroimaging Initiative. Neuroimage 51(2), 654–664 (2010)


4. Huang, H., Ding, C., Luo, D.J.: Tensor reduction error analysis - applications to video compression and classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2008)
5. Eidelberg, D.: Metabolic brain networks in neurodegenerative disorders: a functional imaging approach. Trends Neurosci. 32(10), 548–557 (2009)
6. Fahn, S., Oakes, D., Shoulson, I., Kieburtz, K., Rudolph, A., Lang, A., Olanow, C.W., Tanner, C., Marek, K.: Levodopa and the progression of Parkinson's disease. New Engl. J. Med. 351(24), 2498–2508 (2004)
7. Gao, F., Liu, H., Shi, P.: Patient-adaptive lesion metabolism analysis by dynamic PET images. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7512, pp. 558–565. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33454-2_69
8. Hellwig, S., Frings, L., Amtage, F., Buchert, R., Spehl, T.S., Rijntjes, M., Tuscher, O., Weiller, C., Weber, W.A., Vach, W., Meyer, P.T.: 18F-FDG PET is an early predictor of overall survival in suspected atypical parkinsonism. J. Nucl. Med. 56(10), 1541–1546 (2015)
9. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. J. Math. Phys. 6(1), 164–189 (1927)
10. Hughes, A.J., Ben-Shlomo, Y., Daniel, S.E., Lees, A.J.: What features improve the accuracy of clinical diagnosis in Parkinson's disease: a clinicopathologic study. Neurology 57(10 Suppl 3), S34–S38 (2001)
11. Hughes, A.J., Daniel, S.E., Ben-Shlomo, Y., Lees, A.J.: The accuracy of diagnosis of parkinsonian syndromes in a specialist movement disorder service. Brain 125(Pt 4), 861–870 (2002)
12. Jiao, J., Searle, G.E., Tziortzi, A.C., Salinas, C.A., Gunn, R.N., Schnabel, J.A.: Spatio-temporal pharmacokinetic model based registration of 4D PET neuroimaging data. NeuroImage 84, 225–235 (2014)
13. Amatriain, X., Baltrunas, L., Karatzoglou, A., Oliver, N.: Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In: Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 79–86 (2010)
14. Mitchell, T.M., Papalexakis, E.E., et al.: Turbo-SMT: accelerating coupled sparse matrix-tensor factorizations by 200x. In: SIAM International Conference on Data Mining. SIAM (2014)
15. Phan, A.-H., Cichocki, A.: Tensor decompositions for feature extraction and classification of high dimensional datasets. Nonlinear Theor. Appl. IEICE 1(1), 37–68 (2010)
16. Tang, C.C., Poston, K.L., Eckert, T., Feigin, A., Frucht, S., Gudesblatt, M., Dhawan, V., Lesser, M., Vonsattel, J.P., Fahn, S., Eidelberg, D.: Differential diagnosis of parkinsonism: a metabolic imaging study using pattern analysis. Lancet Neurol. 9(2), 149–158 (2010)
17. Xu, Z., Bagci, U., Seidel, J., Thomasson, D., Solomon, J., Mollura, D.J.: Segmentation based denoising of PET images: an iterative approach via regional means and affinity propagation. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 698–705. Springer, Cham (2014). doi:10.1007/978-3-319-10404-1_87
18. Zhang, D., Shen, D.: Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease. Neuroimage 59(2), 895–907 (2012)


19. Zhou, G.X., Zhao, Q.B., Cichocki, A., Zhang, Y., Wang, X.Y.: Fast nonnegative tensor factorization based on accelerated proximal gradient and low-rank approximation. Neurocomputing 198, 148–154 (2016)
20. Zhou, L., Salvado, O., Dore, V., Bourgeat, P., Raniga, P., Villemagne, V.L., Rowe, C.C., Fripp, J.: MR-less surface-based amyloid estimation by subject-specific atlas selection and Bayesian fusion. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7511, pp. 220–227. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33418-4_28

Physiological Parameter Estimation from Multispectral Images Unleashed

Sebastian J. Wirkert1(B), Anant S. Vemuri1, Hannes G. Kenngott2, Sara Moccia1,3,4, Michael Götz5, Benjamin F.B. Mayer2, Klaus H. Maier-Hein5, Daniel S. Elson6,7, and Lena Maier-Hein1

1 Division of Computer Assisted Medical Interventions, DKFZ, Heidelberg, Germany
  [email protected]
2 Department for General, Visceral and Transplantation Surgery, Heidelberg University Hospital, Heidelberg, Germany
3 Department of Advanced Robotics, Istituto Italiano di Tecnologia, Genoa, Italy
4 Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
5 Division for Medical and Biological Informatics, DKFZ, Heidelberg, Germany
6 Hamlyn Centre for Robotic Surgery, Imperial College London, London, UK
7 Department of Surgery and Cancer, Imperial College London, London, UK

Abstract. Multispectral imaging in laparoscopy can provide tissue reflectance measurements for each point in the image at multiple wavelengths of light. These reflectances encode information on important physiological parameters not visible to the naked eye. Fast decoding of the data during surgery, however, remains challenging. While model-based methods suffer from inaccurate base assumptions, a major bottleneck related to competing machine learning-based solutions is the lack of labelled training data. In this paper, we address this issue with the first transfer learning-based method for physiological parameter estimation from multispectral images. It relies on a highly generic tissue model that aims to capture the full range of optical tissue parameters that can potentially be observed in vivo. Adaptation of the model to a specific clinical application based on unlabelled in vivo data is achieved using a new concept of domain adaptation that explicitly addresses the high variance often introduced by conventional covariate-shift correction methods. According to comprehensive in silico and in vivo experiments, our approach enables accurate parameter estimation for various tissue types without the need for incorporating specific prior knowledge on optical properties and could thus pave the way for many exciting applications in multispectral laparoscopy.

1 Introduction

Multispectral images (MSI) offer great potential in a large variety of medical procedures. The encoded information about tissue parameters such as oxygenation and blood volume fraction has motivated a considerable body of research related to early cancer detection [1] as well as image-guided therapy involving bowel anastomosis [2] and transplantation evaluation.


Fig. 1. Overview of our approach. First we create masses of generic, labelled tissue-reflectance samples (θ_i, r_i), i = 1…n. The desired physiological parameter y_i and the reflectances adapted to the camera x_i are used for training a base regressor f̂_base. Weights β_i, i = 1…n, are calculated to fit the simulated data to the measurements. These weights adapt the base regressor to the in vivo measurements. Applying the adapted regressor f̂_DR to new images yields their parameter estimates ŷ.

However, decoding the reflectance measurements during a medical intervention is not straightforward. Model-based approaches that are sufficiently fast for online execution typically suffer from incorrect base assumptions such as constant scattering or low tissue absorption [2]. Machine learning-based alternatives [3,4] need accurate information about the composition of the underlying tissue for the training phase to account for the lack of annotated real data. This knowledge about optical tissue properties is hard to obtain and depends on experimental conditions [5]. To address this bottleneck, we present a novel machine learning-based approach to physiological parameter estimation (Sect. 2) that neither requires real labelled data nor specific prior knowledge on the optical properties of the tissue of interest. The method relies on a broadly applicable model of abdominal tissue that aims to capture a large range of physiological parameters observed in vivo. Adaptation of the model to a specific clinical application is achieved by means of domain adaptation (DA) using samples of unlabelled in vivo data. In a comprehensive study (Sect. 3) with seven pigs we show that (1) our model captures a large amount of the variation in real tissue and (2) our transfer learning-based approach enables highly accurate physiological parameter estimation.

2 Methods

Our approach to physiological parameter estimation, which is illustrated in Fig. 1, aims to compensate for the lack of detailed prior knowledge related to optical tissue properties by applying DA. More specifically, we hypothesise that (1) we can use a highly generic tissue model to generate a data set of spectral reflectances that covers the whole range of multispectral measurements that may possibly be observed in vivo and that (2) samples of real (unlabelled) measurements can be used to adapt the data to a new target domain.


Section 2.1 describes how the generic data set is generated and used for physiological parameter estimation, while Sect. 2.2 introduces our approach to DA.

2.1 Generic Approach to Physiological Parameter Estimation in Multispectral Imaging

Our method is built around a single comprehensive data set, which can be used for various camera setups and target structures. Its generation and usage are described below.

Dataset Generation Using a Generic Tissue Model. A generalization of the layered tissue model developed in [3] was used to create our data set consisting of camera-independent tissue-reflectance pairs (θ_i, r_i), i ∈ {1…n}. The (physiological) parameters are varied within the ranges shown in Table 1 for each of the three layers. Each layer is described by its value for blood volume fraction v_hb, scattering coefficient a_mie, scattering power b_mie, anisotropy g, refractive index n and layer thickness d. Oxygenation s is kept constant across layers [3]. Following the values in [5] and in contrast to [3], b_mie is varied, covering all soft, fatty and fibrous tissues. We increase the range of v_hb by a factor of three to potentially model pathologies. In conjunction with values for the haemoglobin extinction coefficients ε_Hb and ε_HbO2 from the literature, the absorption and scattering coefficients μ_a and μ_s can be determined for usage in the Monte Carlo simulation framework. The simulated range of wavelengths λ is large enough for adapting the simulations to cameras operating in the visible and near infrared. To account for a specific camera setup, r_i can be transformed to camera reflectances c_{i,j} at the j-th spectral band, using the method described in [4]. Zero-mean Gaussian noise w was added to model camera noise.

Table 1. The simulated ranges of physiological parameters and their usage in the simulation set-up as described in Sect. 2.1. Values used in [3] are given if different.

Layer 1–3 parameter ranges:
  v_hb [%]: 0–30 (in [3]: 0–10)
  s [%]: 0–100
  a_mie [1/cm]: 5–50
  b_mie: 0.3–3 (in [3]: 1.3)
  g: 0.8–0.95
  n: 1.33–1.54 (in [3]: 1.36)
  d [µm]: 20–2000 (in [3]: 395–1010)

μ_a(v_hb, s, λ) = v_hb (s ε_HbO2(λ) + (1 − s) ε_Hb(λ)) ln(10) · 150 g L⁻¹ · (64,500 g mol⁻¹)⁻¹
μ_s(a_mie, b_mie, λ) = (a_mie / (1 − g)) · (λ / 500 nm)^(−b_mie)

Simulation framework: GPU-MCML [6], 10⁶ photons per simulation. Simulated samples: 500 K ([3]: 10 K). Sample wavelength range: 300–1000 nm ([3]: 450–720 nm), step size 2 nm.
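A rough Python sketch of the two formulas from Table 1 follows; the extinction-coefficient values in the usage example are placeholders (not from the paper), and the unit handling is an assumption for illustration only.

```python
import numpy as np

def mu_a(v_hb, s, eps_hbo2, eps_hb):
    """Absorption coefficient from blood volume fraction v_hb and oxygenation s,
    following the table formula (150 g/l haemoglobin, 64500 g/mol molar mass).
    eps_hbo2 / eps_hb: molar extinction coefficients at the chosen wavelength."""
    return v_hb * (s * eps_hbo2 + (1.0 - s) * eps_hb) * np.log(10) * 150.0 / 64500.0

def mu_s(a_mie, b_mie, g, wavelength_nm):
    """Scattering coefficient from a_mie, b_mie and anisotropy g per the table."""
    return (a_mie / (1.0 - g)) * (wavelength_nm / 500.0) ** (-b_mie)

# Hypothetical example at 500 nm with made-up extinction values:
print(mu_a(0.02, 0.7, eps_hbo2=50000.0, eps_hb=40000.0))
print(mu_s(20.0, 1.3, g=0.9, wavelength_nm=500.0))
```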


Physiological Parameter Regression. To train a regressor for specific physiological parameters y_i ∈ θ_i, such as oxygenation and blood volume fraction, the corresponding reflectances have to be normalized, x_i = c_i / Σ_j c_{i,j}, to account for multiplicative factors due to changes in light intensity or camera pose. The combination of normalized camera reflectances and the corresponding physiological parameter (X, y) serves as training data for any machine learning regression method f to obtain the regressor f̂_base, which corresponds to a regressor without DA. The physiological parameter estimates during an intervention can be determined using this baseline regressor by evaluating f̂_base(x′) = ŷ′ for each recorded multispectral image pixel x′. The next section describes our DA technique to further improve the parameter estimation in a specific clinical context.
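A minimal sketch of this baseline step (random-forest regressor on L1-normalized band reflectances); the array sizes and forest settings below are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated training data: camera reflectances (n samples x 8 bands) and
# the physiological parameter of interest (e.g., oxygenation).
C = np.random.rand(10000, 8)            # placeholder simulated reflectances
y = np.random.rand(10000)               # placeholder oxygenation labels

X = C / C.sum(axis=1, keepdims=True)    # normalize out multiplicative factors

f_base = RandomForestRegressor(n_estimators=50, n_jobs=-1).fit(X, y)

# At intervention time: normalize each measured pixel the same way, then predict.
c_new = np.random.rand(1, 8)
y_hat = f_base.predict(c_new / c_new.sum(axis=1, keepdims=True))
```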

2.2 Domain Adaptation

Working with tissue samples from a simulated source domain p_s(x, y) will inevitably introduce a bias with respect to the target domain p_t(x, y). We speak of covariate shift if p_s(y|x) = p_t(y|x) and therefore p_t(x, y)/p_s(x, y) = p_t(x)/p_s(x) =: β(x). If p_t is contained in the support of p_s, adding the weights β to the loss function of the regressor can correct for the covariate shift [7]. The appeal of this method is that only recordings x′_i and no labels y′_i are necessary for the adaptation. While the concept of covariate shift has been applied with great success in a number of different medical imaging applications [8,9], major challenges related to transferring it to our problem are the estimation of the high-dimensional β and the high variance introduced by weighting, both addressed in the next two subsections.

Finding β with Kernel Mean Matching and Random Kitchen Sinks. Kernel mean matching (KMM) is a state-of-the-art method for determining β [7]. KMM minimizes the mean distance between the samples of the two domains in a reproducing kernel Hilbert space H, using a possibly infinite-dimensional lifting φ : R^m → H. In its original formulation [7], the kernel trick was used to pose KMM as a quadratic problem. In our problem domain this is not feasible, because calculating the Gram matrix is quadratic in the (high) number of samples. To overcome this bottleneck, we minimize the KMM objective function (see [7]) with an approximate representation of the lifting, φ(x_i) ≈ z(x_i), determined by the random kitchen sinks method [10]. This enables us to solve the convex KMM objective function in its non-kernelized form using a standard optimizer.

Doubly Robust Covariate Shift Correction. Estimators trained using weighted samples can yield worse results than estimators not accounting for the covariate shift. The reason is that only few samples are effectively "active", providing the risk minimizer with fewer samples. On the other hand, using the unweighted training samples often leads to a reasonable, but biased, estimator [11]. Intuitively, the unweighted base regressor f̂_base defined in Sect. 2.1 can be used to obtain an initial estimate; subsequently, another estimator aims to refine the results with emphasis on the samples with high weight. This is the basic idea


of doubly robust (DR) covariate shift correction [11]. Specifically, we use the residuals δ_i = y_i − f̂_base(x_i), weighted by the β from the last subsection, to train an estimator f̂_res on (X, δ, β). The final estimate is f̂_dr(x′) = f̂_base(x′) + f̂_res(x′).
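A compact sketch of this doubly robust correction, under the assumption that the importance weights β have already been estimated; random forests stand in for the regression method, and all names and data below are placeholders rather than the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_doubly_robust(X, y, beta):
    """Doubly robust covariate shift correction: fit an unweighted base
    regressor, then fit a beta-weighted regressor on its residuals."""
    f_base = RandomForestRegressor(n_estimators=50).fit(X, y)
    residuals = y - f_base.predict(X)
    f_res = RandomForestRegressor(n_estimators=50).fit(X, residuals,
                                                       sample_weight=beta)
    return lambda X_new: f_base.predict(X_new) + f_res.predict(X_new)

# Hypothetical usage with placeholder data and weights:
X = np.random.rand(1000, 8); y = np.random.rand(1000)
beta = np.random.rand(1000)              # importance weights from KMM/RKS
f_dr = train_doubly_robust(X, y, beta)
print(f_dr(np.random.rand(5, 8)))
```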

3 Experiments and Results

Based on a comprehensive in silico and in vivo MSI data set (Sect. 3.1), we validate the quality of our generic tissue model (Sect. 3.2) as well as the performance of the DA-based physiological parameter estimation approach (Sect. 3.3).

3.1 Experimental Setup

Images were recorded with a custom-built multispectral laparoscope, capturing images at eight different wavebands [3] with the 5 Mpix Pixelteq (Largo, FL, USA) Spectrocam. We recorded MSI data from seven pigs and six organs (liver, spleen, gallbladder, bowel, diaphragm and abdominal wall) in a laparoscopic setting. For all our experiments we used the data set described in Sect. 2.1 for training. We used a random forest regressor with parameters as in [3]. We drew 1000 random directions from the reproducing kernel Hilbert space induced by the radial basis function (RBF) kernel with the random kitchen sinks method. We set the σ value of the RBF to the approximate median sample distance and the B parameter of the KMM to ten for all experiments.
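The random-direction construction can be sketched as follows (random Fourier features approximating the RBF kernel, with σ set to an approximate median pairwise distance); this is a simplified illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist

def random_kitchen_sinks(X, n_features=1000, sigma=None, seed=0):
    """Random Fourier feature map z(x) approximating the RBF kernel lifting."""
    rng = np.random.default_rng(seed)
    if sigma is None:
        sigma = np.median(pdist(X[:2000]))       # approximate median distance
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# z(simulated samples) and z(in vivo samples) can then be plugged into the
# non-kernelized KMM objective, minimized over beta with a standard optimizer.
```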

3.2 Validity of Tissue Model

One of the prerequisites for covariate shift correction is that the support of the distribution of true in vivo measurements is contained in the support of the simulated reflectances. To investigate this for a range of different tissue types (cf. Sect. 3.1), we collected a total of 57 images, extracted measurements from a 100 × 100 region of interest (ROI) and corrected them by a flatfield and a dark image as in [2]. The first three principal components cover 99% = 82% + 13% + 4% of the simulated variance. For the in vivo data, 97% = 89% + 4% + 4% of the variance lies on the simulated data's first three principal components. For a qualitative assessment we projected the in vivo measurements onto the first two principal components of the simulated data. A selection can be seen in Fig. 2. Apart from gallbladder, all of the in vivo data lie on the two-dimensional manifold implied by the simulated data. Figure 3 illustrates how changes in oxygenation and perfusion influence the distribution of the measurements.

3.3 Performance of Domain Adaptation

To validate our approach to physiological parameter estimation with reliable reference data, we performed an in silico experiment with simulated colon tissue as target domain. For this purpose, we used 15,000 colon tissue samples with corresponding ground truth oxygenation from [3] at a signal-to-noise ratio


Fig. 2. Four organs from three pigs projected onto the first two principal components of our simulated reflectance data plotted in brown. The images on the left show the 560 nm band recorded for the first pig. The depicted measurements are taken from the red ROI. Except for gallbladder, all organs lie on the non-zero density estimates of the simulated data (See also Sect. 4).

Fig. 3. Liver tissue measurements before and after sacrificing a pig. The grid indicates how varying oxygenation (sao2) and blood volume fraction (vhb) changes the measurements in the space spanned by the first two principal components of the simulations. Note that these lines cannot be directly interpreted as sao2 and vhb values for the two points, because other factors such as scattering will cause movement on this simplified manifold.

(SNR := c_{i,j}/w_{i,j}) of 20. 10,000 reflectance samples were selected for DA while the remaining 5,000 samples were used for testing. We varied the number of training samples between 10⁴ and 5 × 10⁵ to investigate how the effective sample size influences the results. Our DR DA method reduced the median absolute oxygenation estimation error compared to the base estimator by 25–27%, and by 14–25% without the DR correction (Fig. 4a). As expected, the difference was smaller for higher effective sample sizes m_eff = ‖β‖₁² / ‖β‖₂².


Fig. 4. In silico boxplot results (a) and in vivo (b) validation results corresponding to the experiments described in Sect. 3.3. (b) Shows the distribution of in vivo measurements and adapted in silico reflectances in the principal component space of the simulations. For graphical clarity the distributions are visualized as their two principal axes in this space with lengths corresponding to the eigenvalues.

In vivo experiments were performed for each organ by calculating the Euclidean distance of the weighted simulation mean (1/n) Σ_i β_i x_i to the mean of the images. The weighted distance was 36–92% (median 77%) smaller than the unweighted average. See Fig. 4b for a depiction in the principal component space.

4 Discussion

To our knowledge, this paper introduced the first transfer learning-based approach to physiological parameter estimation from multispectral imaging data. As it neither requires real labelled data nor specific prior knowledge on the optical properties of the tissue of interest, and is further independent of the camera model and corresponding optics, it is potentially broadly applicable to a wide range of clinical applications. The method is built around a generic data set that can automatically be adapted to a given target anatomy based on samples of unlabelled in vivo data. According to porcine experiments with six different target structures, the first three principal components of our simulated data set capture 97% of the measured in vivo variations. Our hypothesis is that these variations represent the blood volume fraction, oxygenation and scattering. Visual inspection of the first two principal components showed that the captured organ data lie within the simulated data, an important prerequisite for the subsequent DA to work. Gallbladder is the exception, most likely due to its distinctive green stain, caused by the bile shining through. Modelling the bile as another chromophore and extending our data set accordingly would be straightforward. In future work we plan to capture an even higher variety of in vivo data, involving pathologies such as cancer. Our experiments further demonstrate the potential performance boost when adapting the generalized model to a specific task using the presented DA technique. An important methodological component in this context was the integration of the recently proposed DR correction method to address the instabilities


when few effective training samples are selected. We also tested this method with another recently proposed DA weighting method [9], with similar results. Both the bias and the variance of the Euclidean distance of the weighted sample mean to the mean of the in vivo images were reduced when increasing the number of pigs used for weight determination (not shown). The number of training cases required for a given application is an interesting future direction of research. In conclusion, we have addressed the important bottleneck of the lack of annotated MSI data with a novel transfer learning-based method for physiological parameter estimation. Given the highly promising experimental results presented in this manuscript, future work will focus on evaluating the method for a variety of clinical applications including partial nephrectomy and cancer detection.

Conflict of Interest. The authors declare that they have no conflict of interest.

Compliance with Ethical Standards. This article does not contain any studies with human participants. All applicable international, national, and/or institutional guidelines for the care and use of animals were followed.

Acknowledgement. Funding for this work was provided by the European Research Council (ERC) starting grant COMBIOSCOPY (637960).

References

1. Claridge, E., Hidovic-Rowe, D.: Model based inversion for deriving maps of histological parameters characteristic of cancer from ex-vivo multispectral images of the colon. IEEE TMI 33, 822–835 (2013)
2. Clancy, N.T., et al.: Intraoperative measurement of bowel oxygen saturation using a multispectral imaging laparoscope. Biomed. Opt. Express 6(10), 4179–4190 (2015)
3. Wirkert, S.J., et al.: Robust near real-time estimation of physiological parameters from megapixel multispectral images with inverse Monte Carlo and random forest regression. IJCARS 11(6), 909–917 (2016)
4. Styles, I.B., et al.: Quantitative analysis of multi-spectral fundus images. Med. Image Anal. 10(4), 578–597 (2006)
5. Jacques, S.L.: Optical properties of biological tissues: a review. Phys. Med. Biol. 58(11), R37 (2013)
6. Alerstam, E., et al.: Next-generation acceleration and code optimization for light transport in turbid media using GPUs. Biomed. Opt. Express 1(2), 658–675 (2010)
7. Huang, J., et al.: Correcting sample selection bias by unlabeled data. In: Advances in Neural Information Processing Systems, vol. 19 (2007)
8. Heimann, T., et al.: Real-time ultrasound transducer localization in fluoroscopy images by transfer learning from synthetic training data. Med. Image Anal. 18(8), 1320–1328 (2014)
9. Goetz, M., et al.: DALSA: domain adaptation for supervised learning from sparsely annotated MR images. IEEE TMI 35(1), 184–196 (2016)
10. Rahimi, A., et al.: Random features for large-scale kernel machines. In: NIPS, vol. 3 (2007)
11. Reddi, S.J., et al.: Doubly robust covariate shift correction. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)

Segmentation of Cortical and Subcortical Multiple Sclerosis Lesions Based on Constrained Partial Volume Modeling

Mário João Fartaria1,2,3(B), Alexis Roche1,2, Reto Meuli2, Cristina Granziera4,5, Tobias Kober1,2,3, and Meritxell Bach Cuadra2,3,6

1 Advanced Clinical Imaging Technology, Siemens Healthcare AG, Lausanne, Switzerland
2 Department of Radiology, CHUV and UNIL, Lausanne, Switzerland
  mario.fartaria de [email protected]
3 Signal Processing Laboratory (LTS 5), EPFL, Lausanne, Switzerland
4 Martinos Center for Biomedical Imaging, MGH and HMS, Charlestown, MA, USA
5 Department of Clinical Neuroscience, CHUV and UNIL, Lausanne, Switzerland
6 Center of Biomedical Imaging, CIBM, UNIL, Lausanne, Switzerland

Abstract. We propose a novel method to automatically detect and segment multiple sclerosis lesions, located both in white matter and in the cortex. The algorithm consists of two main steps: (i) a supervised approach that outputs an initial bitmap locating candidates of lesional tissue and (ii) a Bayesian partial volume estimation framework that estimates the lesion concentration in each voxel. By using a "mixel" approach, potential partial volume effects, which especially affect small lesions, can be modeled, thus yielding improved lesion segmentation. The proposed method is tested on multiple MR image sequences including 3D MP2RAGE, 3D FLAIR, and 3D DIR. Quantitative evaluation is done by comparison with manual segmentations on a cohort of 39 early-stage multiple sclerosis patients.

Keywords: Cortical lesions · Partial volume · Multiple sclerosis · MRI · Lesion segmentation

1 Introduction

Multiple Sclerosis (MS) is a disease characterized by focal and diffuse inflammation, degeneration, and repair in the central nervous system [1]. The inflammatory demyelination is more common in white matter (WM) and becomes manifest as focal WM lesions in Magnetic Resonance Imaging (MRI). Recently, advanced MRI has revealed substantial tissue damage also in the cortical gray matter (GM) (i.e. cortical lesions) [2]. Automated MS lesion segmentation has been an active research topic for more than 20 years [3]. Despite significant advances in quantitative image analysis of MS lesions in MRI, some challenges however still remain.


The effect of a mixture of tissues in the same voxel, known as the partial volume (PV) effect, is one of the aspects that render the lesion segmentation problem difficult. PV particularly affects small lesions, which are of key importance for the early diagnosis and follow-up of MS patients. A relatively good correlation between the severity of cortical lesions and patient disability has recently been reported [2]. Cortical lesions are normally small and tend to appear more frequently in regions prone to strong PV effects, as seen at the interface between WM and GM [4].

Fig. 1. Diagram of the lesion segmentation pipeline divided into two main steps: the first step outputs the lesion location bitmap (λ^Les) and the second step performs the PV estimation. π^CSF, π^GM and π^WM are atlas-based prior probability maps for CSF, GM, and WM, respectively.

Our study is based on a set of advanced MRI sequences that have been shown to be equally sensitive to WM lesions as routine sequences, but significantly more sensitive to cortical lesions [4,5]. Some of these sequences have recently been recommended as optional sequences in clinical protocols [1]. The goal of this work is to improve MS lesion delineation and consequently the estimation of lesion load in cortical and subcortical areas, a clinically very significant biomarker. Our novel framework (see Fig. 1) combines a supervised lesion detection method with a Bayesian PV estimation prototype algorithm. The former is a k-nearest neighbors (kNN) approach that has been reported to achieve good detection of MS lesions [5,6]. The latter is a novel method inspired by the "mixel" model originally proposed in [7]. The model leads, however, to an ill-posed estimation problem for which [8] suggested the use of regularizing priors. Here, we further introduced spatial constraints (derived from the initial kNN detection) into the model to estimate realistic concentration maps of healthy (WM, GM) and


pathological brain tissue as well as cerebrospinal fluid (CSF). These concentration maps are used to directly compute lesion volumes rather than correcting an initial hard tissue classification for PV effects as in previous methods [9,10]. Furthermore, we first use the supervised approach to drive the unsupervised segmentation contrary to what was proposed in [11]. With this work, we thus strive to combine the advantages of supervised and unsupervised methods to yield both good lesion detection and volume estimation.

2 Method

2.1 Partial Volume Estimation

Consider a set of n_c images of a given single subject acquired from different MRI sequences after the alignment, bias field correction and skull stripping. Consistent with [7,8,12], we assume that the vector of image intensities y_i at a voxel i in the total intra-cranial mask relates to an unknown vector of tissue concentrations q_i, with q_i ≥ 0 and q_iᵀ1 = 1, through the statistical relation:

y_i = M q_i + ε_i,   ε_i ∼ N(0, V),    (1)

where n_t is the number of distinct tissues, M is an n_t × n_c matrix representing the mean tissue intensities for each channel (i.e., M_tc is the mean intensity of tissue t in channel c), and V = diag(σ_1², …, σ_{n_c}²) is the noise covariance matrix assuming independent stationary Gaussian white noise across modalities. In this work, we consider n_t = 4 tissues (CSF, GM, WM, lesions) and n_c = 3 channels. We use a prior concentration model of the form proposed in [8] in order to regularize the problem of recovering the voxelwise tissue concentrations q_i via Bayesian maximum a posteriori (MAP) estimation:

π(q_1, q_2, …, q_{n_v}) ∝ exp( −(1/2) Σ_i q_iᵀ A q_i − (β/2) Σ_i Σ_{j∈N_i} ‖q_i − q_j‖² ),    (2)

where nv is the total number of intra-cranial voxels, A is a symmetric penalty matrix with zero diagonal and positive off-diagonal elements, β is a positive constant, and Ni is the neighborhood of voxel i according to some discrete topology (we use a 6-topology). Both elements of A and β are hyperparameters to be tuned in a learning phase. While β controls the amount of spatial smoothness of tissue concentration maps, the purpose of A is to disentangle intensity fluctuations due to noise from PV effects. Each non-diagonal element acts as a penalty on the mixing of distinct tissues in a voxel, hence limiting spurious concentration variations when a single tissue is present. For instance, the larger A12 , the less likely voxels contain both CSF and GM. We propose to generalize the prior model of [8] by allowing voxel-dependent penalty matrices Ai including non-zero diagonal elements in order to penalize tissues locally. This is done here to avoid confusing GM and lesions, which have

Cortical and Subcortical MS Lesion Segmentation

145

similar intensity signatures in the investigated image sequences. Specifically, let π GM be a probabilistic atlas-based prior probability map for the GM and λLes a bitmap that indicates brain regions with lesions (see Sect. 2.2). We set the diagonal elements of Ai corresponding respectively to CSF, GM, WM and lesions, via: Ai,11 = 0,

Ai,22 = a2 (1 − πiGM ),

Ai,33 = 0,

Ai,44 = a4 (1 − λLes i ),

where a2 and a4 are positive factors pre-tuned along with the off-diagonal elements A12 , A13 , A14 , A23 , A24 , A34 and the smoothness parameter β, which are assumed voxel-independent in our particular implementation. Solving for the MAP tissue concentrations yields a quadratic programming problem:    (yi − M qi ) V−1 (yi − M qi ) + q qi − qj 2 , min i Ai q i + β q1 ,...,qnv

i

j∈Ni

where each qi is searched in the multidimensional simplex. The solution is tracked numerically using an iterative scheme that loops over the intra-cranial voxels, and solves for the associated concentration vector qi with all other concentration vectors held fixed using an active set algorithm [8]. This method proves very robust in practice, and typically converges in less than 25 iterations. 2.2

Bitmap of Lesion Location

A supervised approach was used to obtain a map that locates candidates of lesional tissue (λLes ). The method is based on the kNN classifier trained using a set of features obtained from images and atlas-based prior probability maps of the two brain tissues and CSF (π GM , π WM and π CSF ). The features used for the classification were (i) image voxel intensities, (ii) spatial location coordinates in a common space, and (iii) tissue prior probabilities. The k value was set to 15, which was empirically found as a good trade-off between accuracy and computation time [5]. Manual segmentations described in Sect. 3 were used to train the classifier (1 was assigned to lesions voxels, and 0 to the other tissues). Finally, in order to obtain the lesion location bitmap λLes , a dilation using a 4 × 4 × 4 cubic shape as a structural element was applied to the kNN output to enlarge the boundaries of the detected regions in order to guarantee that all lesional tissue is covered. Using this map to drive the PV segmentation with lesion candidates renders the present approach more patient-specific in contrast to employing general tissue atlas priors. 2.3

Imaging Parameters

The noise variance matrix V is initially assumed to be zero, and is iteratively re-estimated by MAP concurrently with the tissue concentrations (see Sect. 2.1), yielding the update rule:   1  (3) (yi − M qi )(yi − M qi ) , V = diag nv i

146

M.J. Fartaria et al.

which is performed after a complete tissue concentration re-estimation loop over the intracranial voxels. Conversely, the matrix M of mean tissue intensities is held fixed during the estimation of tissue concentrations. The mean intensity for CSF, GM, WM was determined from π CSF , π GM , π WM maps respectively, using the voxels with probability higher than 0.95. The mean intensity for lesional tissue was estimated from the kNN output. 2.4

Hyperparameter Tuning

A reference patient was used to tune the hyperparameters A and β so as to minimize the Hellinger distance between the manual lesion segmentation binary mask and the lesion concentration map output by the PV estimation algorithm. Two elements of A were fixed to very large values in order to proscribe mixing of CSF with WM and CSF with lesions (A13 = A14 = 1 × 1010 ). The other parameters were optimized using Powell’s method, yielding A12 = 27.52, A23 = −1.42, A24 = 14.49, A34 = 3.90, a2 = 15.41, a4 = 158.55, and β = 1.53.

3 3.1

Experimental Validation Data and Pre-processing

Thirty-nine patients (14 males, 25 females, median age 34 years, age range: 20–60 years) with early relapsing-remitting MS, disease duration less than 5 years from diagnosis) and Expand Disability Status Scale (EDSS) score between 1 and 2 (median EDSS = 1.5), were scanned on a 3T MRI system (MAGNETOM Trio, Siemens Healthcare GmbH, Erlangen, Germany) using a 32-channel head coil. The MRI protocol included: (i) high-resolution magnetization-prepared 2 rapid acquisition with gradient echo (MP2RAGE) (TR/TI1/TI2 = 5000/700/2500 ms, voxel size = 1×1×1.2 mm3 ), (ii) 3D fluid-attenuated inversion recovery (FLAIR) (TR/TE/TI = 5000/394/1800 ms, voxel size = 1 × 1 × 1.2 mm3 ), and (iii) 3D double-inversion recovery (DIR) (TR/TE/TI1/TI2 = 10000/218/450/3650 ms, voxel size = 1 × 1 × 1.2 mm3 ) all acquired in the same session without patient repositioning. WM and cortical lesions were first identified and marked by one radiologist and one neurologist separately and subsequently agreed on between the two in a follow-up reading. A trained technician then delineated the lesion volumes in each image, we consider the resulting masks as ground truth for lesion load and volume. A patient with relatively high lesion load was chosen as a reference to train the PV algorithm (see Sects. 2.3 and 2.4) and therefore excluded from the ensuing statistical analysis. ELASTIX [13] was used to rigidly register the different images sequences to a common space in each subject. All images were further skull-stripped using an in-house method [14], and corrected for intensity inhomogeneities using the N4 algorithm [15]. The intensity normalization was performed using the histogram

Cortical and Subcortical MS Lesion Segmentation

147

matching technique proposed by [16]. This last pre-processing step was applied only to the images used as kNN input. Fuzzy in-house templates of prior WM, GM, CSF probabilities were nonrigidly registered using ELASTIX onto each image volume to produce the prior maps π GM , π WM , and π CSF (see Sect. 2.2). 3.2

Results

In line with our goal to optimise the lesion delineation of the supervised algorithm used in the first step, we compared the obtained lesion load and voxel-wise metrics before and after applying the proposed PV algorithm on the binary lesion masks. Since the PV estimation is based on the initial lesion location bitmap λLes obtained by the kNN algorithm, it is restricted to known lesion locations. Consequently, no significant improvement was found for the lesion detection rate (DR) nor for the false positive rate. DR for WM lesions ≈75%, and DR for cortical lesions ≈55% before and after applying the PV algorithm. Figure 2 shows exemplary results (WM, cortical and peri-ventricular lesions) comparing the ground truth (GT) with the lesion masks before and after PV estimation. It can be observed visually that lesions appear better delineated using the PV algorithm. Figure 3 shows the differences between the total lesion volume (TLV) computed from the binary and PV-segmented lesions masks and the GT both for all lesions and for WM/cortical lesions separately. PV improved the TLV estimation with respect to the GT significantly compared to the binary kNN mask (P-value ≈ 0.004), mainly due to a substantial improvement in cortical lesion segmentation (P-value ≈ 2e−05). To compare the sensitivity and Dice of the two methods, a threshold has to be applied on the concentration map

Fig. 2. Patches showing examples of manual segmentation (GT, second column), and automated segmentation before (kNN, third column) and after (kNN-PV, last column) applying the PV approach. From the ground truth, WM lesions are shown in green and cortical lesions in blue. The examples are shown in a MP2RAGE background.



Fig. 3. Boxplots of TLV difference between manual (ground truth, GT) and automated lesion segmentation before (kNN) and after (kNN-PV) applying the PV method. From left to right, TLV difference for all, WM, and cortical lesions. The crosses in the plot represent outliers in our cohort. N.S.: not significant.

Fig. 4. Boxplots showing the voxel-wise analysis (sensitivity and Dice) before (kNN) and after (kNN-PV) applying the PV estimation method. The crosses in the plot represent outliers in our cohort.

of lesional tissue obtained by the PV algorithm. The threshold was chosen so that the Dice between the automated and the GT segmentation was maximal. As shown in Fig. 4, a significant (P-value ≈ 2e−07) improvement of 10% for sensitivity and 6% for Dice was obtained by applying the PV estimation method.
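The threshold selection can be reproduced with a simple sweep; the sketch below, which assumes the PV concentration map and the GT mask are available as NumPy arrays with hypothetical names, picks the cut-off maximizing the Dice coefficient and is not the authors' code.

import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def best_threshold(concentration, gt_mask, thresholds=np.linspace(0.05, 0.95, 19)):
    """Return (threshold, Dice) maximizing the Dice of the thresholded
    lesion-concentration map against the ground-truth segmentation."""
    scores = [(t, dice(concentration >= t, gt_mask > 0)) for t in thresholds]
    return max(scores, key=lambda s: s[1])

# Example (hypothetical arrays): t_best, d_best = best_threshold(pv_map, gt)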

4 Conclusion

We presented a novel framework to automatically detect and segment MS lesions in cortical and sub-cortical areas. Our method exploits the good lesion detection performance of a supervised kNN algorithm and extends it with a novel Bayesian PV estimation method to yield improved lesion load estimates. A good volume assessment is important since the TLV is a clinically relevant marker, both for diagnosis and treatment monitoring. Our results suggest that both lesion volume estimation and lesion segmentation can be improved when PV effects are considered. Metrics like sensitivity and Dice were significantly improved when the PV estimation approach was used. Our lesion PV estimation method can be combined with any initial detection; indeed, improving the initial lesion location bitmap would further improve the



lesion segmentation and volume estimation. Future work will focus on improving the initial detection by exploring other feature sets and other classification techniques.

References

1. Rovira, À., Wattjes, M.P., et al.: Evidence-based guidelines: MAGNIMS consensus guidelines on the use of MRI in multiple sclerosis clinical implementation in the diagnostic process. Nat. Rev. Neurol. 11(8), 471–482 (2015)
2. Calabrese, M., Filippi, M., Gallo, P.: Cortical lesions in multiple sclerosis. Nat. Rev. Neurol. 6(8), 438–444 (2010)
3. García-Lorenzo, D., Francis, S., Narayanan, P.: Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging. Med. Image Anal. 17(1), 1–18 (2013)
4. Kober, T., Granziera, C., et al.: MP2RAGE multiple sclerosis magnetic resonance imaging at 3 T. Investig. Radiol. 47(6), 346–352 (2012)
5. Fartaria, M.J., Bonnier, G., Roche, A., et al.: Automated detection of white matter and cortical lesions in early stages of multiple sclerosis. J. Magn. Reson. Imaging 43(6), 1445–1454 (2016)
6. Anbeek, P., et al.: Probabilistic segmentation of white matter lesions in MR imaging. NeuroImage 21(3), 1037–1044 (2004)
7. Choi, H.S., et al.: Partial volume tissue classification of multichannel magnetic resonance images - a mixel model. IEEE Trans. Med. Imaging 10(3), 395–407 (1991)
8. Roche, A., Forbes, F.: Partial volume estimation in brain MRI revisited. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8673, pp. 771–778. Springer, Cham (2014). doi:10.1007/978-3-319-10404-1_96
9. Khademi, A., et al.: Multiscale partial volume estimation for segmentation of white matter lesions using FLAIR MRI. In: 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), pp. 568–571. IEEE (2015)
10. Wu, Y., Warfield, S.K., et al.: Automated segmentation of multiple sclerosis lesion subtypes with multichannel MRI. NeuroImage 32(3), 1205–1215 (2006)
11. Jerman, T., Galimzianova, A., Pernuš, F., Likar, B., Špiclin, Ž.: Combining unsupervised and supervised methods for lesion segmentation. In: Crimi, A., Menze, B., Maier, O., Reyes, M., Handels, H. (eds.) BrainLes 2015. LNCS, vol. 9556, pp. 45–56. Springer, Cham (2016). doi:10.1007/978-3-319-30858-6_5
12. Van Leemput, K., Maes, F., Vandermeulen, D., Suetens, P.: A unifying framework for partial volume segmentation of brain MR images. IEEE Trans. Med. Imaging 22(1), 105–119 (2003)
13. Klein, S., Staring, M., et al.: Elastix: a toolbox for intensity-based medical image registration. IEEE Trans. Med. Imaging 29(1), 196–205 (2010)
14. Schmitter, D., Roche, A., et al.: An evaluation of volume-based morphometry for prediction of mild cognitive impairment and Alzheimer's disease. NeuroImage Clin. 7, 7–17 (2015)
15. Tustison, N.J., Avants, B.B., et al.: N4ITK: improved N3 bias correction. IEEE Trans. Med. Imaging 29(6), 1310–1320 (2010)
16. Nyúl, L.G., Udupa, J.K., Zhang, X.: New variants of a method of MRI scale standardization. IEEE Trans. Med. Imaging 19(2), 143–150 (2000)

Classification of Pancreatic Cysts in Computed Tomography Images Using a Random Forest and Convolutional Neural Network Ensemble Konstantin Dmitriev1(B) , Arie E. Kaufman1 , Ammar A. Javed2 , Ralph H. Hruban3 , Elliot K. Fishman4 , Anne Marie Lennon2,5 , and Joel H. Saltz6 1

Department of Computer Science, Stony Brook University, Stony Brook, USA [email protected] 2 Department of Surgery, Johns Hopkins School of Medicine, Baltimore, MD, USA 3 The Department of Pathology, The Sol Goldman Pancreatic Cancer Research Center, Johns Hopkins School of Medicine, Baltimore, MD, USA 4 Department of Radiology, Johns Hopkins School of Medicine, Baltimore, MD, USA 5 Division of Gastroenterology and Hepatology, Johns Hopkins School of Medicine, Baltimore, MD, USA 6 Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA

Abstract. There are many different types of pancreatic cysts. These range from completely benign to malignant, and identifying the exact cyst type can be challenging in clinical practice. This work describes an automatic classification algorithm that classifies the four most common types of pancreatic cysts using computed tomography images. The proposed approach utilizes the general demographic information about a patient as well as the imaging appearance of the cyst. It is based on a Bayesian combination of the random forest classifier, which learns subclass-specific demographic, intensity, and shape features, and a new convolutional neural network that relies on the fine texture information. Quantitative assessment of the proposed method was performed using a 10-fold cross validation on 134 patients and reported a classification accuracy of 83.6%.

1 Introduction

Pancreatic cancer, or pancreatic ductal adenocarcinoma (PDAC) as it is formally known, is one of the most lethal of all cancers with an extremely poor prognosis and an overall five-year survival rate of less than 9%. There are no specific early symptoms of this disease, and most of the cases are diagnosed at an advanced stage after the cancer has spread beyond the pancreas. Early detection of the precursors of PDAC could offer the opportunity to prevent the development of invasive PDAC. Two of the three precursors of PDAC, intraductal papillary mucinous neoplasms (IPMNs) and mucinous cystic neoplasms (MCNs), form pancreatic cysts. These cysts are common and easy to detect with currently available imaging modalities such as computed tomography (CT) and



magnetic resonance imaging. IPMNs and MCNs can be relatively easily identified and offer the potential for the early identification of PDAC. However, the issue is complicated because there are many other types of pancreatic cysts. These range from entirely benign, or non-cancerous, cysts, such as serous cystadenomas (SCAs), which do not require surgical intervention, to solid-pseudopapillary neoplasms (SPNs), which are malignant and should undergo surgical resection. These issues highlight the importance of correctly identifying the type of cyst to ensure appropriate management [5]. The majority of pancreatic cysts are discovered incidentally on computed tomography (CT) scans, which makes CT the first available source of imaging data for diagnosis. A combination of CT imaging findings and general demographic characteristics, such as patient age and gender, is used to discriminate different types of pancreatic cysts [5]. However, correctly identifying cyst type by manual examination of the radiological images can be challenging, even for an experienced radiologist. A recent study [9] reported an accuracy of 67–70% for the discrimination of 130 pancreatic cysts on CT scans performed by two readers with more than ten years of experience in abdominal imaging. The use of a computer-aided diagnosis (CAD) algorithm may not only assist the radiologist but also improve the reliability and objectivity of the differentiation of the various pancreatic cysts identified in CT scans. Although many algorithms have been proposed for the non-invasive analysis of benign and malignant masses in various organs, to our knowledge, there are no CAD algorithms for classifying pancreatic cyst type. This paper presents a novel non-invasive CAD method for discriminating pancreatic cysts by analyzing imaging features in conjunction with the patient's demographic information.

2 Data Acquisition

The dataset in this study contains 134 abdominal contrast-enhanced CT scans collected with a Siemens SOMATOM scanner (Siemens Medical Solutions, Malvern, PA). The dataset consists of the four most common pancreatic cysts: 74 cases of IPMNs, 14 cases of MCNs, 29 cases of SCAs, and 17 cases of SPNs. All CT images have 0.75 mm slice thickness. The ages of the subjects (43 males, 91 females) range from 19 to 89 years (mean age 59.9 ± 17.4 years).

Fig. 1. Examples of pancreatic cyst appearance in CT images



One of the most critical parts of computer-aided cyst analysis is segmentation. The effectiveness and robustness of the ensuing classification algorithm depend on the precision of the segmentation outlines. The outlines of each cyst (if multiple) within the pancreas were obtained by a semi-automated graph-based segmentation technique [3] (Fig. 1) and were confirmed by an experienced radiologist (E.F.). The histopathological diagnosis for each subject was confirmed by a pancreatic pathologist (R.H.H.) based on the subsequently resected specimen. The segmentation step was followed by a denoising procedure using the state-of-the-art BM4D enhancement filter [6].

3 Method

This work describes an ensemble model, designed to provide an accurate histopathological differentiation for pancreatic cysts. This model consists of two principal components: (1) a probabilistic random forest (RF) classifier, which analyzes manually selected quantitative features, and (2) a convolutional neural network (CNN) trained to discover high-level imaging features for a better differentiation. We propose to analyze 2D axial slices, which can be more efficient in terms of memory consumption and computation compared to the analysis of 3D volumes. The overall schema of the proposed method is illustrated in Fig. 2.

Fig. 2. A schematic view of the proposed classification ensemble of (a) a random forest trained to classify vectors of quantitative features, and (b) a convolutional neural network for classification based on the high-level imaging features. Their Bayesian combination (c) generates the final class probabilities.

3.1 Quantitative Features and Random Forest

The most common features mentioned in the medical literature that are used for initial pancreatic cyst differentiation involve gender and age of the subject,



as well as location, shape and general appearance of the cyst [9]. In this paper, we define a set Q of 14 quantitative features to describe particular cases by: (1) age a ∈ Q and gender g ∈ Q of the patient, (2) cyst location l ∈ Q, (3) intensity I ⊂ Q and (4) shape S ⊂ Q features of a cyst. The importance and discriminative power of these features are described below.

1. Age and Gender. Several studies reported a strong correlation between age and gender of a patient and certain types of pancreatic cysts [1,5]. For example, MCN and SPN often present in women of premenopausal age. In contrast, IPMNs have an equal distribution between men and women, and typically present in patients in their late 60s.
2. Cyst location. Certain cyst types are found in particular locations within the pancreas. For example, the vast majority of MCNs arise in the body or tail of the pancreas.
3. Intensity features. Due to differences in the fine structure of pancreatic cysts, such as homogeneity versus the common presence of septation, calcification or a solid component, we use the intensity features I = {Ī, s, κ, γ, M}, which are the mean, standard deviation, kurtosis, skewness and median of intensities, respectively, as global intensity features for coarse initial differentiation.
4. Shape features. Pancreatic cysts also demonstrate differences in shape depending on the category. Specifically, cysts can be grouped into three categories: smoothly shaped, lobulated and pleomorphic cysts [1]. To capture different characteristics of the shape of a cyst, we use the volume V ∈ S, surface area SA ∈ S, surface area-to-volume ratio SA/V ∈ S, rectangularity r ∈ S, convexity c ∈ S and eccentricity e ∈ S features summarized in [11].

Given a set D = {(x1, y1), ..., (xk, yk)} of examples xi of pancreatic cysts of known histopathological subtypes yi ∈ Y = {IPMN, MCN, SCA, SPN}, we compute a concatenation qi = (ai, gi, li, Īi, si, κi, γi, Mi, Vi, SAi, SA/Vi, ri, ci, ei) of the described features for all k samples in the set D. Following feature extraction, we use an RF classifier to perform the classification of a feature vector qm computed for an unseen cyst sample xm. RF-based classifiers have shown excellent performance in various classification tasks, including numerous medical applications, having high accuracy of prediction and computational efficiency [7,8]. More formally, we use a forest of T decision trees implemented with the scikit library (http://scikit-image.org). Each decision tree θt predicts the conditional probability Pθt(y|qm) of histopathological class y, given a feature vector qm. The final RF class probability can be found as follows:

\tilde{P}_1(y_m = y \mid x_m) = \tilde{P}_{\mathrm{RF}}(y_m = y \mid q_m) = \frac{1}{T} \sum_{t=1}^{T} P_{\theta_t}(y_m = y \mid q_m)    (1)

For more details, we refer the reader to the technical report [2].
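A minimal scikit-learn sketch of this step is given below; only the intensity statistics of the 14-dimensional feature vector are spelled out, and the variable names (q_train, y_train, ...) are illustrative assumptions rather than the authors' code. Note that predict_proba of a random forest averages the per-tree class probabilities, which is exactly the form of Eq. (1).

import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.ensemble import RandomForestClassifier

def intensity_features(cyst_voxels):
    """Global intensity features of a segmented cyst: mean, standard deviation,
    kurtosis, skewness and median (part of the 14-dimensional vector q)."""
    v = np.asarray(cyst_voxels, dtype=float)
    return [v.mean(), v.std(), kurtosis(v), skew(v), np.median(v)]

# q_train: (k, 14) matrix of feature vectors, y_train: histopathological labels.
rf = RandomForestClassifier(n_estimators=30, random_state=0)
# rf.fit(q_train, y_train)
# p_rf = rf.predict_proba(q_test)   # per-class probabilities, as in Eq. (1)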


3.2 CNN

As described in Sect. 4, an RF trained on the proposed quantitative features can be used for cyst classification with reasonably high accuracy. However, despite their high generalization potential, the proposed features do not take full advantage of the image information. In particular, due to variations in the internal structure of pancreatic cysts, they show different image characteristics: SCA often has a honeycomb-like appearance with a central scar or septation, MCN demonstrates a "cyst within cyst" appearance with peripheral calcification, IPMN tends to have a "cluster of grapes" appearance, and SPN typically consists of solid and cystic components [12]. However, these imaging features can overlap, especially when the cyst is small and the internal architecture cannot be differentiated. We apply a CNN as a second classifier, which can better learn barely perceptible yet important image features [10]. The proposed CNN, shown in Fig. 2(b), contains 6 Convolutional, 3 Max-pooling, 2 Dropout and 3 Fully-connected (FC) layers. Each convolutional and the first two FC layers are followed by the rectified linear unit (ReLU) activation function; the last FC layer ends with the softmax activation function to obtain the final class probabilities.

The data for training and testing the proposed CNN were generated as follows. Each 2D axial slice of the original 3D bounding box {Xij_Slice} with a segmented cyst xi was down-/up-sampled to 64×64 pixel squares, using bicubic interpolation. Visual examination confirmed the preservation of the important features. Due to the generally spherical shape of a cyst, slices near the top and the bottom of the volume do not contain enough pixels of a cyst to make an accurate diagnosis. Therefore, slices with an overlap ratio of less than 40%, defined as the percentage of cyst pixels in a slice, were excluded. We also incorporated a data augmentation routine to increase the size of the training dataset and to prevent over-fitting: (1) random rotations within a [−25°; +25°] range; (2) random vertical and horizontal flips; and (3) random horizontal and vertical translations within a [−2; +2] pixel range. The network was implemented using the Keras library (Chollet, F., https://github.com/fchollet/keras) and trained on 512-sized mini-batches to minimize the class-balanced cross-entropy loss function using Stochastic Gradient Descent with a 0.001 learning rate, momentum of 0.9, and weight decay of 0.0005 for 100 epochs. In the testing phase, each slice with an overlap ratio of more than 40% was analyzed by the CNN separately, and the final probabilities were obtained by averaging the class probabilities over slices:

\tilde{P}_2(y_m = y \mid x_m) = \tilde{P}_{\mathrm{CNN}}(y_m = y \mid \{X_{mj}^{\mathrm{Slice}}\}) = \frac{1}{J_m} \sum_{j=1}^{J_m} P_{\mathrm{CNN}}(y_m = y \mid X_{mj}^{\mathrm{Slice}})    (2)

where P_CNN(y_m = y | X_mj_Slice) is the vector of class probabilities and J_m is the number of 2D axial slices used for the classification of cyst sample x_m.
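A compact Keras sketch of a network with this layout (6 convolutional, 3 max-pooling, 2 dropout and 3 fully-connected layers, SGD with the stated learning rate and momentum) is shown below. The filter counts, dropout rates and FC widths are illustrative assumptions, not the exact configuration used by the authors, and the 0.0005 weight decay would in practice be added as an L2 kernel regularizer.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.optimizers import SGD

def build_cnn(n_classes=4):
    """Illustrative 6-Conv / 3-MaxPool / 2-Dropout / 3-FC network for 64x64 slices."""
    m = Sequential()
    m.add(Conv2D(32, (3, 3), activation='relu', padding='same',
                 input_shape=(64, 64, 1)))
    m.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
    m.add(MaxPooling2D((2, 2)))
    m.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    m.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    m.add(MaxPooling2D((2, 2)))
    m.add(Dropout(0.25))
    m.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    m.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    m.add(MaxPooling2D((2, 2)))
    m.add(Dropout(0.25))
    m.add(Flatten())
    m.add(Dense(256, activation='relu'))
    m.add(Dense(128, activation='relu'))
    m.add(Dense(n_classes, activation='softmax'))  # final class probabilities
    # Class weights for the class-balanced loss can be passed to fit(..., class_weight=...).
    m.compile(optimizer=SGD(lr=0.001, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
    return m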


3.3 Ensemble

Although our dataset is representative of the types of cysts that arise in the population, we recognize that it contains limited information and might not include enough cases of cysts of rare imaging appearance, which is crucial for obtaining robust CNN performance. Therefore, we hypothesize that the RF classifier will show better performance at classifying small cysts, which do not have enough distinctive imaging features, by utilizing the clinical information about the patient and the general intensity and shape features, whereas the CNN is expected to show similar performance but at analyzing large cysts. It has been shown that combinations of multiple classifiers (classifier ensembles) achieve superior performance compared to single-classifier models [4] by learning different, presumably independent classification subproblems separately. Therefore, after training the RF and CNN classifiers independently, we perform a Bayesian combination to ensure that the more robust and accurate classifier has more power in making the final decision. Mathematically, the final histopathological diagnosis ŷ can be written in the following way:

\hat{y}_m = \arg\max_{y \in Y} \frac{\tilde{P}_1(y_m = y \mid x_m)\,\tilde{P}_2(y_m = y \mid x_m)}{\sum_{y' \in Y} \prod_{c=1}^{2} \tilde{P}_c(y_m = y' \mid x_m)}    (3)
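Equation (3) amounts to an element-wise product of the two classifiers' class-probability vectors followed by normalization over the classes; a small NumPy sketch with hypothetical inputs is:

import numpy as np

def bayesian_ensemble(p_rf, p_cnn):
    """Combine RF and CNN class probabilities as in Eq. (3).

    p_rf, p_cnn: arrays of shape (n_samples, n_classes) holding P~_1 and P~_2.
    Returns the index of the predicted class for each sample."""
    prod = p_rf * p_cnn                       # numerator of Eq. (3)
    prod /= prod.sum(axis=1, keepdims=True)   # normalize over classes y'
    return prod.argmax(axis=1)                # arg max over y

# Example (hypothetical arrays): y_hat = bayesian_ensemble(p_rf_test, p_cnn_test)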

4 Results and Discussion

We evaluated the performance of the proposed method using a stratified 10-fold cross-validation strategy, maintaining a similar data distribution in training and testing datasets to avoid possible over- and under-representation of classes due to the imbalance in the dataset. Classification performance is reported in terms of the normalized averaged confusion matrix and the overall classification accuracy. We also analyze the dependency between the accuracy of the individual and ensemble classifiers and the average size of the misclassified cysts. All experiments were performed using an NVIDIA Titan X (12 GB) GPU. The training of the RF and CNN classifiers took approximately 1 s and 30 min, respectively, during each cross-validation loop. Computing the final class probabilities for a single test sample took roughly 1 s.

Table 1. Confusion matrices of the RF (left) and CNN (right) classifiers

RF prediction (%):
Ground truth   IPMN   MCN    SCA    SPN
IPMN           95.9   1.4    2.7    0.0
MCN            21.4   64.3   14.3   0.0
SCA            51.7   3.5    37.9   6.9
SPN            5.9    0.0    0.0    94.1

CNN prediction (%):
Ground truth   IPMN   MCN    SCA    SPN
IPMN           93.2   4.0    1.4    1.4
MCN            57.1   28.6   14.3   0.0
SCA            37.9   0.0    48.3   13.8
SPN            0.0    0.0    0.0    100.0



Results of the individual classifiers. We first compare the performance of the RF and CNN classifiers separately; the overall accuracy is 79.8% and 77.6%, respectively. The quantitative details are provided in Table 1. The experiments showed that 30 trees in the RF were sufficient for the error to converge and to achieve the best performance. Prior to developing the proposed set of quantitative features, we also evaluated the performance of the RF classifier when using only age, gender, and the location of the cyst within the pancreas, as the most objective criteria used by clinicians. The overall accuracy was 62%, and adding the volume of the cyst as a feature improved the classification by 2.2%. In addition, we investigated the performance advantages for the CNN when using the data augmentation routine. Specifically, we found that the use of data augmentation improves the overall accuracy of the CNN by 13.2%. One of the interesting, but also expected, outcomes is the average size of the misclassified cysts. In particular, the CNN classifier struggles to correctly interpret cysts of a volume smaller than 9 cm3 or 2.3 cm in diameter (the average volume and diameter of misclassified cysts are 5.1 cm3 and 1.3 cm, respectively), which are reasonably challenging due to the absence of a distinctive appearance. However, the accuracy of the RF does not show such a dependence (the average volume and diameter of misclassified cysts are 81 cm3 and 5.2 cm, respectively).

Results of the ensemble classifier. In this experiment, we test the effect of the Bayesian combination of the RF and CNN classifiers on the performance; the results are presented in Table 2. The overall accuracy is 83.6%, which is higher than the performance of the individual classifiers. It is also interesting to note the change in the average volume and diameter of the misclassified cysts, which are 65 cm3 and 4.8 cm for the ensemble model, respectively. These results validate our hypothesis and justify the decision to combine the RF and CNN classifiers into a Bayesian combination that weighs their separate diagnoses according to how accurate each has been at analyzing the training dataset.

Table 2. Confusion matrix of the final ensemble classifier

Ensemble prediction (%):
Ground truth   IPMN   MCN    SCA    SPN
IPMN           95.9   1.4    1.4    1.4
MCN            14.3   64.3   21.4   0.0
SCA            34.5   3.5    51.7   10.3
SPN            0.0    0.0    0.0    100.0

5 Conclusion and Future Work

In this work, we proposed an ensemble classification model to identify pancreatic cyst types automatically. The proposed algorithm is based on a Bayesian combination of an RF classifier and a CNN to make use of both clinical information



about the patient and fine imaging information from CT scans. The reported results showed promising performance and achieved an overall accuracy of 83.6%. However, our study faces some limitations. In particular, our dataset was limited to the four most common pancreatic cyst types. Future work will extend the model to include other types and will evaluate the ability of the algorithm to differentiate IPMNs and MCNs with low- or intermediate-grade dysplasia from those with high-grade dysplasia or an associated invasive adenocarcinoma. This differentiation is critical in determining appropriate therapy.

Acknowledgments. This research has been generously supported by The Marcus Foundation, Inc., and partially by NSF grants CNS0959979, IIP1069147, CNS1302246, NRT1633299, CNS1650499, IIS1527200, and NIH grant CA62924.

References

1. Cho, H.W., Choi, J.Y., Kim, M.J., Park, M.S., Lim, J.S., Chung, Y.E., Kim, K.W.: Pancreatic tumors: emphasis on CT findings and pathologic classification. Korean J. Radiol. 12(6), 731–739 (2011)
2. Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning. Microsoft Research Cambridge, Technical report MSR-TR-2011-114 5(6), 12 (2011)
3. Dmitriev, K., Gutenko, I., Nadeem, S., Kaufman, A.: Pancreas and cyst segmentation. In: Proceedings of SPIE Medical Imaging, p. 97842C (2016)
4. Ingalhalikar, M., Parker, W.A., Bloy, L., Roberts, T.P.L., Verma, R.: Using multiparametric data with missing features for learning patterns of pathology. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7512, pp. 468–475. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33454-2_58
5. Lennon, A.M., Wolfgang, C.L., Canto, M.I., Klein, A.P., Herman, J.M., Goggins, M., Fishman, E.K., Kamel, I., Weiss, M.J., Diaz, L.A., et al.: The early detection of pancreatic cancer: what will it take to diagnose and treat curable pancreatic neoplasia? Cancer Res. 74(13), 3381–3389 (2014)
6. Maggioni, M., Katkovnik, V., Egiazarian, K., Foi, A.: Nonlocal transform-domain filter for volumetric data denoising and reconstruction. IEEE Trans. Image Process. 22(1), 119–133 (2013)
7. Raman, S.P., Chen, Y., Schroeder, J.L., Huang, P., Fishman, E.K.: CT texture analysis of renal masses: pilot study using random forest classification for prediction of pathology. Acad. Radiol. 21(12), 1587–1596 (2014)
8. Raman, S.P., Schroeder, J.L., Huang, P., Chen, Y., Coquia, S.F., Kawamoto, S., Fishman, E.K.: Preliminary data using computed tomography texture analysis for the classification of hypervascular liver lesions: generation of a predictive model on the basis of quantitative spatial frequency measurements - a work in progress. J. Comput. Assist. Tomogr. 39(3), 383–395 (2015)
9. Sahani, D.V., Sainani, N.I., Blake, M.A., Crippa, S., Mino-Kenudson, M., del Castillo, C.F.: Prospective evaluation of reader performance on MDCT in characterization of cystic pancreatic lesions and prediction of cyst biologic aggressiveness. AJR Am. J. Roentgenol. 197(1), W53–W61 (2011)



10. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016)
11. Yang, M., Kpalma, K., Ronsin, J.: A survey of shape feature extraction techniques. In: Yin, P.-Y. (ed.) Pattern Recognition Techniques, Technology and Applications. InTech (2008). doi:10.5772/6237
12. Zaheer, A., Pokharel, S.S., Wolfgang, C., Fishman, E.K., Horton, K.M.: Incidentally detected cystic lesions of the pancreas on CT: review of literature and management suggestions. Abdom. Imaging 38(2), 331–341 (2013)

Classification of Major Depressive Disorder via Multi-site Weighted LASSO Model Dajiang Zhu1(&), Brandalyn C. Riedel1, Neda Jahanshad1, Nynke A. Groenewold2,3,4, Dan J. Stein4, Ian H. Gotlib5, Matthew D. Sacchet6, Danai Dima7,8, James H. Cole9, Cynthia H.Y. Fu10, Henrik Walter11, Ilya M. Veer12, Thomas Frodl12,13, Lianne Schmaal14,15,16, Dick J. Veltman16, and Paul M. Thompson1 1

Keck School of Medicine, Imaging Genetics Center, USC Stevens Neuroimaging and Informatics Institute, University of Southern California, CA Los Angeles, USA [email protected] 2 BCN NeuroImaging Center and Department of Neuroscience, University of Groningen, Groningen, The Netherlands 3 University Medical Center Groningen, Groningen, The Netherlands 4 Department of Psychiatry and Mental Health, University of Cape Town, Cape Town, South Africa 5 Neurosciences Program, and Department of Psychology, Stanford University, CA Stanford, USA 6 Department of Psychiatry and Behavioral Sciences, Stanford University, CA Stanford, USA 7 Department of Neuroimaging, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK 8 Department of Psychology, School of Arts and Social Science, City, University of London, London, UK 9 Department of Medicine, Imperial College London, London, UK 10 Department of Psychological Medicine, King’s College London, London, UK 11 Department of Psychiatry and Psychotherapy, Charité Universitätsmedizin Berlin, Berlin, Germany 12 Department of Psychiatry, Trinity College Dublin, Dublin, Ireland 13 Department of Psychiatry and Psychotherapy, Otto von Guericke University Magdeburg, Magdeburg, Germany 14 Department of Psychiatry and Neuroscience Campus Amsterdam, VU University Medical Center, Amsterdam, The Netherlands 15 Orygen, The National Centre of Excellence in Youth Mental Health, Parkville, Australia 16 Center for Youth Mental Health, The University of Melbourne, Melbourne, Australia Abstract. Large-scale collaborative analysis of brain imaging data, in psychiatry and neurology, offers a new source of statistical power to discover features that boost accuracy in disease classification, differential diagnosis, and outcome

Supported in part by NIH grant U54 EB020403; see ref. 3 for additional support to co-authors for cohort recruitment.


prediction. However, due to data privacy regulations or limited accessibility to large datasets across the world, it is challenging to efficiently integrate distributed information. Here we propose a novel classification framework through multi-site weighted LASSO: each site performs an iterative weighted LASSO for feature selection separately. Within each iteration, the classification result and the selected features are collected to update the weighting parameters for each feature. This new weight is used to guide the LASSO process at the next iteration. Only the features that help to improve the classification accuracy are preserved. In tests on data from five sites (299 patients with major depressive disorder (MDD) and 258 normal controls), our method boosted classification accuracy for MDD by 4.9% on average. This result shows the potential of the proposed new strategy as an effective and practical collaborative platform for machine learning on large-scale distributed imaging and biobank data.

Keywords: MDD · Weighted LASSO

1 Introduction

Major depressive disorder (MDD) affects over 350 million people worldwide [1] and takes an immense personal toll on patients and their families, placing a vast economic burden on society. MDD involves a wide spectrum of symptoms, varying risk factors, and varying response to treatment [2]. Unfortunately, early diagnosis of MDD is challenging and is based on behavioral criteria; consistent structural and functional brain abnormalities in MDD are just beginning to be understood. Neuroimaging of large cohorts can identify characteristic correlates of depression, and may also help to detect modulatory effects of interventions, and environmental and genetic risk factors. Recent advances in brain imaging, such as magnetic resonance imaging (MRI) and its variants, allow researchers to investigate brain abnormalities and identify statistical factors that influence them, and how they relate to diagnosis and outcomes [12]. Researchers have reported brain structural and functional alterations in MDD using different modalities of MRI. Recently, the ENIGMA-MDD Working Group found that adults with MDD have thinner cortical gray matter in the orbitofrontal cortices, insula, anterior/posterior cingulate and temporal lobes compared to healthy adults without a diagnosis of MDD [3]. A subcortical study – the largest to date – showed that MDD patients tend to have smaller hippocampal volumes than controls [4]. Diffusion tensor imaging (DTI) [5] reveals, on average, lower fractional anisotropy in the frontal lobe and right occipital lobe of MDD patients. MDD patients may also show aberrant functional connectivity in the default mode network (DMN) and other task-related functional brain networks [6]. Even so, classification of MDD is still challenging. There are three major barriers: first, though significant differences have been found, these previously identified brain regions or brain measures are not always consistent markers for MDD classification [7]; second, besides T1 imaging, other modalities including DTI and functional magnetic



Fig. 1. Overview of our proposed framework.

resonance imaging (fMRI) are not commonly acquired in a clinical setting; last, it is not always easy for collaborating medical centers to perform an integrated data analysis due to data privacy regulations that limit the exchange of individual raw data and due to large transfer times and storage requirements for thousands of images. As biobanks grow, we need an efficient platform to integrate predictive information from multiple centers; as the available datasets increase, this effort should increase the statistical power to identify predictors of disease diagnosis and future outcomes, beyond what each site could identify on its own. In this study, we introduce a multi-site weighted LASSO (MSW-LASSO) model to boost classification performance for each individual participating site, by integrating their knowledge for feature selection and results from classification. As shown in Fig. 1, our proposed framework has the following characteristics: (1) each site retains its own data and performs weighted LASSO regression, for feature selection, locally; (2) only the selected brain measures and the classification results are shared with other sites; (3) information on the selected brain measures and the corresponding classification results is integrated to generate a unified weight vector across features, which is then sent to each site; this weight vector is applied to the weighted LASSO in the next iteration; (4) if the new weight vector leads to a new set of brain measures and better classification performance, the new set of brain measures is sent to the other sites; otherwise, it is discarded and the old one is recovered.

2 Methods

2.1 Data and Demographics

For this study, we used data from five sites across the world. The total number of participants is 557; all were older than 21 years. Demographic information for each site's participants is summarized in Table 1.

Table 1. Demographics for the five sites participating in the current study.

Site          Total N   MDD patients (%)   Controls (%)    Age of Controls    Age of MDD        % Female   % Female
                                                           (Mean ± SD; y)     (Mean ± SD; y)    MDD        Total
1 Groningen   45        22 (48.89%)        23 (51.11%)     42.78 ± 14.36      43.14 ± 13.80     72.73      73.33
2 Stanford    110       54 (49.09%)        56 (50.91%)     38.17 ± 9.97       37.75 ± 9.78      57.41      60.00
3 BRCDECC     130       69 (53.08%)        61 (46.92%)     51.72 ± 7.94       47.85 ± 8.91      68.12      60.77
4 Berlin      172       101 (58.72%)       71 (41.28%)     41.09 ± 12.85      41.21 ± 11.82     64.36      60.47
5 Dublin      100       53 (53.00%)        47 (47.00%)     38.49 ± 12.37      41.81 ± 10.76     62.26      57.00
Combined      557       299 (53.68%)       258 (46.32%)

2.2 Data Preprocessing

As in most common clinical settings, only T1-weighted MRI brain scans were acquired at each site; quality control and analyses were performed locally. Sixty-eight (34 left/34 right) cortical gray matter regions, 7 subcortical gray matter regions and the lateral ventricles were segmented with FreeSurfer [8]. Detailed image acquisition, pre-processing, brain segmentation and quality control methods may be found in [3, 9]. Brain measures include cortical thickness and surface area for cortical regions and volume for subcortical regions and lateral ventricles. In total, 152 brain measures were considered in this study.

2.3 Algorithm Overview

To better illustrate the algorithms, we define the following notation (Tables 2 and 3):
1. Fi: the selected brain measures (features) of Site-i;
2. Ai: the classification performance of Site-i;
3. W: the weight vector;
4. w-LASSO(W, Di): performing weighted LASSO on Di with weight vector W;
5. SVM(Fi, Di): performing the SVM classifier on Di using the feature set Fi;

The algorithm has two parts: one that runs at each site and one that runs at the integration server. At first, the integration server initializes a weight vector with all ones and sends it to all sites. Each site uses this weight vector to conduct weighted LASSO (Sect. 2.6) on its own data locally. If the newly selected features yield better classification performance, the site sends the new features and the corresponding classification result to the integration server; if there is no improvement in classification accuracy, it sends the old ones. After the integration server receives the updates from all sites, it generates a new weight vector (Sect. 2.5) according to the different feature sets and their classification performance. The detailed strategy is discussed in Sect. 2.5.



Table 2. Main steps of Algorithm 1.

Algorithm 1 (Integration Server)
1. Initialize W (with all features weighted as one)
2. Send W to all sites
3. while at least one site has improvement on A
4.   update W (Section 2.5)
5.   Send W to all sites
6. end while
7. Send W with null to all sites

Table 3. Main steps of Algorithm 2.

Algorithm 2 (Site-i)
1. Fi ← ∅, Ai ← 0
2. while received W is not null
3.   Fi′ ← w-LASSO(W, Di) (Section 2.6)
4.   if Fi′ ≠ Fi
5.     Ai′ ← SVM(Fi′, Di)
6.     if Ai′ > Ai
7.       send Fi′ and Ai′ to Integration Server
8.       Fi ← Fi′, Ai ← Ai′
9.     else send Fi and Ai to Integration Server
10.    end if
11.  end if
12. end while

Algorithm 2 (Site-i) ← ∅, ← 0 while received W is not null ′ ← w-LASSO (W, ) (Section 2.6) if ′ ≠ ′ ← SVM ( ′ , ) if ′ > send ′ and ′ to Integration Server ← ′, ← ′ else send and to Integration Server end if end if end while

Ordinary LASSO and Weighted LASSO

LASSO [10] is a shrinkage method for linear regression. The ordinary LASSO is defined as:

\hat{\beta}(\mathrm{LASSO}) = \arg\min_{\beta} \left\| y - \sum_{i=1}^{n} x_i \beta_i \right\|^2 + \lambda \sum_{i=1}^{n} |\beta_i|    (1)

y and x are the observations and predictors. λ is known as the sparsity parameter. The LASSO minimizes the sum of squared errors while penalizing the sum of the absolute values of the coefficients β. As LASSO regression will force many coefficients to be zero, it is widely used for variable selection [11]. However, the classical LASSO shrinkage procedure might be biased when estimating large coefficients [12]. To alleviate this risk, the adaptive LASSO [12] was developed, which assigns each predictor a different penalty parameter; it can thus avoid penalizing larger coefficients more heavily than small ones. Similarly, the motivation of the multi-site weighted LASSO (MSW-LASSO) is to penalize different predictors (brain measures) by assigning different weights according to their classification performance across all sites. Generating the weights for each brain measure (feature) and the MSW-LASSO model are discussed in Sects. 2.5 and 2.6.


2.5 Generation of a Multi-site Weight

In Algorithm 1, after the integration server receives the information on selected features (brain measures) and the corresponding classification performance of each site, it generates a new weight for each feature. The new weight for the f-th feature is:

W_f = \sum_{s=1}^{m} W_{s,f} A_s P_s / m    (2)

W_{s,f} = \begin{cases} 1, & \text{if the } f\text{-th feature was selected in site } s \\ 0, & \text{otherwise} \end{cases}    (3)

Here m is the number of sites, A_s is the classification accuracy of site s, and P_s is the proportion of participants in site s relative to the total number of participants at all sites. Equation (3) penalizes the features that only "survived" in a small number of sites. On the contrary, if a specific feature was selected by all sites, meaning all sites agree that this feature is important, it tends to receive a larger weight. In Eq. (2) we consider both the classification performance and the proportion of samples: if a site has achieved very high classification accuracy but has a relatively small sample size compared to the other sites, the features it selected will be only conservatively "recommended" to other sites. In general, if a feature was selected by more sites and resulted in higher classification accuracy, it receives a larger weight.
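For concreteness, the update of Eqs. (2)–(3) can be written in a few lines of NumPy; the array names below are illustrative and not taken from the authors' implementation.

import numpy as np

def update_weights(selected, acc, n_subjects):
    """Aggregate per-site feature selections into one weight vector (Eqs. 2-3).

    selected   : (m, n_features) binary matrix, W_{s,f} = 1 if site s kept feature f
    acc        : (m,) classification accuracy A_s of each site
    n_subjects : (m,) number of participants per site (used to form P_s)
    """
    selected = np.asarray(selected, dtype=float)
    acc = np.asarray(acc, dtype=float)
    p = np.asarray(n_subjects, dtype=float) / np.sum(n_subjects)   # P_s
    m = selected.shape[0]
    # W_f = sum_s W_{s,f} * A_s * P_s / m
    return (selected * (acc * p)[:, None]).sum(axis=0) / m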

2.6 Multi-site Weighted LASSO

In this section, we define the multi-site weighted LASSO (MSW-LASSO) model:

\hat{\beta}(\mathrm{MSW\text{-}LASSO}) = \arg\min_{\beta} \left\| y - \sum_{i=1}^{n} x_i \beta_i \right\|^2 + \lambda \sum_{i=1}^{n} \left( 1 - \sum_{s=1}^{m} W_{s,i} A_s P_s / m \right) |\beta_i|    (4)

Here x_i represents the MRI measures after controlling for the effects of age, sex and intracranial volume (ICV), which is handled within each site; y is the label indicating MDD patient or control; and n = 152 is the number of brain measures (features) in this study. In our MSW-LASSO model, a feature with a larger weight implies higher classification performance and/or recognition by multiple sites. Hence it will be penalized less and has a greater chance of being selected by the sites that did not consider this feature in the previous iteration.
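Because Eq. (4) only rescales the L1 penalty per feature, it can be emulated with a standard LASSO solver by dividing each column of the design matrix by its penalty factor and mapping the coefficients back afterwards. The scikit-learn sketch below illustrates this under hypothetical variable names; it is not the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def msw_lasso(X, y, w, alpha=0.1):
    """Weighted LASSO of Eq. (4) via column rescaling.

    X     : (n_subjects, n_features) covariate-adjusted brain measures
    y     : (n_subjects,) labels (MDD patient vs. control)
    w     : (n_features,) aggregated weights W_f from Eq. (2)
    alpha : sparsity parameter; tuned so that roughly 16% of features survive
    """
    penalty = 1.0 - np.asarray(w, dtype=float)   # per-feature penalty factor
    penalty = np.maximum(penalty, 1e-6)          # numerical guard
    X_scaled = X / penalty                       # lightly penalized columns are inflated
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    beta = model.coef_ / penalty                 # map back to the original scale
    selected = np.flatnonzero(beta != 0)         # indices of retained features
    return beta, selected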

3 Results

3.1 Classification Improvements Through the MSW-LASSO Model

In this study, we applied Algorithms 1 and 2 to data from five sites across the world. In the first iteration, the integration server initialized a weight vector with all ones and sent it to all sites. Therefore, these five sites conducted regular LASSO regression in the first round. After a small set of features was selected, using a strategy similar to [9], within



each site, they performed classification locally using a support vector machine (SVM) and shared the best classification accuracy with the integration server, as well as the set of selected features. The integration server then generated the new weights according to Eq. (2) and sent them back to all sites. From the second iteration onwards, each site performed MSW-LASSO until none of them showed further improvement in the classification result. In total, these five sites ran MSW-LASSO for six iterations; the classification performance for each round is summarized in Fig. 2(a-e).

Fig. 2. Applying MSW-LASSO to the data coming from five sites (a-e). Each subfigure shows the classification accuracy (ACC), specificity (SPE) and sensitivity (SEN) at each iteration. (f) shows the improvement in classification accuracy at each site after performing MSW-LASSO.

Though the Stanford and Berlin sites did not show any improvements after the second iteration, the classification performance at the BRCDECC and Dublin sites continued improving until the sixth iteration. Hence our MSW-LASSO terminated at the sixth round. Figure 2f shows the improvements in classification accuracy for all five sites; the average improvement is 4.9%. The sparsity level of the LASSO was set to 16%, which means that 16% of the 152 features tend to be selected in the LASSO process. Section 3.3 shows the reproducibility of the results with different sparsity levels. When conducting SVM classification, the same kernel (RBF) was used, and we performed a grid search over possible parameters. Only the best classification results are adopted.

3.2 Analysis of MSW-LASSO Features

In the process of MSW-LASSO, only a new set of features resulting in improvements in classification is accepted. Otherwise, the prior set of features is preserved. The new features are also "recommended" to other sites by increasing their corresponding weights. Figure 3 displays the changes in the involved features through six iterations and the top 5 features selected by the majority of sites. At the first iteration, there are 88 features selected by the five sites. This number decreases over the MSW-LASSO iterations. Only 73 features are preserved after six iterations, but the average classification accuracy increased by 4.9%. Moreover, if a



Fig. 3. (a) Number of involved features through six iterations. (b-f) The top five consistently selected features across sites. Within each subfigure, the top shows the locations of the corresponding features and the bottom indicates how many sites selected this feature through the MSW-LASSO process. (b-c) are cortical thickness and (d-f) are surface area measures.

feature is originally selected by the majority of sites, it tends to be continually selected over multiple iterations (Fig. 3d-e). "Promising" features that are accepted by fewer sites at first might be incorporated by more sites as the iterations proceed (Fig. 3b-c, f).

3.3 Reproducibility of the MSW-LASSO

For LASSO-related problems, there is no closed-form solution for the selection of the sparsity level; this is highly data dependent. To validate our MSW-LASSO model, we repeated Algorithms 1 and 2 at different sparsity levels, which leads to the preservation of different proportions of the features. The reproducibility performance of our proposed MSW-LASSO is summarized in Table 4.

Table 4. Reproducibility results with different sparsity levels. The selected-features column gives the percentage of features preserved during the LASSO procedure; the remaining columns give the average improvement (in %) in accuracy (ACC), specificity (SPE), and sensitivity (SEN) at that sparsity level.

Selected features   ACC   SPE   SEN
13%                 3.1   1.8   4.4
20%                 3.9   1.4   6.0
23%                 3.8   2.9   4.4
26%                 4.3   3.4   5.2
30%                 2.9   3.0   2.9
33%                 2.6   3.1   2.5
36%                 1.7   2.1   1.5
40%                 2.5   4.1   1.4
43%                 3.1   1.1   5.0
46%                 2.8   3.9   1.9



4 Conclusion and Discussion

Here we proposed a novel multi-site weighted LASSO model to heuristically improve classification performance for multiple sites. By sharing the knowledge of features that might help to improve classification accuracy with other sites, each site has multiple opportunities to reconsider its own set of selected features and strive to increase the accuracy at each iteration. In this study, the average improvement in classification accuracy is 4.9% for five sites. We offer a proof of concept for distributed machine learning that may be scaled up to other disorders, modalities, and feature sets.

References

1. World Health Organization: Depression Fact Sheet, No. 369 (2012). http://www.who.int/mediacentre/factsheets/fs369/en/
2. Fried, E.I., et al.: Depression is more than the sum score of its parts: individual DSM symptoms have different risk factors. Psych. Med. 44(10), 2067–2076 (2014)
3. Schmaal, L., et al.: Cortical abnormalities in adults and adolescents with major depression based on brain scans from 20 cohorts worldwide in the ENIGMA Major Depressive Disorder Working Group. Mol. Psych. (2016). doi:10.1038/mp.2016.60
4. Schmaal, L., et al.: Subcortical brain alterations in major depressive disorder: findings from the ENIGMA Major Depressive Disorder working group. Mol. Psych. 21(6), 806–812 (2016)
5. Liao, Y., et al.: Is depression a disconnection syndrome? Meta-analysis of diffusion tensor imaging studies in patients with MDD. J. Psych. Neurosci. 38(1), 49 (2013)
6. Sambataro, F., et al.: Revisiting default mode network function in major depression: evidence for disrupted subsystem connectivity. Psych. Med. 44(10), 2041–2051 (2014)
7. Lo, A., et al.: Why significant variables aren't automatically good predictors. PNAS 112(45), 13892–13897 (2015)
8. https://surfer.nmr.mgh.harvard.edu/
9. Zhu, D., et al.: Large-scale classification of major depressive disorder via distributed Lasso. Proc. SPIE 10160, 101600Y-1 (2017)
10. Tibshirani, R.: Regression shrinkage and selection via the LASSO. J. Roy. Stat. Soc. 58, 267–288 (1996)
11. Li, Q., Yang, T., Zhan, L., Hibar, D.P., Jahanshad, N., Wang, Y., Ye, J., Thompson, P.M., Wang, J.: Large-scale collaborative imaging genetics studies of risk genetic factors for Alzheimer's Disease across multiple institutions. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 335–343. Springer, Cham (2016). doi:10.1007/978-3-319-46720-7_39
12. Zou, H.: The adaptive LASSO and its oracle properties. J. Amer. Statist. Assoc. 101(476), 1418–1429 (2006)

A Multi-atlas Approach to Region of Interest Detection for Medical Image Classification Hongzhi Wang(B) , Mehdi Moradi, Yaniv Gur, Prasanth Prasanna, and Tanveer Syeda-Mahmood IBM Almaden Research Center, San Jose, USA [email protected]

Abstract. A common approach for image classification is based on image feature extraction and supervised discriminative learning. For medical image classification problems where discriminative image features are spatially distributed around certain anatomical structures, localizing the region of interest (ROI) essential for the classification task is a key to success. To address this problem, we develop a multi-atlas label fusion technique for automatic ROI detection. Given a set of training images with class labels, our method infers voxel-wise scores for each image showing how discriminative each voxel is for categorizing the image. We applied our method in a 2D cardiac CT body part classification application and show the effectiveness of the detected ROIs.

1 Introduction

Medical image classification is often addressed by supervised learning. Given training images with class labels, classifiers are trained to directly map image features to class labels. Image features can be extracted from the entire images. However, for applications such as disease classification [3,4] and body part recognition [8,11], where the most informative image features are locally distributed around certain anatomical structures, localizing the region of interest (ROI) essential for the classification task is a key to success. When prior knowledge is available, e.g. the hippocampus is known to be informative for discriminating patients with Alzheimer's disease, manual ROI labeling can be employed [4]. However, manual ROI labeling is time consuming and cannot be applied without prior knowledge. We address automatic ROI detection through multi-atlas classification [4]. Our method detects one ROI for each class such that image features extracted from a class-specific ROI can more optimally discriminate the class (see Fig. 1). Relevant to our work, learning-based patch selection [11] aims to find discriminative/informative patches from pre-extracted patches. For such methods, accurate ROI patch selection relies on the fact that the proper ROI patches are already included in the pre-selected patches. It is inefficient for handling the situation where discriminative patches may have large scale variations across classes. In contrast, our method infers voxel-wise estimation for each image showing how



discriminative each voxel is for categorizing the image, from which an accurate ROI can be derived through thresholding (see Fig. 1). Our application is 2D body level recognition using chest CT, which is also related to work in anatomy localization [5,6]. By dividing a full set of axial CT slices into different levels, with each level semantically labeled for the most significant anatomy or potential disease, the task of screening for a specific disease can be reduced to running a 3D algorithm on a small number of slices, making subsequent analysis more efficient. For this reason, instead of aiming for localizing specific organs, we aim for finding the best matching levels for a given image to optimally serve for specific disease analysis. We show state of the art body part classification performance and show that employing the class specific ROIs derived by our method substantially improves classification performance.

2 Multi-atlas ROI Detection for Image Classification

2.1 Problem Definition for ROI Detection

Let I = {I_1, ..., I_n}, where n is the total number of class labels and I_l = {I_l^1, ..., I_l^{n_l}} contains images of class l, with n_l = |I_l|. Our work is based on the assumption that each image contains a ROI with features that are common and unique for images from the same class. Let p(l|I, x) = p(l|I(N(x))) be the probability that image I is from class l given the features located at x, where N is a neighborhood surrounding x. ROI detection can be addressed by estimating p(l|I(N(x))) for every training image. Discriminative learning is a common technique for estimating p(l|I(N(x))) given training images. For medical images, where the ROI is anatomy dependent, we propose to apply multi-atlas label fusion techniques to address this problem.

2.2 Multi-atlas Image Classification

Multi-atlas segmentation is a powerful technique for anatomy segmentation. The technique applies image registration to propagate anatomy labels from training images to a target image and applies label fusion to resolve conflicting labels produced by warping multiple training images [9]. Recently, the same methodology is successfully applied for disease classification [4]. Our work builds on [4] and extends it for general multi-atlas image classification. Let {A1F , ..., Am F } be m images with image class labels, warped to a target image I by deformable registration. The posterior label probability can be estimated through image similarity-based locally weighted voting as follows: p(l|I(N (x))) =

m 

wxi p(l|AiF (N (x)))

(1)

i=1

p(l|AiF (N (x))) is the probability that the observed image patch AiF (N (x)) is from class l. We apply p(l|AiF (N (x))) = 1 if AiF is from class l and 0 otherwise.



\{w_x^i\}_{i=1}^{m} are voting weights with \sum_{i=1}^{m} w_x^i = 1. In our experiments, we applied joint label fusion [10] to estimate the voting weights. The probability computed from (1) evaluates the classification problem from local patches. A final classification decision must be derived by aggregating the local evaluation results. The simplest aggregation scheme is global aggregation:

p(l \mid I) \propto \sum_{x \in I} p(l \mid I(N(x)))    (2)

Since not every anatomical region is equally important for distinguishing different classes, with class-specific ROIs defined, the ROI-based aggregation may produce more reliable results:

p(l \mid I) \propto \frac{1}{\sum_{ROI_l(x)=1} 1} \sum_{ROI_l(x)=1} p(l \mid I(N(x)))    (3)

ROI_l is the binary ROI mask for class l. Note that for binary classification problems, a common ROI can be shared by the two classes. For classification problems with more than two classes, the optimal ROIs for discriminating different classes from the remaining classes may be different. With the aggregated scores, classification is achieved by choosing the label with the maximal score, i.e. \arg\max_l p(l \mid I).
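A toy NumPy sketch of the two aggregation rules (Eqs. 2 and 3) over a voxel-wise posterior map is given below; the array layout is an assumption made for illustration.

import numpy as np

def aggregate_global(posterior):
    """Eq. (2): sum the voxel-wise posteriors p(l|I(N(x))) over the whole image.

    posterior : (n_classes, ...) array of label-fusion posteriors per voxel."""
    scores = posterior.reshape(posterior.shape[0], -1).sum(axis=1)
    return scores / scores.sum()

def aggregate_roi(posterior, roi_masks):
    """Eq. (3): for each class l, average the posterior inside its ROI_l.

    roi_masks : (n_classes, ...) binary masks, one per class (all of equal size)."""
    n_classes = posterior.shape[0]
    scores = np.empty(n_classes)
    for l in range(n_classes):
        mask = roi_masks[l].astype(bool)
        scores[l] = posterior[l][mask].mean()   # (1 / |ROI_l|) * sum over ROI_l
    return scores / scores.sum()

# predicted_class = np.argmax(aggregate_roi(posterior, roi_masks))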

2.3 Multi-atlas ROI Detection

For a training image I from class l, p(l|I, x) defines a voxel-wise discriminativeness score for I. If p(l|I, x) is close to 1, then the image feature I(N(x)) is a discriminative signature for l because the feature must appear more often in training images from class l than in training images from other classes. On the other hand, if p(l|I, x) is small, then the feature is common in other classes. To estimate the ROI for each training image, we apply multi-atlas image classification in a leave-one-subject-out cross-validation using all training images. Using this approach, for a training image I, the initial estimation p^0(l|I, x) is obtained from (1) by applying images from the remaining training subjects from all classes as atlases. Since the above estimations are produced for each training image independently, the ROI estimations produced using different images from one class may be inconsistent with each other due to noise effects. The inconsistency indicates that the discriminative regions may be poorly generalized to new images from the same class. To ensure that the detected ROIs can be well generalized, we apply a coordinate descent technique to smooth the label posterior estimations obtained among images from the same class. For class q, let I_q^i ∈ I_q. We apply an iterative process to smooth the label posterior estimations among all training images from class q. At each iteration, the label posteriors produced for each image are updated one image at a time, based on the posterior estimations produced for the other training images from the same class at the previous iteration. At iteration t, the updated posterior estimation for I_q^i is obtained by:

p^t(l \mid I_q^i(N(x))) = \sum_{j=1,\, j \neq i}^{n_q} w_x^j \, p^{t-1}(l \mid I_q^{j \to i}, x)    (4)

I_q^{j→i} is the image warped from I_q^j to I_q^i through deformable registration. Again, the voting weights are computed using joint label fusion. The iterative process stops when the differences produced by consecutive iterations are smaller than a preselected threshold or the maximal number of iterations has been reached.
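The iterative smoothing of Eq. (4) reduces to repeated weighted averaging once all posterior maps and voting-weight maps have been resampled onto a common grid; the sketch below makes that simplifying assumption (the deformable warping is taken as already done) and uses illustrative names.

import numpy as np

def smooth_posteriors(posteriors, weights, max_iter=5, tol=1e-3):
    """Coordinate-descent smoothing of label posteriors within one class (Eq. 4).

    posteriors : list of per-image posterior maps p(l | I_q^i, x), same shape
    weights    : weights[i][j] = voting-weight map of image j within image i
    All maps are assumed to be warped/resampled onto a common grid already."""
    current = [p.copy() for p in posteriors]
    for _ in range(max_iter):
        previous = [p.copy() for p in current]
        for i in range(len(current)):
            acc = np.zeros_like(current[i])
            for j in range(len(current)):
                if j != i:
                    acc += weights[i][j] * previous[j]
            current[i] = acc
        change = max(np.abs(c - p).max() for c, p in zip(current, previous))
        if change < tol:   # stop when consecutive iterations barely differ
            break
    return current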

2.4 ROI-Based Image Classification

In this section, we show how to apply the discriminativeness maps produced for the training images in ROI-based image classification, using two different classification methods: multi-atlas classification and general learning-based classification.

Multi-atlas classification. In multi-atlas classification, registration is already computed between each training image and a testing image; based on these registrations, the voxel-wise discriminativeness scores are propagated to the testing image from each training image. Each warped discriminativeness map provides a spatial prior on the discriminative anatomical regions for the corresponding class. The consensus discriminativeness map for each class is derived by averaging all propagated maps from the respective class. A threshold is applied to produce a ROI segmentation that only includes regions with the highest discriminative scores from each consensus discriminativeness map. To avoid bias, the ROI segmentations produced for different classes all have the same size. Then, the ROI-based multi-atlas classification, i.e. (3), is applied to classify the testing image.

Learning-based classification. For efficient ROI propagation from training images to testing images, we employ a template-based approach. First, one template is created for each class by the unbiased template building technique [7] using all training images from the respective class. For each class, the discriminativeness map of each training image from the class is warped to the class template through deformable registration. The class-specific ROI segmentation is created for each class by thresholding the averaged discriminative score map in the class template space. Given a testing image, the ROI segmentations are propagated to the testing image from each class-specific template through deformable registration. After propagating the class-specific ROIs to an image, one image patch is extracted for each class from the image, where the patch is the minimal rectangle containing the respective ROI segmentation. Although the ROI segmentations for different classes all have the same size, depending on how tightly the segmentations are spatially distributed, the image patches may have substantial variation in size across classes (see Fig. 1 for examples). For patch feature extraction, we tested four types of features: histogram of gradients (HoG), local binary patterns (LBP), Haar features, and features generated from the fully connected layers of a convolutional neural network (CNN)



Fig. 1. Semantic categories of example axial cardiac CT slices. Moving in the superior-inferior direction, the classes are (from upper left to lower right in the image): l = 1: Thoracic inlet/supraclavicular region, l = 2: Lung apex/sternum, l = 3: Origin of great vessels/aortic arch, l = 4: Aortic arch/prevascular space, l = 5: Ascending aorta/descending aorta/aortopulmonary window, l = 6: Pulmonary trunk/origin of right and left pulmonary arteries, l = 7: Aortic valve/aortic root/origin of ascending aorta, l = 8: Axial four chamber view, l = 9: Axial two chamber view. Next to each image is the discriminativeness map estimated by our method. The anatomical regions that are essential for defining each class are properly highlighted. ROI segmentations with a size of 3% of the image size, derived from the discriminativeness maps, are shown in red contours on the raw images. Turquoise rectangles show the corresponding ROI image patches.

pre-trained on ImageNet [2]. We tested the classification performance obtained by using each feature type separately. The final feature representation for one image is obtained by concatenating the features extracted from the class-specific ROI patches of all classes. For classification, a linear support vector machine classifier for multi-class classification is applied.
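For concreteness, a minimal sketch of this learning-based classification step is given below. It assumes the class-specific ROI patches have already been extracted and, as an additional simplification not stated above, resamples every patch to a fixed size before computing HoG descriptors; the patch size, HoG parameters and SVM regularisation are illustrative defaults rather than the settings used in the paper.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

PATCH_SHAPE = (64, 64)   # illustrative: patches of each class resampled to a fixed size

def image_descriptor(roi_patches):
    """Concatenate HoG descriptors of the class-specific ROI patches of one image.

    roi_patches: list with one 2D patch (NumPy array) per class, extracted as the
                 minimal rectangle around the propagated ROI segmentation.
    """
    feats = []
    for patch in roi_patches:
        patch = resize(patch, PATCH_SHAPE, anti_aliasing=True)
        feats.append(hog(patch, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)))
    return np.concatenate(feats)

def train_roi_classifier(train_patches, train_labels):
    """train_patches: per-image lists of class-specific ROI patches; labels: class ids."""
    X = np.stack([image_descriptor(p) for p in train_patches])
    clf = LinearSVC(C=1.0)   # linear SVM; multi-class handled one-vs-rest by default
    clf.fit(X, np.asarray(train_labels))
    return clf
```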

3 Experiments

3.1 Data Description

75 axially acquired chest CT scans were used in our study. Representative 2D axial slices were selected from the 3D CT dataset and categorized into nine semantic classes by two experienced radiologists to capture the most significant cardiac anatomy for disease detection (as shown in Fig. 1). Since the CT scans were acquired for characterising different cardiac diseases, the body regions covered by different scans vary. Hence, not all nine body part classes are visible in every CT scan. When a body part class is visible in a CT scan, a representative slice was chosen by a clinician for that class. A total of 519 labeled 2D images were generated. The average spatial distance between adjacent semantic classes is 22 mm.


3.2 Experiment Setup

We conducted 5-fold cross-validation at the subject level. Recall that the class-specific ROI segmentation is produced by thresholding the consensus discriminativeness priors propagated from training images. The size of the ROI segmentation is a free parameter in our experiment. To choose an optimal ROI segmentation size, we applied a parameter search for each cross-validation experiment by running a leave-one-subject-out test with multi-atlas classification on the training subjects. The search ranges from 1% to 5% of the image size, in steps of 1% of the image size. For each cross-validation experiment, the parameter that produced the best classification performance on the training images is applied to generate the ROI segmentations for testing images, for both multi-atlas and learning-based classification.

Evaluation criterion. Following [11], we define margin 0/1 accuracy. In margin 0 accuracy, a predicted label is correct if and only if it equals the ground truth label l. In margin 1 accuracy, a predicted label is correct if and only if it is located within one spatial neighbor of the ground truth.
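Since the nine classes are ordered along the superior-inferior direction, the margin 0/1 accuracy can be computed directly from the integer class indices; a small sketch (assuming labels 1–9):

```python
import numpy as np

def margin_accuracy(predicted, truth, margin=0):
    """Fraction of predictions within `margin` spatial neighbours of the true class.

    With margin=0 this is standard accuracy; with margin=1 a prediction is also
    counted as correct when it is an immediate neighbour of the ground-truth class
    along the superior-inferior ordering of the nine semantic classes.
    """
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    return float(np.mean(np.abs(predicted - truth) <= margin))

# example: margin_accuracy([3, 5, 9], [3, 6, 7], margin=1) -> 2/3
```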

3.3 Implementation Details

Image registration was computed using Advanced Normalization Tools (ANTs) [1] with the SyN deformation model and the mutual information similarity metric. Joint label fusion was applied with default parameters, except that local patch searching was not applied. With this setting, each registration task can be computed within a second and each label fusion task within 30 seconds. In our experiments, we fixed the maximal number of iterations for the coordinate descent based label posterior smoothing to five.

3.4 Results

Figure 2 shows estimated label posteriors produced for one image (l = 6) in Fig. 1. The noise effect is clearly visible in the initial label posterior estimations and is greatly reduced after smoothing among images from the same class. The semantic labels of this image are pulmonary trunk/origin of right and left pulmonary arteries. Label posterior map for l = 6, i.e. the estimated discriminativeness

Fig. 2. Posteriors produced for image l = 6 in Fig. 1. First/second rows are initial/final estimations, respectively. p(l = 6|I) is the discriminativeness map for the image.



map, shows the highest intensity. It is also noteworthy that the area of the anatomy corresponding to the semantic labels of this class has the highest probability values within the discriminativeness map. Figure 1 shows the discriminativeness maps and the ROI segmentations/patches produced for the example images. The produced discriminativeness maps accurately reflect the most discriminative anatomical regions for each class. For instance, the vessel region is highlighted for class 3, the origin of great vessels. The aortic and pulmonary vessels are highlighted for classes 4, 5, and 6. The highlighted region produced for class 7 (aortic root) is around the aortic region. The cardiac regions are highlighted in the two/four chamber view classes. Note that although the ROI regions for different classes may overlap each other, the signature anatomical features for distinguishing different classes are different.

Classification Accuracy. Multi-atlas classification without class-specific ROI segmentation produced 84.8%/97.2% margin 0/1 accuracy, respectively. Using class-specific ROIs improved the accuracy to 92.1%/99.2%, respectively. Table 1 summarizes the learning-based classification results. When image features are extracted from global images, the best margin 0 and margin 1 accuracy produced by a single feature type are 64.7% and 90.8%, respectively. The results improve to 81.9% and 96.3%, respectively, when class-specific ROI patches are used. These results clearly demonstrate that the class-specific ROIs derived by our method accurately locate discriminative anatomical regions for the classification task. Note that since the applied CNN is pre-trained on natural images, the CNN features performed competitively but worse than the HoG features.

Table 1. Margin 0/1 accuracy by learning-based and multi-atlas classification.

Feature type | HoG       | LBP       | Haar      | CNN       | Multi-atlas classification
Global       | 64.7/90.4 | 43.9/77.1 | 61.3/89.4 | 60.5/90.8 | 84.8/97.2
ROI-based    | 81.9/96.3 | 58.3/86.2 | 74.8/90.3 | 74.0/94.4 | 92.1/99.2

Our ROI-based multi-atlas classification results also compare favorably to the state-of-the-art learning-based technique. [8] applied a hybrid learning approach that integrates multiple image feature types extracted from global images for semantic classification of the CT slices and reported results on the same dataset as ours. The evaluation criterion used in [8] is a variant of ours: the margin 0 accuracy used in [8] is similar to the margin 1 accuracy used in our work. Hence, under our criterion, [8] produced a ∼91.4% margin 1 accuracy. Our ROI-based classification results produced by multi-atlas classification and by learning-based classification with a single feature type are both substantially better. [11] developed a deep learning approach for a similar body part recognition problem using body CT, where 11 categories were created to cover the whole body, including the head, trunk, and extremities. Using over 2000 training



images, about five times the number of training images used in our experiments, [11] produced 89.8% margin 0 accuracy and 99.1% margin 1 accuracy. Most errors produced by ROI-based multi-atlas classification are within one spatial neighbor of the ground truth. Hence, the average localization error is bounded by the average distance between adjacent semantic classes, i.e. 22 mm, which is in a similar accuracy range as recent 3D anatomy localization techniques such as [5]. Note that direct comparisons of quantitative results across publications are difficult and not always fair due to inconsistencies in problem definition and patient population. However, the comparisons indicate the highly competitive performance produced by our method.

4 Conclusions and Discussion

We extended multi-atlas classification to more general ROI-based classification and, based on this extension, developed a technique to automatically detect discriminative ROIs for image classification. Given a set of training images with image-level class labels, our method produces voxel-wise estimations for each training image indicating the spatially varying discriminativeness for categorizing the image. We presented a fast approach for deriving class-specific ROI patches for new testing images using the discriminativeness maps produced on the training images. Using class-specific ROIs substantially improved classification accuracy for both multi-atlas classification and classical learning-based classification.

References

1. Avants, B., Epstein, C., Grossman, M., Gee, J.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. MedIA 12(1), 26–41 (2008)
2. Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: delving deep into convolutional nets. In: BMVC (2014)
3. Chen, E., Chung, P., Chen, C., Tsai, H., Chang, C.: An automatic diagnostic system for CT liver image classification. IEEE Trans. BE 45(6), 783–794 (1998)
4. Coupé, P., Eskildsen, S.F., Manjón, J.V., Fonov, V.S., Collins, D.L.: Simultaneous segmentation and grading of anatomical structures for patient's classification: application to Alzheimer's disease. NeuroImage 59(4), 3736–3747 (2012)
5. Criminisi, A., Robertson, D., Konukoglu, E., Shotton, J., Pathak, S., White, S., Siddiqui, K.: Regression forests for efficient anatomy detection and localization in computed tomography scans. Med. Image Anal. 17(8), 1293–1303 (2013)
6. Donner, R., Menze, B.H., Bischof, H., Langs, G.: Global localization of 3D anatomical structures by pre-filtered hough forests and discrete optimization. Med. Image Anal. 17(8), 1304–1314 (2013)
7. Joshi, S., Davis, B., Jomier, M., Gerig, G.: Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23, 151–160 (2004)
8. Moradi, M., Gur, Y., Wang, H., Prasanna, P., Syeda-Mahmood, T.: A hybrid learning approach for semantic labeling of cardiac CT slices and recognition of body position. In: ISBI, pp. 1418–1421. IEEE (2016)
9. Rohlfing, T., Brandt, R., Menzel, R., Russakoff, D.B., Maurer Jr., C.R.: Quo Vadis, atlas-based segmentation? In: Suri, J.S., Wilson, D.L., Laxminarayan, S. (eds.) Handbook of Biomedical Image Analysis. Topics in Biomedical Engineering International Book Series, vol. 3, pp. 435–486. Springer, Boston (2005). doi:10.1007/0-306-48608-3_11
10. Wang, H., Suh, J.W., Das, S., Pluta, J., Craige, C., Yushkevich, P.: Multi-atlas segmentation with joint label fusion. IEEE Trans. PAMI 35(3), 611–623 (2013)
11. Yan, Z., Zhan, Y., Peng, Z., Liao, S., Shinagawa, Y., Metaxas, D.N., Zhou, X.S.: Bodypart recognition using multi-stage deep learning. In: IPMI, pp. 449–461 (2015)

Spectral Graph Convolutions for Population-Based Disease Prediction

Sarah Parisot (B), Sofia Ira Ktena, Enzo Ferrante, Matthew Lee, Ricardo Guerrerro Moreno, Ben Glocker, and Daniel Rueckert

Biomedical Image Analysis Group, Imperial College London, London, UK
[email protected]

Abstract. Exploiting the wealth of imaging and non-imaging information for disease prediction tasks requires models capable of representing, at the same time, individual features as well as data associations between subjects from potentially large populations. Graphs provide a natural framework for such tasks, yet previous graph-based approaches focus on pairwise similarities without modelling the subjects’ individual characteristics and features. On the other hand, relying solely on subjectspecific imaging feature vectors fails to model the interaction and similarity between subjects, which can reduce performance. In this paper, we introduce the novel concept of Graph Convolutional Networks (GCN) for brain analysis in populations, combining imaging and non-imaging data. We represent populations as a sparse graph where its vertices are associated with image-based feature vectors and the edges encode phenotypic information. This structure was used to train a GCN model on partially labelled graphs, aiming to infer the classes of unlabelled nodes from the node features and pairwise associations between subjects. We demonstrate the potential of the method on the challenging ADNI and ABIDE databases, as a proof of concept of the benefit from integrating contextual information in classification tasks. This has a clear impact on the quality of the predictions, leading to 69.5% accuracy for ABIDE (outperforming the current state of the art of 66.8%) and 77% for ADNI for prediction of MCI conversion, significantly outperforming standard linear classifiers where only individual features are considered.

1 Introduction

Recent years have seen an increasing volume of medical image data being collected and stored. Large scale collaborative initiatives are acquiring and sharing hundreds of terabytes of imaging, genetic and behavioural data. With this novel wealth of imaging and non-imaging data, there is a need for models capable of representing potentially large populations and exploiting all types of information. Graphs provide a natural way of representing populations and their similarities. In such a setting, each subject acquisition is represented by a node



and pairwise similarities are modelled via weighted edges connecting the nodes. Such models provide powerful tools for population analysis and integration of non-imaging data, such as manifold learning [2,15] or clustering algorithms [11]. Nonetheless, all the available information is encoded via pairwise similarities, without modelling the subjects' individual characteristics and features. On the other hand, relying solely on imaging feature vectors, e.g. to train linear classifiers as in [1], fails to model the interaction and similarity between subjects. This can make generalisation more difficult and reduce performance, in particular when the data is acquired using different imaging protocols.

Convolutional Neural Networks (CNNs) have found numerous applications on 2D and 3D images, as powerful models that exploit features (e.g. image intensities) and neighbourhood information (e.g. the regular pixel grid) to yield hierarchies of features and solve problems like image segmentation [7] and classification. The task of subject classification in populations (e.g. for diagnosis) can be compared to image segmentation where each pixel is to be classified. In this context, an analogy can be made between an image pixel and its intensity, and a subject and its corresponding feature vectors, while the pairwise population graph equates to the pixel grid, describing the neighbourhood structure for convolutions. However, the application of CNNs on irregular graphs is not straightforward. It requires the definition of local neighbourhood structures and node orderings for convolution and pooling operations [10], which can be challenging for irregular graph structures. Recently, graph CNNs were introduced [4], exploiting the novel concept of signal processing on graphs [13], which uses computational harmonic analysis to analyse signals defined on irregular graph structures. These properties allow convolutions in the graph spatial domain to be treated as multiplications in the graph spectral domain, extending CNNs to irregular graphs in a principled way. Such a graph CNN formulation was successfully used in [8] to perform classification of large citation datasets.

Contributions. In this paper, we introduce the novel concept of Graph Convolutional Networks (GCN) for brain analysis in populations, combining imaging and non-imaging data. Our goal is to leverage the auxiliary information available with the imaging data to integrate similarities between subjects within a graph structure. We represent the population as a graph where each subject is associated with an imaging feature vector and corresponds to a graph vertex. The graph edge weights are derived from phenotypic data, and encode the pairwise similarity between subjects and the local neighbourhood system. This structure is used to train a GCN model on partially labelled graphs, aiming to infer the classes of unlabelled nodes from the node features and pairwise associations between subjects. We demonstrate the potential of the method on two databases, as a proof of concept of the advantages of integrating contextual information in classification tasks. First, we classify subjects from the Autism Brain Imaging Data Exchange (ABIDE) database as healthy or suffering from Autism Spectrum Disorders (ASD). The ABIDE dataset comprises highly heterogeneous functional MRI data acquired at multiple sites. We show how integrating acquisition information allows us to outperform the current state of the art on the whole



dataset [1], with a global accuracy of 69.5%. Second, using the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, we show how our model allows us to seamlessly integrate longitudinal data and provides a significant increase in performance, reaching 77% accuracy for the challenging task of predicting the conversion from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD). The code is publicly available at https://github.com/parisots/population-gcn.

2 Methods

We consider a database of N acquisitions comprising imaging (e.g. resting-state fMRI or structural MRI) and non-imaging phenotypic data (e.g. age, gender, acquisition site, etc.). Our objective is to assign to each acquisition, corresponding to a subject and time point, a label l ∈ L describing the corresponding subject's disease state (e.g. control or diseased). To this end, we represent the population as a sparse graph G = {V, E, W} where W is the adjacency matrix describing the graph's connectivity. Each acquisition Sv is represented by a vertex v ∈ V and is associated with a C-dimensional feature vector x(v) extracted from the imaging data. The edges E of the graph represent the similarity between the subjects and incorporate the phenotypic information. The graph labelling is done in a semi-supervised fashion, through the use of a GCN trained on a subset of labelled graph vertices. Intuitively, label information is propagated over the graph under the assumption that nodes connected with high edge weights are more comparable. An overview of the method is shown in Fig. 1.

Fig. 1. Overview of the pipeline used for classification of population graphs using Graph Convolutional Networks.

2.1 Databases and Preprocessing

We apply our model on two large and challenging databases for binary classification tasks. With the ABIDE database, we aim to separate healthy controls from ASD patients and exploit the acquisition information which can strongly affect the comparability of subjects. Our goal on the ADNI database is to predict whether an MCI patient will convert to AD. Our objective is to demonstrate the



importance of exploiting longitudinal information, which can be easily integrated into our graph structure, to increase performance. The ABIDE database [6] aggregates data from different acquisition sites and openly shares functional MRI and phenotypic data of 1112 subjects¹. We select the same set of 871 subjects used in [1], comprising 403 individuals with ASD and 468 healthy controls acquired at 20 different sites. To ensure a fair comparison with the state of the art [1], we use the same preprocessing pipeline [3], which involves skull stripping, slice timing correction, motion correction, global mean intensity normalisation, nuisance signal regression, band-pass filtering (0.01–0.1 Hz) and registration of the functional MRI images to MNI152 standard anatomical space. The mean time series for a set of regions extracted from the Harvard Oxford (HO) atlas [5] were computed and normalised to zero mean and unit variance. The individual connectivity matrices S1, ..., SN are estimated by computing the Fisher transformed Pearson's correlation coefficient between the representative rs-fMRI time series of each ROI in the HO atlas. The ADNI database is the result of efforts from several academic and private co-investigators². To date, ADNI in its three studies (ADNI-1, -GO and -2) has recruited over 1700 adults, aged between 55 and 90 years, from over 50 sites in the U.S. and Canada. In this work, a subset of 540 early/late MCI subjects that contained longitudinal T1 MR images and their respective anatomical segmentations was used. In total, 1675 samples were available, with 289 subjects (843 samples) diagnosed as AD at any time during follow-up and labelled as converters. Longitudinal information ranged from 6 to 96 months, depending on the subject. Acquisitions after conversion to AD were not included. As of 1st of July 2016 the ADNI repository contained 7128 longitudinal T1 MR images from 1723 subjects. ADNI-2 is an ongoing study and therefore data is still growing. Therefore, at the time of a large scale segmentation analysis (into 138 anatomical structures using MALP-EM [9]) only a subset of 1674 subjects (5074 images) was processed, from which the subset used here was selected.
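A minimal sketch of the connectivity computation described above, assuming the normalised ROI time series are already available as a NumPy array:

```python
import numpy as np

def connectivity_matrix(timeseries):
    """Fisher-transformed Pearson correlation between ROI time series.

    timeseries: array of shape (num_rois, num_timepoints); each row is the mean
                rs-fMRI signal of one Harvard-Oxford ROI, normalised to zero mean
                and unit variance beforehand.
    Returns a (num_rois, num_rois) connectivity matrix.
    """
    corr = np.corrcoef(timeseries)
    np.fill_diagonal(corr, 0.0)   # avoid arctanh(1) = inf on the diagonal
    return np.arctanh(corr)       # Fisher z-transform
```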

2.2 Population Graph Construction

The proposed model requires two critical design choices: (1) the definition of the feature vector x(v) describing each sample, and (2) modelling the interactions between samples via the definition of the graph edges E. We keep the feature vectors simple so as to focus on evaluating the impact of integrating contextual information on the classification performance. For the ABIDE dataset, we follow the method adopted by [1] and define a subject's feature vector as its vectorised functional connectivity matrix. Due to the high dimensionality of the connectivity matrix, a ridge classifier is employed to select the most discriminative features from the training set. For the ADNI dataset, we simply use the volumes of all 138 segmented brain structures.

¹ http://preprocessed-connectomes-project.org/abide/
² http://adni.loni.usc.edu
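One possible realisation of the ridge-based feature selection mentioned above is sketched below. The exact selection rule is not specified here, so keeping the entries with the largest absolute ridge coefficients is an assumption; the number of retained features (2000 for ABIDE) is taken from the experimental section.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

def vectorise(conn):
    """Vectorise the upper triangle of a symmetric connectivity matrix."""
    iu = np.triu_indices_from(conn, k=1)
    return conn[iu]

def select_features(train_conns, train_labels, num_features=2000):
    """Pick the most discriminative connectivity entries on the training set.

    The selection rule (largest absolute ridge coefficients) is an illustrative
    assumption; only the use of a ridge classifier is stated in the text.
    """
    X = np.stack([vectorise(c) for c in train_conns])
    ridge = RidgeClassifier().fit(X, train_labels)
    idx = np.argsort(np.abs(ridge.coef_).ravel())[-num_features:]
    return idx   # apply as X[:, idx] on both training and test subjects
```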



The definition of the graph's edges is critical in order to capture the underlying structure of the data and explain the similarities between the feature vectors. We construct our sparse graph aiming to incorporate phenotypic information in our model, providing additional information on how similar two samples' feature vectors and labels are expected to be. Considering a set of H non-imaging measures M = {Mh} (e.g. subject's gender and age), the population graph's adjacency matrix W is defined as follows:

\[ W(v, w) = \mathrm{Sim}(S_v, S_w) \sum_{h=1}^{H} \rho\left(M_h(v), M_h(w)\right), \qquad (1) \]

where Sim(Sv, Sw) is a measure of similarity between subjects, increasing the weights between the most similar graph nodes, and ρ is a measure of distance between phenotypic measures. For categorical data such as gender or acquisition site, we define ρ as the Kronecker delta function δ. For quantitative measures such as the subject's age, we define ρ as a unit-step function with respect to a threshold θ:

\[ \rho\left(M_h(v), M_h(w)\right) = \begin{cases} 1 & \text{if } |M_h(v) - M_h(w)| < \theta \\ 0 & \text{otherwise} \end{cases} \]

The underlying idea behind this formulation is that complementary non-imaging data can provide key information explaining correlations between subjects' feature vectors. The objective is to leverage this information so as to define an accurate neighbourhood system that optimises the performance of the subsequent graph convolutions. For the ABIDE population graph, we use H = 2 non-imaging measures, namely the subject's gender and the acquisition site. We define Sim(Sv, Sw) as the correlation distance between the subjects' rs-fMRI connectivity networks after feature selection, as a separation between ASD and controls can be observed within certain sites. The main idea behind this graph structure is to leverage the site information, as we expect subjects to be more comparable within the same site due to the different acquisition protocols. The ADNI graph is built using the subject's gender and age information. These values are chosen because our feature vector comprises brain volumes, which can be strongly affected by age and gender. The most important aspect of this graph is the Sim(Sv, Sw) function, designed to leverage the fact that longitudinal acquisitions from the same subject are present in the database. While linear classifiers treat each entry independently, here we define Sim(Sv, Sw) = λ with λ > 1 if two samples correspond to the same subject, and Sim(Sv, Sw) = 1 otherwise, indicating the strong similarity between acquisitions of the same subject.
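A small sketch of the graph construction in Eq. (1); the Sim term and the distance functions ρ are passed in as arguments since they differ between ABIDE and ADNI, and all function names and the θ default are illustrative.

```python
import numpy as np

def build_adjacency(features, phenotypes, sim, rho_funcs):
    """Population-graph adjacency W of Eq. (1) (illustrative sketch).

    features   : (N, C) array of subject feature vectors.
    phenotypes : list of H arrays (one value per subject) for the measures M_h.
    sim(a, b)  : similarity between two feature vectors (e.g. correlation-based).
    rho_funcs  : list of H functions rho(a, b) -> {0, 1}.
    """
    n = features.shape[0]
    W = np.zeros((n, n))
    for v in range(n):
        for w in range(v + 1, n):
            pheno = sum(rho(M[v], M[w]) for rho, M in zip(rho_funcs, phenotypes))
            W[v, w] = W[w, v] = sim(features[v], features[w]) * pheno
    return W

# illustrative distance functions for Eq. (1)
kronecker = lambda a, b: float(a == b)                         # gender, acquisition site
unit_step = lambda a, b, theta=2.0: float(abs(a - b) < theta)  # age, with threshold theta
```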

2.3 Graph Labelling Using Graph Convolutional Neural Networks

Discretised convolutions, those commonly used in computer vision, cannot be easily generalised to the graph setting, since these operators are only defined for regular grids, e.g. 2D or 3D images. Therefore, the definition of localised graph filters is critical for the generalisation of CNNs to irregular graphs. This can be achieved by formulating CNNs in terms of spectral graph theory, building on tools provided by graph signal processing (GSP) [13].



The concept of spectral graph convolutions exploits the fact that convolutions are multiplications in the Fourier domain. The graph Fourier transform is defined by analogy to the Euclidean domain from the eigenfunctions of the Laplace operator. The normalised graph Laplacian of a weighted graph G = {V, E, W} is defined as L = I_N − D^{−1/2} W D^{−1/2}, where I_N and D are respectively the identity and diagonal degree matrices. Its eigendecomposition, L = U Λ U^T, gives a set of orthonormal eigenvectors U ∈ R^{N×N} with associated real, non-negative eigenvalues Λ ∈ R^{N×N}. The eigenvectors associated with low frequencies/eigenvalues vary slowly across the graph, meaning that vertices connected by an edge of large weight have similar values in the corresponding locations of these eigenvectors. The graph Fourier Transform (GFT) of a spatial signal x defined on the graph G is x̂ = U^T x ∈ R^N, while the inverse transform is given by x = U x̂. Using the above formulations, spectral convolutions of the signal x with a filter gθ = diag(θ) are defined as gθ ∗ x = gθ(L) x = gθ(U Λ U^T) x = U gθ(Λ) U^T x, where θ ∈ R^N is a vector of Fourier coefficients. Following the work of Defferrard et al. [4], we restrict the class of considered filters to polynomial filters gθ(Λ) = \sum_{k=0}^{K} θ_k Λ^k. This approach has two main advantages: (1) it yields filters that are strictly localised in space (a K-order polynomial filter is strictly K-localised) and (2) it significantly reduces the computational complexity of the convolution operator. Indeed, such filters can be well approximated by a truncated expansion in terms of Chebyshev polynomials, which can be computed recursively. Similarly to what is proposed in [8], we keep the structure of our GCN relatively simple. It consists of a series of convolutional layers, each followed by Rectified Linear Unit (ReLU) activation functions to increase non-linearity, and a convolutional output layer. The output layer is followed by a softmax activation function [8], while cross-entropy is used to calculate the training loss over all labelled examples. Unlabelled nodes are then assigned the labels maximising the softmax output.
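To make the filtering step concrete, the sketch below builds the normalised Laplacian and applies a K-order Chebyshev filter to a graph signal. It uses the common approximation λ_max ≈ 2 to rescale the Laplacian and plain NumPy/SciPy operations rather than the authors' released implementation; assume at least two coefficients in theta (K ≥ 1).

```python
import numpy as np
import scipy.sparse as sp

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} for a weighted adjacency matrix W."""
    d = np.asarray(W.sum(axis=1)).ravel()
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    D_inv_sqrt = sp.diags(d_inv_sqrt)
    return sp.eye(W.shape[0]) - D_inv_sqrt @ sp.csr_matrix(W) @ D_inv_sqrt

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Apply a K-order polynomial filter g_theta to a graph signal x.

    Uses the Chebyshev recurrence T_0 = x, T_1 = L~ x, T_k = 2 L~ T_{k-1} - T_{k-2},
    with L~ = 2L/lmax - I (the rescaled Laplacian); theta holds the K+1 coefficients.
    """
    L_tilde = (2.0 / lmax) * L - sp.eye(L.shape[0])
    Tx_prev, Tx = x, L_tilde @ x
    out = theta[0] * Tx_prev + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_prev, Tx = Tx, 2 * (L_tilde @ Tx) - Tx_prev
        out = out + theta[k] * Tx
    return out
```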

3 Results

We evaluate our method on both the ADNI and ABIDE databases using a 10-fold stratified cross-validation strategy. The use of 10 folds facilitates the comparison with the ABIDE state of the art [1], where a similar strategy is adopted. To provide a fair evaluation for ADNI, we ensure that the longitudinal acquisitions of the same subject are in the same fold (i.e. either the testing or training fold). We train a fully convolutional GCN with L hidden layers, approximating the convolutions with Chebyshev polynomials of order K = 3. GCN parameters were optimised for each database with a grid search on the full database. For ABIDE, we use: L = 1, dropout rate: 0.3, l2 regularisation: 5 × 10⁻⁴, learning rate: 0.005, number of features C = 2000. The parameters for ADNI are: L = 5, dropout rate: 0.02, l2 regularisation: 1 × 10⁻⁵, learning rate: 0.01, graph construction variables λ = 10 and θ = 2. The ABIDE network is trained for 150 epochs. Due to the larger network size, we train the ADNI network longer, for 200 epochs. We compare our results to linear classification using a ridge classifier (using the scikit-learn library implementation [12]), which showed the best performance

(a) ABIDE accuracy   (b) ABIDE AUC   (c) ADNI accuracy   (d) ADNI AUC

Fig. 2. Comparative boxplots of the classification accuracy and area under curve (AUC) over all cross validation folds for the (a, b) ABIDE and (c, d) ADNI databases (MCI conversion task). The red dots correspond to the mean value.

amongst linear classifiers. We investigate the importance of the population graph structure by using the same GCN framework with a random graph support of the same density. Comparative boxplots across all folds for the three approaches are shown in Fig. 2 for both databases. GCN results (both with population and random graphs) are computed for ten different initialisation seeds and averaged. For both databases, we observe a significant (p < 0.05) increase in both accuracy and area under curve using our proposed method with respect to the competing methods. The random support yields results equivalent to or worse than the linear classifier. For ABIDE, we obtain an average accuracy of 69.5%, outperforming the recent state of the art (66.8%) [1]. Results obtained for the ADNI database show a large increase in performance with respect to the competing methods, with an average accuracy of 77%, on par with state-of-the-art results [14] and corresponding to a 10% increase over a standard linear classifier.

4 Discussion

In this paper, we introduced the novel concept of graph convolutions for population-based brain analysis. We proposed a strategy to construct a population graph combining image-based patient-specific information with non-imaging-based pairwise interactions, and used this structure to train a GCN for semi-supervised classification of populations. As a proof of concept, the method was tested on the challenging ABIDE and ADNI databases, respectively for ASD classification from a heterogeneous database and for predicting MCI conversion from longitudinal information. Our experiments confirmed our initial hypothesis about the importance of contextual pairwise information for the classification process. In the proposed semi-supervised learning setting, conditioning the GCN on the adjacency matrix makes it possible to learn representations even for the unlabelled nodes, thanks to the supervised loss gradient information that is distributed across the network. This has a clear impact on the quality of the predictions, leading to an improvement of about 4.1% for ABIDE and 10% for ADNI compared to a standard linear classifier (where only individual features are considered).



Several extensions could be considered for this work. Devising an effective strategy to construct the population graph is essential and far from obvious. Our graph encompasses several types of information in the same edge. An interesting extension would be to use attributed graphs, where the edge between two nodes corresponds to a vector rather than a scalar. This would make it possible to exploit complementary information and weight the influence of some measures differently. Integrating time information with respect to the longitudinal data could also be considered. Our feature vectors are currently quite simple, as our main objective was to show the influence of the contextual information in the graph. We plan to evaluate our method using richer feature vectors, potentially via the use of autoencoders on MRI images and rs-fMRI connectivity networks.

References

1. Abraham, A., Milham, M., Di Martino, A., Craddock, R.C., Samaras, D., Thirion, B., Varoquaux, G.: Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage 147, 736–745 (2016)
2. Brosch, T., Tam, R.: Manifold learning of brain MRIs by deep learning. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 633–640. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40763-5_78
3. Craddock, C., Sikka, S., Cheung, B., Khanuja, R., Ghosh, S., et al.: Towards automated analysis of connectomes: the configurable pipeline for the analysis of connectomes (C-PAC). Front. Neuroinform. 42 (2013)
4. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: NIPS, pp. 3837–3845 (2016)
5. Desikan, R.S., Ségonne, F., Fischl, B., Quinn, B.T., Dickerson, B.C., et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage 31(3), 968–980 (2006)
6. Di Martino, A., Yan, C.G., Li, Q., Denio, E., Castellanos, F.X., et al.: The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol. Psychiatry 19(6), 659–667 (2014)
7. Havaei, M., Davy, A., et al.: Brain tumor segmentation with deep neural networks. Med. Image Anal. 35, 18–31 (2017)
8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
9. Ledig, C., Heckemann, R.A., Hammers, A., et al.: Robust whole-brain segmentation: application to traumatic brain injury. Med. Image Anal. 21(1), 40–58 (2015)
10. Niepert, M., Ahmed, M., Kutzkov, K.: Learning convolutional neural networks for graphs. arXiv preprint arXiv:1605.05273 (2016)
11. Parisot, S., Darlix, A., Baumann, C., Zouaoui, S., et al.: A probabilistic atlas of diffuse WHO Grade II Glioma locations in the brain. PLOS ONE 11(1), e0144200 (2016)
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
13. Shuman, D.I., Narang, S.K., et al.: The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30(3), 83–98 (2013)
14. Tong, T., Gao, Q., Guerrero, R., Ledig, C., Chen, L., Rueckert, D.: A novel grading biomarker for the prediction of conversion from mild cognitive impairment to Alzheimer's disease. IEEE Trans. Biomed. Eng. 64(1), 155–165 (2017)
15. Wolz, R., Aljabar, P., Hajnal, J.V., Lötjönen, J., Rueckert, D.: Nonlinear dimensionality reduction combining MR imaging with non-imaging information. Med. Image Anal. 16(4), 819–830 (2012)

Predicting Future Disease Activity and Treatment Responders for Multiple Sclerosis Patients Using a Bag-of-Lesions Brain Representation

Andrew Doyle¹ (B), Doina Precup², Douglas L. Arnold³, and Tal Arbel¹

¹ Centre for Intelligent Machines, McGill University, Montréal, Canada
[email protected]
² School of Computer Science, McGill University, Montréal, Canada
³ NeuroRx Research, Montréal, Canada

Abstract. The growth of lesions and the development of new lesions in MRI are markers of new disease activity in Multiple Sclerosis (MS) patients. Successfully predicting future lesion activity could lead to a better understanding of disease worsening, as well as prediction of treatment efficacy. We introduce the first, fully automatic, probabilistic framework for the prediction of future lesion activity in relapsing-remitting MS patients, based only on baseline multi-modal MRI, and use it to successfully identify responders to two different treatments. We develop a new Bag-of-Lesions (BoL) representation for patient images based on a variety of features extracted from lesions. A probabilistic codebook of lesion types is created by clustering features using Gaussian mixture models. Patients are represented as a probabilistic histogram of lesion-types. A Random Forest classifier is trained to automatically predict future MS activity up to two years ahead based on the patient's baseline BoL representation. The framework is trained and tested on a large, proprietary, multi-centre, multi-modal clinical trial dataset consisting of 1048 patients. Testing based on 50-fold cross validation shows that our framework compares favourably to several other classifiers. Automated identification of responders in two different treated groups of patients leads to sensitivity of 82% and 84% and specificity of 92% and 94% respectively, showing that this is a very promising approach towards personalized treatment for MS patients.

1 Introduction

Multiple Sclerosis (MS) is an inflammatory, demyelinating disease of the central nervous system which commonly affects young adults, with no currently known cure [1]. Magnetic Resonance Imaging (MRI) has been used to diagnose and monitor disease activity and progression, as one of the hallmarks of the disease includes the presence of lesions which are visible in MRI. The number of new or enlarged T2 lesions is a marker of MS activity and the volume of lesions is often used to quantify accumulated “disease burden” [2]. For relapsing remitting MS



(RRMS), these measurements have become essential in the evaluation of new treatments through clinical trials; therapy success is often measured through a reduction in the number of new lesions. Hence, predicting future lesion activity in MRI could lead to a better understanding of disease worsening, and also to evaluating treatment efficacy in clinical trials. However, automatic prediction is challenging, as the MRI of patients with MS presents wide variability across the population, with varying numbers, sizes and shapes of lesions throughout the brain. How these variable lesion characteristics affect patients' outcomes is unknown, making this context a perfect candidate for automatic data mining and machine learning techniques. The longitudinal course of RRMS is also highly variable across the population, resulting in new lesions that can appear and disappear, grow or remain stable over time, for reasons that are not well understood. As a result, a unified static and dynamic model across the population is difficult to develop. While the detection of Gadolinium-enhancing lesions has been shown to be a good indicator that a patient's disease is currently active, administering Gadolinium has important side effects for patients. Several automatic prediction methods predict the conversion of patients with preliminary symptoms to MS, rather than predicting the dynamics of patients known to have the disease, using logistic regression on a number of clinical indicators [3] and, more recently, deep learning methods [4,5]. Other efforts have predicted long-term clinical effects [6].

In this work, we develop a fully automatic, probabilistic machine learning framework to model the variability of lesions in the multi-modal MRI of patients with RRMS with the objectives of: (1) automatic identification of lesion types across the population, (2) probabilistic prediction of new lesion activity in patients two years in the future based only on baseline multi-modal MRI and (3) automatic identification of responders to treatment using the lesion activity prediction learned for untreated and treated groups. Leveraging the success of the Bag-of-Words model in performing unsupervised categorization in the field of computer vision [7], we develop a novel unsupervised Bag-of-Lesions (BoL) model for brain image representation in the context of MS. The method first clusters previously labelled lesions based on a variety of image-based features (e.g. textures, prior tissue atlas). This leads to a codebook of lesion types. Lesions are represented probabilistically over codewords, and patients are represented as a "Bag-of-Lesions", based on probabilistic lesion codeword histograms. This permits the automatic unsupervised grouping of images through histogram clustering.

Experiments on a proprietary dataset of 1048 patients, acquired during a large, multi-center, multi-scanner clinical trial, show that the BoL representation at baseline, combined with a random forest classifier, can be used to accurately predict patient lesion activity two years in the future, where activity is defined as the presence of new or enlarged T2 lesions. In 50-fold cross-validation, our results compare favourably to Support Vector Machines (SVM) and Nearest Neighbour classifiers, as well as a simpler Naive Bayesian classifier based on counts of lesions of different sizes. We also use this framework to automatically



identify responders in two different treated groups of patients, with sensitivity of 82% and 84% and specificity of 92% and 94% respectively.

2 Proposed Method

Each RRMS patient presents at baseline with a set of multi-channel MRI, I, and a set of L coarsely labelled lesions obtained automatically through an algorithm (e.g. [8]) or manually. Our first objective is to model the variability of lesions, and to develop a robust, data-driven categorization of lesions into a finite set of types. In order to obtain such a representation, lesions are first divided into coarse size bins. Each lesion is then described by a set of vector-valued, intensity-based features fx. In this work, we use four different kinds of features: RIFT [10] and Local Binary Patterns (LBP) [11] at varying window sizes to encode the texture of the lesion and surrounding tissues, a probabilistic healthy tissue class prior to encode tissue context (represented by the mean and variance of healthy tissue prior probabilities from an atlas over the voxels labelled as lesion), and intensity features (mean and variance of the intensity of the lesion voxels). Other features can be added as desired. Lesion features are binned according to size groups and modelled using Gaussian mixture models (GMM), whose components have full covariance matrices¹. Each mixture is learned in standard fashion using Expectation Maximization (EM). The Bayesian Information Criterion (BIC) is used to determine the number of mixture components, nx. We refer to the components of these GMMs, denoted fx,j, j = 1 . . . nx, as feature-types.

For lesion Li, let fx(I, Li), x = 1, . . . , 4 denote the features extracted for this lesion, and let c(x, j, i) = P(fx,j | fx(I, Li)), j = 1 . . . nx. We construct the Cartesian product of the feature-types (which has ∏_{x=1}^{4} nx elements). We consider each of these elements a lesion type. For each element (j1, . . . , j4), the product c(1, j1, i) · · · c(4, j4, i) gives the probability of the corresponding codeword for lesion i. The use of this product encodes a conditional independence assumption: feature-types are considered conditionally independent given the lesion. We then collect all codewords for all lesions. Finally, a patient's representation is a probabilistic histogram of the lesion-types present in their brain scans, referred to as a Bag-of-Lesions representation (by analogy to the Bag-of-Words representation used in text and image processing). An overview of our framework can be found in Fig. 1.

As patients are represented as a distribution of lesion-types, groups of similar patients can also be found by automatic clustering using EM. The optimal number of groups can be selected automatically using the BIC. We compute the likelihood that a new test patient is part of a group based on their BoL by computing the Mahalanobis distance to each group. In this way we automatically learn patterns of lesion presentation across the population.

¹ In particular, given our features, we have 4 GMMs: for RIFT, LBP, prior and intensity features.
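A condensed sketch of the codebook construction and the BoL histogram is given below. It omits the per-size-bin handling for brevity and assumes the per-lesion feature vectors have already been extracted; all names and the BIC search range are illustrative.

```python
import numpy as np
from itertools import product
from sklearn.mixture import GaussianMixture

def fit_feature_types(feature_matrices, max_components=10):
    """Fit one full-covariance GMM per feature kind (RIFT, LBP, prior, intensity),
    choosing the number of components by BIC."""
    gmms = []
    for X in feature_matrices:                      # X: (num_lesions, feature_dim)
        candidates = [GaussianMixture(n, covariance_type='full').fit(X)
                      for n in range(1, max_components + 1)]
        gmms.append(min(candidates, key=lambda g: g.bic(X)))
    return gmms

def bag_of_lesions(lesion_features, gmms):
    """Probabilistic lesion-type histogram for one patient.

    lesion_features: list (one entry per lesion) of per-feature-kind vectors,
                     ordered consistently with `gmms`.
    """
    n_types = [g.n_components for g in gmms]
    hist = np.zeros(int(np.prod(n_types)))
    for feats in lesion_features:
        # posterior over feature-types for each feature kind
        posts = [g.predict_proba(f.reshape(1, -1)).ravel() for g, f in zip(gmms, feats)]
        # codeword weights: product over feature kinds (conditional independence)
        weights = np.array([np.prod([p[j] for p, j in zip(posts, combo)])
                            for combo in product(*[range(n) for n in n_types])])
        hist += weights                              # each lesion's weights sum to 1
    return hist / max(len(lesion_features), 1)       # normalised histogram
```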



Fig. 1. (a) Learning the Bag of Lesions from Training Data. Lesions are first separated by size. Features (e.g. RIFT) are extracted from each lesion in the database. Each feature is modelled as a separate GMM, with each component referred to as a featuretype. Each lesion codeword is the combination of feature-types. (b) Representing a new patient. The lesion codeword is determined for all patient lesions. The patient is represented by a probabilistic histogram of lesion-types.

2.1 Activity Prediction

The appearance of new lesions or enlargement of existing lesions can be used as a biomarker for focal inflammatory activity, which is associated with relapses in RRMS. We seek a probabilistic prediction of future activity, based on the baseline BoL representation, P (A = 1|BoL). We train a random forest classifier to predict MS activity based on the BoL representation P (A|BoL) with different sets of lesion-types. The lesion-types are progressively eliminated using a backward elimination method, removing 20% of the least informative remaining types (as determined by the Gini impurity across all nodes of all trees) at each iteration and evaluating prediction accuracy on a retrained random forest [9]. The lesion-types that result in the highest prediction accuracy are preserved in the final model. The final prediction is computed by averaging the activity probability predicted by each tree. Because the dataset is imbalanced, with many fewer patients being inactive, the training error weights the two types of



misclassification differently, accounting for the proportion of examples in each class.
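A rough sketch of the lesion-type selection loop described above: the 20% elimination rule and the class weighting follow the text, while the cross-validation used to score each subset is an assumption made here for self-containment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_lesion_types(X, y, drop_fraction=0.2, min_features=5):
    """Backward elimination of lesion-types using random-forest importances.

    X: (num_patients, num_lesion_types) BoL histograms; y: activity labels (0/1).
    At each step the least informative 20% of the remaining lesion-types (Gini
    importance) are removed and the forest is retrained; the subset with the
    best cross-validated accuracy is returned.
    """
    remaining = np.arange(X.shape[1])
    best_score, best_subset = -np.inf, remaining
    while len(remaining) >= min_features:
        rf = RandomForestClassifier(n_estimators=200, class_weight='balanced')
        score = cross_val_score(rf, X[:, remaining], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, remaining.copy()
        rf.fit(X[:, remaining], y)
        order = np.argsort(rf.feature_importances_)           # ascending importance
        keep = order[int(drop_fraction * len(remaining)):]    # drop least informative 20%
        if len(keep) == len(order):                           # nothing left to drop
            break
        remaining = remaining[np.sort(keep)]
    return best_subset
```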

2.2 Identifying Responders to Treatment

Ground truth information regarding which patients in a treatment group have definitively responded to treatment is rarely available. In this work, responders to treatment are defined as patients predicted, with high confidence, to have new lesions or lesion growth two years from baseline if not treated, but who instead had no lesion activity. This can act as a proxy for ground truth, based on the assumption that treatment must have halted the activity of the disease. To achieve this goal, we fit activity prediction models for the untreated and treated populations separately. To identify whether a new patient is a responder to a drug, we compute the patient's probability of future activity, Puntr(A = 1|BoL), using the "untreated" model from the Bag-of-Lesions representation computed from the baseline MRI, and the probability Ptreat(A = 0|BoL) using the model computed from treated patients. A patient is considered a responder if these probabilities exceed thresholds α and β respectively, essentially stating that the two models disagree with high confidence.
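The responder rule itself reduces to two thresholded probabilities; a minimal sketch, assuming scikit-learn-style classifiers whose predict_proba columns are ordered [A = 0, A = 1]:

```python
def is_responder(bol, untreated_model, treated_model, alpha=0.8, beta=0.8):
    """Flag a treated patient as a responder (Sect. 2.2).

    untreated_model / treated_model: trained classifiers with predict_proba
    returning [P(A=0), P(A=1)] for a BoL histogram; alpha, beta are the
    confidence thresholds on the two models.
    """
    p_active_if_untreated = untreated_model.predict_proba([bol])[0][1]
    p_inactive_if_treated = treated_model.predict_proba([bol])[0][0]
    return p_active_if_untreated > alpha and p_inactive_if_treated > beta
```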

3 Experiments and Results

In order to validate the framework for characterizing lesion types and patient groups, for predicting future lesion activity and for classifying responders to treatment, we conducted experiments using a large, proprietary dataset of real MS patient brain images from a multi-centre, multi-scanner clinical trial. The data contained 1048 RRMS patients, each with 4 MR image sequences available: T1, T2, PD, and FLAIR. Each volume was at a resolution of 1 mm × 1 mm × 3 mm. Pre-processing included brain extraction [12], bias field inhomogeneity correction using N3 [13], Nyul image intensity normalization, and registration of all images to MNI-space. Included with the clinical trial dataset were: (1) T2 lesion label masks for each patient at baseline, (2) New disease activity labels for each patient, defined as the presence of any new or enlarging T2 lesions 24 months from baseline. The T2 lesion masks provided were obtained through a semi-manual process whereby a trained expert reader corrected an in-house automated segmentation result. The new and enlarged T2 lesion masks provided were obtained through expert validation of an automatic longitudinal MS lesion segmentation framework [14]. Patients were treated in a double-blind study with either a placebo or one of two drugs, divided as follows: 259 Untreated (placebo), 280 Drug A, 259 Drug B. The trial did not achieve its primary endpoint (due to insufficient evidence of effectiveness across the entire cohort). However, there was a clear trend towards a treatment response for some patients in the trial, rendering the task of automatically finding responders at once challenging and compelling for this dataset. A total of 98,106 lesions were used to build a comprehensive lesion codebook. According to clinical protocol, lesions of less than three voxels were omitted.



Fig. 2. Examples of lesions from three lesion types. Top: Lesions (red) over FLAIR images. Bottom: Zoomed in. (a) Lesions between ventricles. (b) Cortical lesions. (c) Large peri-ventricular lesions.

Lesions were subdivided into four coarse size groups: tiny (3–10 voxels), small (11–25 voxels), medium (26–100 voxels), and large (101+ voxels). For each lesion, RIFT features were extracted at three scales (3 mm, 6 mm, 9 mm) with eight bins for gradients in two dimensions. LBP features were obtained by binarizing intensity differences around central voxels at fixed radii (1 mm, 2 mm, 3 mm). As such, RIFT and LBP captured the textures of the lesions and their surrounding tissues (e.g. see Fig. 1), overcoming any minor under/over-segmentation in the lesion labelling. Probabilistic healthy tissue context was obtained through registration to MNI-space, leading to prior probabilities of white matter (WM), gray matter (GM), cerebral spinal fluid (CSF), and partial volume (PV, at the interface of GM and CSF). The mean and the variance of these probabilities at the lesion voxels are taken as features. Intensity was encoded as the mean and variance of the intensity of each of the image modalities across each lesion. Examples of lesions drawn from several types are shown in Fig. 2. Patients clustered automatically based on their BoL representation were found to exhibit similar lesion distributions (see Fig. 3).

3.1 Disease Activity Prediction

Each of the MS patients' multi-channel volumes in the clinical trial dataset was considered as a baseline acquisition from which a BoL representation was inferred.



Fig. 3. Patients automatically grouped based on similar lesion histogram distributions. (Top) Patients with a few very small lesions mostly in the white matter. (Bottom) Patients with large lesions near lateral ventricles.

The emergence of new and enlarging lesions 2 years after baseline was additional information provided for all but 250 patients (as they did not complete the study). These markers were used as indicators of future disease activity. A random forest classifier was used for optimal lesion-type selection and for the prediction of P(A|BoL). 50-fold cross-validation experiments were performed on the untreated (placebo) dataset. Figure 4 shows the maximum likelihood random forest results in comparison with several classifiers: (1) Nearest Neighbour (NN), where activity was assigned based on the closest training case, as defined by three different distance metrics (Euclidean, Mahalanobis, and χ2)²; (2) Support Vector Machines (SVM)³, using linear, RBF and χ2 kernels; (3) a Naive Bayes classifier, based solely on the number of lesions in each size bin, in order to explore whether this was the dominating factor in our framework. Both NN and SVM were based on the BoL representations. The random forest classifier (α = 0.5) performed favourably against the other methods overall, with mean values of 70% sensitivity and 58% specificity (for A = 1). All methods based on the BoL representation outperformed the Naive Bayesian method. When considering only activity predictions with high probabilities (above α = 0.8), sensitivity increased to 94%. However, the specificity dropped substantially, partially because there were only 14 inactive cases at that threshold.

Two treated groups of patients were available for training and testing a separate activity prediction model under the effects of treatment after baseline. In the 50-fold cross-validation using the random forest classifier on the treated cases, the sensitivities increased to close to 1 for both treatments at high probability thresholds (β = 0.8), with specificities of around 0.5. Interestingly, when patients in the treated groups were tested using the untreated model, specificity decreased by 7% for both treatments (α = 0.5), due to an increase in false positive predictions. This indicates the effectiveness of the treated patient prediction model and, for some patients, the treatment seems to be effectively halting the formation of new or enlarged lesions.

² The Mahalanobis distance normalizes the distance based on the covariance matrix of the lesion types, and the χ2 distance measures the distance between histograms.
³ Using the scikit-learn 0.18 package, which wraps the libsvm implementation.



Fig. 4. Comparison of disease activity prediction results based on a 50-fold cross-validation on the placebo dataset: 3 Nearest Neighbour (NN) methods, 3 Support Vector Machines (SVM), the proposed Random Forest classifier (α = 0.5), and a Naive Bayesian classifier trained only on the number of lesions of each size.

3.2 Responder Identification

We define "responders" (R = 1) as those patients in the treated group whose baseline scans lead to a high predicted probability (α = 0.8) of activity two years later under the untreated model, but who have a known outcome of inactive (i.e. no new or enlarged T2 lesions). At this probability threshold, the sensitivity for detecting activity in the untreated patients is 98%. Using this definition, there were 25 responders in the Drug A treatment arm and 24 responders in the Drug B treatment arm. Table 1 shows the results of responder classification for the two treatments in the clinical trial dataset, when the probability thresholds are set high (α = β = 0.8). The results indicate that the treatment can be reliably predicted to work on a small subset of patients, even though the overall objectives of the clinical trial were not met.

Table 1. Responder prediction results for treatments A & B, probability thresholds α = β = 0.8.

         | Sensitivity | Specificity
Drug A   | 82%         | 92%
Drug B   | 84%         | 94%

4 Conclusion

In this paper, we introduce a fully automatic, probabilistic framework for the prediction of future MS disease activity in patients based on a new Bag-of-Lesions representation of their scans at baseline. We develop a probabilistic codebook of distinct lesion types across the population, and show how those lesion types can be used to separate patients into groups that present similar lesion patterns.



Additional clinical validation is required to determine how this translates into discoveries of natural patterns of MS disease variability. The activity prediction is then used to automatically identify potential responders to two treatments in the context of a real, large, multi-centre, multi-scanner clinical trial for RRMS patients, showing sensitivities of 82% and 84% and specificities of 92% and 94% respectively. This suggests the possibility of a tool for personalized treatment for new MS patients, and for assessing treatment efficacy.

Acknowledgements. This work was supported by the Canadian NSERC Discovery and CREATE grants. We would like to thank Drs. Narayanan and Maranzano for their clinical advice, and Mr. A. Zografos Caramanos for data preparation. All patient MRI are courtesy of NeuroRx Research.


Sparse Multi-kernel Based Multi-task Learning for Joint Prediction of Clinical Scores and Biomarker Identification in Alzheimer’s Disease

Peng Cao1(B), Xiaoli Liu1, Jinzhu Yang1, Dazhe Zhao1, and Osmar Zaiane2

1 Key Laboratory of Medical Image Computing of Ministry of Education, College of Computer Science and Engineering, Northeastern University, Shenyang, China [email protected]
2 University of Alberta, Edmonton, Canada

Abstract. Machine learning methods have been used to predict clinical scores and identify image biomarkers from individual MRI scans. Recently, multi-task learning (MTL) with sparsity-inducing norms has been widely studied to investigate the predictive power of neuroimaging measures by incorporating the inherent correlations among multiple clinical cognitive measures. However, most existing MTL algorithms are formulated as linear sparse models, in which the response (e.g., a cognitive score) is a linear function of the predictors (e.g., neuroimaging measures). To exploit the nonlinear relationship between the neuroimaging measures and the cognitive measures, we assume that the tasks to be learned share a common subset of features in the kernel space as well as the kernel functions. Specifically, we propose a multi-kernel based multi-task learning method with a mixed sparsity-inducing norm to better capture the complex relationship between the cognitive scores and the neuroimaging measures. The formulation can be efficiently solved by mirror-descent optimization. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database showed that the proposed algorithm achieves better prediction performance than state-of-the-art linear methods, both on single-modality MRI and on multiple modalities.

1 Introduction

Alzheimer’s disease (AD) status can be characterized by the progressive impairment of memory and other cognitive functions. Thus, using neuroimaging measures to predict cognitive performance is an important topic. Multivariate regression models have been studied in AD to reveal relationships between neuroimaging measures and cognitive scores, to understand how structural changes in the brain can influence cognitive status, and to predict cognitive performance from neuroimaging measures based on the estimated relationships. Many clinical/cognitive measures have been designed to evaluate the cognitive


status of the patients and are used as important criteria for the clinical diagnosis of probable AD [5,13,15]. There exist correlations among the multiple cognitive tests, and multi-task learning (MTL) seeks to improve the performance of each task by exploiting the intrinsic relationships among the related cognitive tasks. The assumption of the commonly used MTL methods is that all tasks share the same data representation, enforced with ℓ2,1-norm regularization, since a given imaging marker can affect multiple cognitive scores and only a subset of the imaging features (brain regions) are relevant [13,15]. However, these methods assume a linear relationship between the MRI features and the cognitive outcomes, and the ℓ2,1-norm regularization only considers the shared representation of the features in the original space. Unfortunately, this assumption usually does not hold due to the inherently complex structure of the data [11]. Kernel methods have the ability to capture nonlinear relationships by mapping data to higher dimensions where they exhibit linear patterns. However, the choice of the types and parameters of the kernels for a particular task is critical, since it determines the mapping between the input space and the feature space. To address the above issues, we propose a sparse multi-kernel based multi-task learning method (SMKMTL) with a mixed sparsity-inducing norm to better capture the complex relationship between the cognitive scores and the neuroimaging measures. Multiple kernel learning (MKL) [4] not only learns an optimal combination of given base kernels, but also exploits the nonlinear relationship between MRI measures and cognitive performance. The assumption of SMKMTL is that not only the kernel functions but also the features in the high-dimensional space induced by the combination of only a few kernels are shared across the multiple cognitive-measure tasks. Specifically, SMKMTL explicitly incorporates the task correlation structure with ℓ2,1-norm regularization on the high-dimensional features in the RKHS, which builds the relationship between the MRI features and the cognitive score prediction tasks in a nonlinear manner and ensures that a small subset of features is selected for the regression models of all the cognitive outcome prediction tasks, and an ℓq-norm on the kernel functions, which allows various schemes of sparsely combining the base kernels by varying q. Moreover, since the MKL framework has the advantage of fusing multiple modalities, we also apply SMKMTL to multi-modality data (MRI, PET and demographic information) in our study. We present a mirror-descent-type algorithm to efficiently solve the proposed optimization problem, and conduct extensive experiments using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) to evaluate our method with respect to prediction performance and multi-modality fusion.
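For reference, the linear ℓ2,1-regularized baseline that the proposed method generalizes can be fit with scikit-learn's MultiTaskLasso, which couples feature selection across tasks through the same mixed norm. The arrays below are random placeholders standing in for the MRI features and cognitive scores; this is a sketch of the baseline, not of SMKMTL itself.

```python
# Linear l2,1-regularized multi-task regression (the MT-GL style baseline):
# each row of W is either zero for all tasks or non-zero for all tasks,
# so a brain measure is selected (or discarded) jointly across scores.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
X = rng.randn(100, 319)          # placeholder: 319 MRI measures per subject
Y = rng.randn(100, 10)           # placeholder: 10 cognitive scores per subject

mtl = MultiTaskLasso(alpha=0.1).fit(X, Y)
W = mtl.coef_.T                  # shape (319, 10): one column per task
selected = np.flatnonzero(np.linalg.norm(W, axis=1) > 0)
print("jointly selected features:", selected.size)
```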

2 Sparse Multi-Kernel Multi-task Learning, SMKMTL

Consider a multi-task learning (MTL) setting with m tasks. Let p be the number of covariates, shared across all the tasks, and n the number of samples. Let X ∈ R^{n×p} denote the matrix of covariates, Y ∈ R^{n×m} the matrix of responses with each row corresponding to a sample, and Θ ∈ R^{p×m} the parameter matrix, with column θ_{.t} ∈ R^p corresponding to task t, t = 1, …, m, and row θ_{i.} ∈ R^m corresponding to feature i, i = 1, …, p. The general regularized MTL model can be written as:


\min_{\Theta \in \mathbb{R}^{p \times m}} \; L(Y, X, \Theta) + \lambda R(\Theta), \qquad (1)

where L(·) denotes the loss function and R(·) is the regularizer. The most commonly used MTL model is the MT-GL model with ℓ2,1-norm regularization, which takes R(\Theta) = \|\Theta\|_{2,1} = \sum_{l=1}^{p} \|\theta_{l.}\|_2 and is suitable for simultaneously enforcing sparsity over features for all tasks. Moreover, Argyriou et al. proposed Multi-Task Feature Learning (MTFL) with the ℓ2,1-norm [1], whose formulation is \|Y - U^{T}X\Theta\|_F^2 + \|\Theta\|_{2,1}, where U is an orthogonal matrix to be learnt. In these learning methods, each task is traditionally formulated as a linear regression problem, in which the cognitive score is a linear function of the neuroimaging measures. However, the assumption of these existing linear models usually does not hold due to the inherently complex patterns between brain images and the corresponding cognitive outcomes. Modeling cognitive scores as nonlinear functions of neuroimaging measures may provide enhanced flexibility and the potential to better capture the complex relationship between the two quantities. In this paper, we consider the case in which the features are associated with a kernel and hence are in general nonlinear functions of the original features. With the advantage of MKL, we assume that x_i can be mapped to k different Hilbert spaces, x_i → φ_j(x_i), j = 1, …, k, implicitly with k nonlinear mapping functions, and the objective of MKL is to seek the optimal kernel combination. In order to capture the intrinsic relationships among multiple related tasks in the RKHS, we propose a multi-kernel based multi-task learning model with a mixed sparsity-inducing norm. With the ε-insensitive loss function, the formulation can be expressed as:

\min_{\theta, b, \xi, U} \; \frac{1}{2}\Big[\sum_{j=1}^{k}\Big(\sum_{l=1}^{\hat{p}_j}\|\theta_{.jl}\|_2\Big)^{q}\Big]^{2/q} + C\sum_{t=1}^{m}\sum_{i=1}^{n_t}\big(\xi_{ti}+\xi_{ti}^{*}\big)

\text{s.t.}\quad y_{ti} - \sum_{j=1}^{k}\theta_{tj}^{T}U_j^{T}\phi_j(x_{ti}) - b_t \le \varepsilon + \xi_{ti},
\qquad \sum_{j=1}^{k}\theta_{tj}^{T}U_j^{T}\phi_j(x_{ti}) + b_t - y_{ti} \le \varepsilon + \xi_{ti}^{*},
\qquad \xi_{ti},\, \xi_{ti}^{*} \ge 0,\; U_j \in \mathcal{O}^{\hat{p}_j}, \qquad \forall t, i \qquad (2)

where θ_j is the weight matrix for the j-th kernel, θ_{tjl} (l = 1, …, p̂_j) are the entries of θ_{tj}, n_t is the number of samples in the t-th task, p̂_j is the dimensionality of the feature space induced by the j-th kernel, ε is the parameter of the ε-insensitive loss, ξ_{ti} and ξ*_{ti} are slack variables, and C is the regularization parameter. In the formulation of Eq. (2), the ℓ2,1-norm on θ_j forces the weights corresponding to the l-th kernel-space feature across the multiple tasks to be grouped together, and thus tends to select features jointly for all tasks in the kernel space. Moreover, an ℓq-norm (q ∈ [1, 2]) over the kernels is used instead of an ℓ1-norm, so that various schemes of sparsely combining the base kernels can be obtained by varying q.
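The mixed regularizer in Eq. (2) can be evaluated directly. The small NumPy sketch below assumes the kernel-space weights are stored as a list of matrices theta[j] of shape (p̂_j, m), one per base kernel, which is an illustrative layout rather than the authors' implementation.

```python
# Value of the mixed sparsity-inducing regularizer of Eq. (2):
# an l2 norm over tasks for each kernel-space feature, a sum over features
# within a kernel, and an lq coupling across the k kernels.
import numpy as np

def mixed_norm(theta_list, q):
    # theta_list[j]: array of shape (p_hat_j, m); rows are kernel-space
    # features, columns are tasks (layout assumed for illustration).
    per_kernel = []
    for theta_j in theta_list:
        row_l2 = np.linalg.norm(theta_j, axis=1)   # ||theta_{.jl}||_2 over tasks
        per_kernel.append(row_l2.sum())            # sum over features l
    per_kernel = np.asarray(per_kernel)
    return 0.5 * (np.sum(per_kernel ** q)) ** (2.0 / q)

rng = np.random.RandomState(0)
theta = [rng.randn(5, 3), rng.randn(8, 3)]         # k = 2 kernels, m = 3 tasks
print(mixed_norm(theta, q=1.5))
```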


Lemma 1. Let a_i ≥ 0, i = 1, …, d and 1 < r < ∞. Then

\min_{\eta \in \Delta_{d,r}} \sum_{i} \frac{a_i}{\eta_i} = \Big(\sum_{i=1}^{d} a_i^{\frac{r}{r+1}}\Big)^{\frac{r+1}{r}} \qquad (3)

where \Delta_{d,r} = \big\{ z \equiv [z_1, \ldots, z_d]^{T} \;\big|\; \sum_{i=1}^{d} z_i^{r} \le 1,\; z_i \ge 0,\; i = 1, \ldots, d \big\}.
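As a quick sanity check, the identity in Lemma 1 can be verified numerically by minimizing the left-hand side over Δ_{d,r} with a generic solver; the SciPy sketch below is illustrative only and is not part of the proposed algorithm.

```python
# Numerical check of Lemma 1: min_{eta in Delta_{d,r}} sum_i a_i / eta_i
# equals (sum_i a_i^{r/(r+1)})^{(r+1)/r}.
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
a, r = rng.rand(5) + 0.1, 1.7

closed_form = (np.sum(a ** (r / (r + 1)))) ** ((r + 1) / r)

objective = lambda eta: np.sum(a / eta)
constraint = {"type": "ineq", "fun": lambda eta: 1.0 - np.sum(eta ** r)}
eta0 = np.full(5, (1.0 / 5) ** (1.0 / r))          # feasible starting point
res = minimize(objective, eta0, bounds=[(1e-9, None)] * 5,
               constraints=[constraint], method="SLSQP")

print(closed_form, res.fun)                        # the two values agree
```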

According to Lemma 1, introduced in [8], we introduce new variables λ = [λ_1, …, λ_k]^T and γ_j = [γ_{j1}, …, γ_{j p̂_j}]^T, j = 1, …, k. Thus, the regularizer in (2) can be written as \min_{\lambda \in \Delta_{k,\bar q}} \min_{\gamma_j \in \Delta_{\hat p_j,1}} \sum_{t=1}^{m}\sum_{j=1}^{k}\sum_{l=1}^{\hat p_j} \frac{\theta_{tjl}^{2}}{\gamma_{jl}\lambda_j}, where \bar q = \frac{q}{2-q}. Now we perform a change of variables \bar\theta_{tjl} = \theta_{tjl}/\sqrt{\gamma_{jl}\lambda_j}, l = 1, …, p̂_j, and construct the Lagrangian for our optimization problem in (2) as:

\min_{\lambda, \gamma_j, U_j} \; \sum_{t=1}^{m} \max_{\alpha_t \in S_{n_t}(C)} \; y^{T}(\alpha_t - \alpha_t^{*}) - \frac{1}{2}(\alpha_t - \alpha_t^{*})^{T}\Big(\sum_{j=1}^{k}\Phi_{tj}^{T}U_j\Lambda_j U_j^{T}\Phi_{tj}\Big)(\alpha_t - \alpha_t^{*})

\text{s.t.}\quad \lambda \in \Delta_{k,\bar q},\; \gamma_j \in \Delta_{\hat p_j,1},\; U_j \in \mathcal{O}^{\hat p_j} \qquad (4)

where Λ_j is a diagonal matrix with entries λ_j γ_{jl}, l = 1, …, p̂_j, λ ∈ Δ_{k,q̄}, γ_j ∈ Δ_{p̂_j,1}, Φ_{tj} is the data matrix with columns φ_j(x_{ti}), i = 1, …, n_t, α_t, α*_t are the vectors of Lagrange multipliers corresponding to the t-th task in the SMKMTL formulation, and S_{n_t}(C) ≡ \{\alpha_t \mid 0 \le \alpha_{ti}, \alpha_{ti}^{*} \le C,\; i = 1, \ldots, n_t,\; \sum_{i=1}^{n_t}(\alpha_{ti} - \alpha_{ti}^{*}) = 0\}. Denoting U_j Λ_j U_j^T by Q̄_j and eliminating the variables λ, γ, U leads to:

\min_{\bar Q} \; \sum_{t=1}^{m} \max_{\alpha_t \in S_{n_t}(C)} \; y^{T}(\alpha_t - \alpha_t^{*}) - \frac{1}{2}(\alpha_t - \alpha_t^{*})^{T}\Big(\sum_{j=1}^{k}\Phi_{tj}^{T}\bar Q_j \Phi_{tj}\Big)(\alpha_t - \alpha_t^{*})

\text{s.t.}\quad \bar Q_j \succeq 0, \qquad \sum_{j=1}^{k}\big(\mathrm{Tr}(\bar Q_j)\big)^{\bar q} \le 1 \qquad (5)

Then, we use the method described in [6] to kernelize the formulation. Let Φ_j ≡ [Φ_{1j} … Φ_{mj}] and the compact SVD of Φ_j be U_j Σ_j V_j^T. Now, introduce new variables Q_j such that Q̄_j = U_j Q_j U_j^T. Here, Q_j is a symmetric positive semidefinite matrix whose size equals the rank of Φ_j. Eliminating the variables Q̄_j, we can re-write the above problem using Q_j as:

\min_{Q} \; \sum_{t=1}^{m} \max_{\alpha_t \in S_{n_t}(C)} \; y^{T}(\alpha_t - \alpha_t^{*}) - \frac{1}{2}(\alpha_t - \alpha_t^{*})^{T}\Big(\sum_{j=1}^{k} M_{tj}^{T}Q_j M_{tj}\Big)(\alpha_t - \alpha_t^{*})

\text{s.t.}\quad Q_j \succeq 0, \qquad \sum_{j=1}^{k}\big(\mathrm{Tr}(Q_j)\big)^{\bar q} \le 1 \qquad (6)

where M_{tj} = Σ_j^{-1} V_j^T Φ_j^T Φ_{tj}. Given the Q_j's, the problem is equivalent to solving m SVM problems individually. The Q_j's are learnt using the training examples of all the tasks and are shared across the tasks, and this formulation with the trace-norm constraint can be solved by a mirror-descent based algorithm proposed in [2,6].
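A heavily simplified sketch of the resulting alternating scheme is given below: with the kernel combination fixed, each task reduces to an independent ε-SVR with a precomputed kernel. Here each Q_j is collapsed to a single scalar kernel weight, which is a simplification of the trace-constrained matrices used above, and the data and weights are placeholders rather than the mirror-descent solver itself.

```python
# Simplified alternating sketch: fix the kernel weights, then solve one SVR
# per task on the combined precomputed kernel; each Q_j is approximated by a
# scalar weight lam[j], not the full PSD matrix of Eq. (6).
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X, Y = rng.randn(80, 20), rng.randn(80, 3)          # placeholders: 3 tasks

def rbf(X, gamma):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

kernels = [rbf(X, g) for g in (0.01, 0.1, 1.0)]     # base kernels
lam = np.ones(len(kernels)) / len(kernels)          # kernel weights (fixed here)

K = sum(l * Kj for l, Kj in zip(lam, kernels))      # combined kernel
models = [SVR(kernel="precomputed", C=1.0, epsilon=0.1).fit(K, Y[:, t])
          for t in range(Y.shape[1])]               # m independent SVR problems
print([m.predict(K[:5]).shape for m in models])
```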

3 Experimental Results

3.1 Data and Experimental Setting

In this work, only ADNI-1 subjects with no missing features or cognitive scores are included. This yields a total of n = 816 subjects, who are categorized into 3 baseline diagnostic groups: Cognitively Normal (CN, n1 = 228), Mild Cognitive Impairment (MCI, n2 = 399), and Alzheimer’s Disease (AD, n3 = 189). The dataset has been processed by a team from UCSF (University of California at San Francisco), who performed cortical reconstruction and volumetric segmentation with the FreeSurfer image analysis suite. There were p = 319 MRI features in total, including the cortical thickness average (TA), standard deviation of thickness (TS), surface area (SA), cortical volume (CV) and subcortical volume (SV) for a variety of ROIs. In order to investigate the comparison sufficiently, we evaluate the performance on all the widely used cognitive assessments (ADAS, MMSE, RAVLT, FLU and TRAILS, m = 10 tasks in total) [11,12,14]. We use 10-fold cross-validation to evaluate our model and conduct the comparison. In each of twenty trials, a 5-fold nested cross-validation procedure is employed for all the compared methods to tune the regularization parameters. Data were z-scored before applying the regression methods. The candidate kernels are: kernels with six different bandwidths (2^{-2}, 2^{-1}, …, 2^{3}), polynomial kernels of degree 1 to 3, and a linear kernel, which yields 10 kernels in total. The kernel matrices were pre-computed and normalized to have unit trace. For a fair comparison, we validate the regularization parameters of all the methods in the same search space, C (from 10^{-1} to 10^{3}) and q (1, 1.2, 1.4, …, 2) in our method, on a subset of the training set, and use the optimal parameters to train the final models. Moreover, a warm-start technique is used for successive SVM retrainings. In this section, we conduct an empirical evaluation of the proposed method by comparing it with three single-task learning methods: Lasso, Ridge and simpleMKL, all of which are applied independently to each task. Moreover, we compare our method with two baseline multi-task learning methods: MTL with the ℓ2,1-norm (MT-GL) and MTFL. We also compare our proposed method with several popular state-of-the-art related methods. Clustered Multi-Task Learning (CMTL) [16] (\min_{\Theta, F: F^{T}F = I_k} L(X, Y, \Theta) + \lambda_1(\mathrm{tr}(\Theta^{T}\Theta) - \mathrm{tr}(F^{T}\Theta^{T}\Theta F)) + \lambda_2\,\mathrm{tr}(\Theta^{T}\Theta), where F ∈ R^{m×k} is an orthogonal cluster indicator matrix) incorporates a regularization term to induce clustering between tasks and then shares information only among tasks belonging to the same cluster. In CMTL, the number of clusters is set to 5 since the 7 tasks belong to 5 sets of cognitive functions. Trace-Norm Regularized Multi-Task Learning (Trace) [7] assumes that all models share a common low-dimensional subspace (\min_{\Theta} L(X, Y, \Theta) + \lambda\|\Theta\|_{*}, where ‖·‖_* denotes the trace norm, defined as the sum of the singular values). Table 1 shows the results of the compared MTL methods in terms of root mean squared error (rMSE). The experimental results show that the proposed method significantly outperforms the most recent state-of-the-art algorithms in terms of rMSE for most of the scores.
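The candidate kernel construction described above (several bandwidths, polynomial degrees 1 to 3, and a linear kernel, each normalized to unit trace) can be sketched as follows; the z-scored feature matrix X is a placeholder, and the bandwidth-to-gamma conversion is one common convention rather than the authors' exact setting.

```python
# Build the 10 candidate kernels and normalize each to unit trace,
# as described in the experimental setting (X is a placeholder).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel, linear_kernel

rng = np.random.RandomState(0)
X = rng.randn(100, 319)                              # z-scored MRI features

kernels = []
for bw in [2.0**e for e in range(-2, 4)]:            # six bandwidths 2^-2..2^3
    kernels.append(rbf_kernel(X, gamma=1.0 / (2 * bw**2)))  # gamma convention assumed
for deg in (1, 2, 3):                                # polynomial kernels
    kernels.append(polynomial_kernel(X, degree=deg))
kernels.append(linear_kernel(X))                     # linear kernel

kernels = [K / np.trace(K) for K in kernels]         # unit-trace normalization
print(len(kernels), [round(np.trace(K), 3) for K in kernels[:3]])
```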

Table 1. Performance comparison of various methods in terms of rMSE.

              MT-GL        MTFL         Ridge        Lasso        CMTL         Trace        simpleMKL    SMKMTL
ADAS          7.89 ± 0.55  6.84 ± 0.36  6.77 ± 0.31  6.82 ± 0.41  7.64 ± 0.37  8.18 ± 0.61  6.70 ± 0.31  6.61 ± 0.45
MMSE          2.76 ± 0.14  2.21 ± 0.07  2.21 ± 0.09  2.26 ± 0.09  3.08 ± 0.46  6.11 ± 2.04  2.21 ± 0.08  2.09 ± 0.12
RAVLT-TOTAL   11.6 ± 0.52  10.0 ± 0.54  9.61 ± 0.45  9.44 ± 0.53  11.5 ± 0.51  13.1 ± 3.12  9.65 ± 0.47  9.63 ± 0.51
RAVLT-TOT6    3.70 ± 0.30  3.32 ± 0.20  3.34 ± 0.15  3.38 ± 0.18  3.91 ± 0.26  3.78 ± 0.49  3.41 ± 0.22  3.33 ± 0.20
RAVLT-T30     3.79 ± 0.27  3.44 ± 0.17  3.44 ± 0.15  3.46 ± 4.04  3.24 ± 0.25  3.91 ± 0.43  3.41 ± 0.23  3.46 ± 0.19
RAVLT-RECOG   4.43 ± 0.25  3.64 ± 0.21  3.64 ± 0.25  3.63 ± 0.19  4.38 ± 0.23  4.52 ± 0.86  3.64 ± 0.25  3.39 ± 0.20
FLU-ANIM      6.69 ± 0.42  5.35 ± 0.45  5.29 ± 0.44  5.25 ± 0.49  6.61 ± 0.56  6.74 ± 1.42  5.30 ± 0.44  5.23 ± 0.43
FLU-VEG       4.47 ± 0.21  3.75 ± 0.10  3.70 ± 0.10  3.71 ± 0.11  4.39 ± 0.29  4.67 ± 0.79  4.82 ± 0.22  3.47 ± 0.16
TRAILS-A      26.7 ± 1.80  23.8 ± 1.40  23.4 ± 1.11  23.4 ± 1.12  27.5 ± 1.98  28.8 ± 3.28  24.1 ± 1.81  21.1 ± 1.47
TRAILS-B      81.3 ± 2.52  71.2 ± 2.81  71.3 ± 2.95  70.9 ± 2.52  83.6 ± 5.44  89.7 ± 7.83  72.8 ± 2.74  69.8 ± 1.23

Moreover, compared with the other multi-task learning methods, which are built on different assumptions, MT-GL, MTFL and our proposed method, which belong to the multi-task feature learning family built around sparsity, have an advantage over the remaining multi-task learning methods. Since not all brain regions are associated with AD, many of the features are irrelevant and redundant. Sparsity-based MTL methods are therefore appropriate for the task of predicting cognitive measures and perform better than the non-sparse MTL methods. Furthermore, CMTL and Trace are worse than Ridge, which suggests that their model assumptions may be inappropriate for modeling the correlation among the cognitive tasks.

3.2 Fusion of Multi-modality

To estimate the effect of combining multi-modality imaging data with our SMKMTL method and to provide a more comprehensive comparison, we perform additional experiments: (1) using only the MRI modality, (2) using only the PET modality, (3) combining two modalities, PET and MRI (MP), and (4) combining three modalities, PET, MRI and demographic information including age, years of education and ApoE genotyping (MPD). Different from the above experiments, samples from ADNI-2 are used instead of ADNI-1, since the number of patients with PET is sufficient. From ADNI-2 we obtained all patients with both MRI and PET, 756 samples in total. The PET imaging data are from the ADNI database, processed by the UC Berkeley team, who use a native-space MRI scan for each subject that is segmented and parcellated with FreeSurfer to generate summary cortical and subcortical ROIs, coregister each florbetapir scan to the corresponding MRI, and calculate the mean florbetapir uptake within the cortical and reference regions. The image processing procedure is described at http://adni.loni.usc.edu/updated-florbetapir-av-45-pet-analysis-results/.


Table 2. Performance comparison with multi-modality data in terms of rMSE.

              MTFL                                                   SMKMTL
              MRI          PET          MP           MPD            MRI          PET          MP           MPD
ADAS          6.28 ± 0.33  6.09 ± 0.27  6.05 ± 0.29  5.83 ± 0.33    6.19 ± 0.53  5.95 ± 0.17  5.87 ± 0.22  5.79 ± 0.27
MMSE          1.96 ± 0.12  1.92 ± 0.23  1.91 ± 0.18  1.85 ± 0.19    1.87 ± 0.21  1.83 ± 0.19  1.82 ± 0.11  1.77 ± 0.15
RAVLT-TOTAL   9.82 ± 0.44  9.69 ± 0.43  9.55 ± 0.51  9.51 ± 0.41    9.80 ± 0.41  9.71 ± 0.35  9.56 ± 0.42  9.51 ± 0.33
RAVLT-TOT6    3.24 ± 0.15  3.19 ± 0.16  3.09 ± 0.11  2.95 ± 0.11    3.11 ± 0.12  3.03 ± 0.21  2.97 ± 0.09  2.84 ± 0.13
RAVLT-T30     3.16 ± 0.20  3.21 ± 0.12  3.14 ± 0.10  3.05 ± 0.15    3.18 ± 0.25  3.18 ± 0.22  3.11 ± 0.18  3.07 ± 0.12
RAVLT-RECOG   3.70 ± 0.25  3.52 ± 0.12  3.44 ± 0.28  3.30 ± 0.18    3.55 ± 0.21  3.54 ± 0.15  3.34 ± 0.11  3.17 ± 0.10
FLU-ANIM      4.95 ± 0.27  4.46 ± 0.32  4.51 ± 0.28  4.29 ± 0.22    4.71 ± 0.19  4.54 ± 0.18  4.45 ± 0.22  4.21 ± 0.16
FLU-VEG       3.65 ± 0.20  3.55 ± 0.15  3.47 ± 0.25  3.38 ± 0.21    3.49 ± 0.18  3.32 ± 0.21  3.15 ± 0.17  3.09 ± 0.10
TRAILS-A      16.2 ± 2.72  15.8 ± 1.56  14.7 ± 1.43  13.8 ± 1.25    15.3 ± 1.33  13.9 ± 1.18  13.1 ± 0.84  12.5 ± 1.08
TRAILS-B      54.9 ± 1.78  52.8 ± 1.43  50.5 ± 2.02  48.7 ± 2.22    51.8 ± 1.84  50.9 ± 1.66  48.9 ± 1.52  46.0 ± 1.47

In our SMKMTL, the ten kernel functions described in the first experiment are used for each modality. To show the advantage of SMKMTL, we compare it with MTFL, which concatenates the features of the multiple modalities into a single long feature vector. The prediction performance results are shown in Table 2. From the results, it is clear that the methods using multiple modalities outperform the methods using a single modality of data. This validates our assumption that the complementary information among different modalities is helpful for cognitive function prediction. With either two or three modalities, the proposed SMKMTL achieves better performance than the linear multi-task learning in most cases, as in the single-modality experiments above.

4 Conclusions

In this paper, we propose a multi-kernel based multi-task learning method that combines an ℓ2,1-norm on the task correlation in the kernel space with an ℓq-norm on the kernels in a joint framework. Extensive experiments illustrate that the proposed method not only yields superior performance on cognitive outcome prediction, but is also a powerful tool for fusing different modalities. Acknowledgment. This research was supported by the National Natural Science Foundation of China (No. 61502091) and the Fundamental Research Funds for the Central Universities (No. 161604001, N150408001).

References
1. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Mach. Learn. 73(3), 243–272 (2008)
2. Duchi, J.C., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: COLT, pp. 14–26 (2010)
3. Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel methods. J. Mach. Learn. Res. 6, 615–637 (2005)
4. Gönen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)


5. Huo, Z., Shen, D., Huang, H.: New multi-task learning model to predict Alzheimer’s disease cognitive assessment. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 317–325. Springer, Cham (2016). doi:10.1007/978-3-319-46720-7_37
6. Jawanpuria, P., Nath, J.S.: Multi-task multiple kernel learning. In: Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 828–838. SIAM (2011)
7. Ji, S., Ye, J.: An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 457–464. ACM (2009)
8. Micchelli, C.A., Pontil, M.: Learning the kernel function via regularization. J. Mach. Learn. Res. 6, 1099–1125 (2005)
9. Rakotomamonjy, A., Bach, F.R., Canu, S., Grandvalet, Y.: SimpleMKL. J. Mach. Learn. Res. 9, 2491–2521 (2008)
10. Rakotomamonjy, A., Flamary, R., Gasso, G., Canu, S.: lp-lq penalty for sparse linear and sparse multiple kernel multitask learning. IEEE Trans. Neural Networks 22(8), 1307–1320 (2011)
11. Wan, J., Zhang, Z., Rao, B.D., Fang, S., Yan, J., Saykin, A.J., Shen, L.: Identifying the neuroanatomical basis of cognitive impairment in Alzheimer’s disease by correlation- and nonlinearity-aware sparse Bayesian learning. IEEE Trans. Med. Imaging 33(7), 1475–1487 (2014)
12. Wan, J., Zhang, Z., Yan, J., Li, T., Rao, B.D., Fang, S., Kim, S., Risacher, S.L., Saykin, A.J., Shen, L.: Sparse Bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in Alzheimer’s disease. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 940–947 (2012)
13. Wang, H., Nie, F., Huang, H., Risacher, S., Ding, C., Saykin, A.J., Shen, L., et al.: Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 557–562. IEEE (2011)
14. Yan, J., Li, T., Wang, H., Huang, H., Wan, J., Nho, K., Kim, S., Risacher, S.L., Saykin, A.J., Shen, L., et al.: Cortical surface biomarkers for predicting cognitive outcomes using group ℓ2,1 norm. Neurobiol. Aging 36, S185–S193 (2015)
15. Zhang, D., Shen, D., Alzheimer’s Disease Neuroimaging Initiative, et al.: Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2), 895–907 (2012)
16. Zhou, J., Chen, J., Ye, J.: Clustered multi-task learning via alternating structure optimization. In: Advances in Neural Information Processing Systems, pp. 702–710 (2011)

Machine Learning in Medical Image Computing

Personalized Diagnosis for Alzheimer’s Disease

Yingying Zhu1, Minjeong Kim1, Xiaofeng Zhu1, Jin Yan3, Daniel Kaufer2, and Guorong Wu1

1 Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA [email protected]
2 Department of Neurology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
3 Department of Cancer Biology, Duke University, Durham, NC 27705, USA

Abstract. Current learning-based methods for the diagnosis of Alzheimer’s Disease (AD) rely on training a general classifier that aims to recognize abnormal structural alterations from a homogeneously distributed dataset derived from a large population. However, due to diverse disease pathology, the real imaging data encountered in routine clinical practice are highly complex and heterogeneous. Hence, prototype methods that perform well in the laboratory often cannot achieve the expected outcome when applied in a real clinical setting. To address this issue, we propose a novel personalized model for AD diagnosis. We customize a subject-specific AD classifier for the new testing data by iteratively re-weighting the training data to reveal the latent testing data distribution and refining the classifier based on the weighted training data. Furthermore, to improve the estimation of the diagnosis result and clinical scores at the individual level, we extend our personalized AD diagnosis model to a joint classification and regression scenario. Our model shows improved classification and regression accuracy when applied to Magnetic Resonance Imaging (MRI) data selected from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Our work pinpoints the clinical potential of a personalized diagnosis framework for AD.

1 Introduction

Alzheimer’s Disease is one of the most common neurodegenerative disorders; it leads to gradual, progressive memory loss, cognitive decline, loss of functional abilities, and ultimately death [1–4]. MRI offers a non-invasive way to observe the abnormal structural changes of AD progression in vivo. In order to facilitate the MRI-based diagnosis of AD, a number of machine learning approaches have been developed to recognize AD-related alterations of brain structure [5, 6]. Most current learning-based methods train a general classifier (such as a kernel Support Vector Machine) to find a hyper-plane separating two groups in a high-dimensional non-linear space, which is suitable for data with a homogeneous distribution (as shown in Fig. 1(a)). However, it is evident that AD pathology is heterogeneous. Therefore, the real data distribution could be too complex to be represented by only one general model. As shown in Fig. 1(b), the real data distribution might contain multiple sub-groups with distinct disease patterns due to inter-personal variance (different


colors indicate different sub-groups). The incapability of a general classifier to fit heterogeneous data for precise classification prompts us to construct a specific classifier for each sub-group/person to achieve accurate classification. As shown in Fig. 1(b), the green, orange, and pink curves are ideal classifiers for the green, orange, and pink datasets, respectively. Based on this strategy, we develop a personalized classification model for each testing subject, which is desirable for real clinical applications.

Fig. 1. (a) Conventional methods with a general classifier applied on homogeneously distributed data; (b) Heterogeneously distributed realistic data requiring person-specific classification solution; (c) Proposed personalized classification model by re-weighting training data to fit testing data distribution. (Best viewed in color)

The key to the personalized model is to re-weight the training data to fit the testing data distribution and train a person-specific classifier using the weighted training data. Figure 1(b) and (c) show a toy example of how to construct a personalized classifier. There are three sub-groups with different distributions in the training dataset (green, orange, and pink) and one testing dataset (purple). Although the center of the testing dataset seems closer to the green dataset than to the orange dataset, only the distribution of the orange dataset resembles the testing dataset. Therefore, the orange dataset is associated with high weights and the others are assigned small weights (see Fig. 1(c), where size is proportional to weight). This weighted dataset reveals the latent testing data distribution, and a personalized classifier (purple curve in Fig. 1(c)) can be learned from the weighted training dataset. Furthermore, to optimize the weights for building the classifier, we develop an integrated solution in which learning the training data weights and training the personalized classifier are performed simultaneously. This personalized training strategy addresses the data heterogeneity issue and produces more accurate diagnosis results at the individual level. Moreover, we extend the personalized diagnosis model to a joint classification (of the binary clinical labels) and regression (of the continuous clinical scores) model, such that we can further improve the accuracy of diagnosis by utilizing both imaging and phenotype data. We evaluate the proposed personalized AD diagnosis model on the ADNI dataset and achieve more than 8% improvement on average in identifying AD and MCI (Mild Cognitive Impairment) subjects, compared to general classification models.


2 Methods

2.1 Generalized Classification Model

Suppose the training set X consists of N subjects, denoted by X = {x_i | i = 1, …, N}, where each x_i is the feature vector extracted from MRI. Each training subject also has a clinical label l_i identifying whether the underlying subject is at the MCI stage (‘−1’) or has converted to AD (‘+1’). These clinical labels form a set L = {l_i | i = 1, …, N}. Hereafter, we take the kernel SVM as the example of the generalized classification model to illustrate the idea of our personalized AD diagnosis. Kernel SVM seeks to learn a non-linear mapping u to determine the label for the new testing data y, where u is essentially a weighted sum of kernel distances with respect to each known instance x_i in the training dataset X, i.e., u(y) = \sum_{i=1}^{N} a_i k_C(x_i, y), where k_C(x_i, y) denotes the kernel distance from x_i to y in the high-dimensional non-linear space. The classification coefficients a = {a_i | i = 1, …, N} can be optimized from the training dataset X via the following classic energy function:

\arg\min_{a} \; \tfrac{1}{2}\, a^{T} K^{C} a + l\, E_C(u) \qquad (1)

The first term is the regularization term, where K^C is an N × N kernel matrix with each element k_C(x_i, x_j) measuring the distance between any two training data x_i and x_j (i, j = 1, …, N) in the high-dimensional non-linear space. The second term is the misclassification error term E_C(u) = \sum_{i=1}^{N} \|u(x_i) - l_i\|_h, where \|v\|_h = \max(0, v) is the hinge loss function. l is a scalar balancing the regularization term and the misclassification error term. The classification coefficient a is optimized to fit the whole population in Eq. (1). If the data distribution is as homogeneous as the example shown in Fig. 1(a), a generalized classifier can achieve good classification performance. However, real clinical data usually have a complex distribution due to the heterogeneous characteristics of AD pathology. Hence, one general classifier alone is not sufficient to cover all individuals.
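In practice, the generalized model of Eq. (1) corresponds to a single kernel SVM trained on the whole population. A minimal scikit-learn sketch with placeholder data follows; this illustrates the baseline, not the proposed personalized model.

```python
# Generalized (population-level) classifier of Eq. (1): one RBF kernel SVM
# trained on all subjects, with no adjustment to the testing subject.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 270)                 # placeholder morphological features
labels = rng.choice([-1, 1], size=300)  # -1: MCI, +1: AD (placeholder labels)

general_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, labels)
print(general_clf.predict(X[:5]))
```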

2.2 Personalized Classification Model

To address the issue of heterogeneity, we propose to learn a person-specific classifier by leveraging the most relevant data in the training dataset. The training data are re-weighted to reveal the testing subject distribution. We measure the relevance degree of each training data x_i w.r.t. the new testing data y, denoted by r = {r_i}_{i=1,…,N}. In contrast to the general model, which treats all training data uniformly, here we penalize the misclassification error for different training data w.r.t. the relevance degree to the testing data. To achieve personalized AD diagnosis, we adjust the energy function of the general classification model to the personalized classification model in three ways: 1. The misclassification errors are weighted based on the relevance degree r, turning the error term into a weighted average across all training data: E_C(u, r) = \sum_{i=1}^{N} r_i \|u(x_i) - l_i\|_h^{2}.


2. The insight of the personalized classifier is to re-weight each training data point such that the difference between the distribution of the testing data and the weighted training data distribution is minimized. Therefore, we introduce an additional distribution mismatch term which depends on the relevance values r. First, we need to estimate the distribution of the testing data y. Recall that a challenge in the medical imaging area is the limited number of data with label information; it is, however, not difficult to find a sufficient number of unlabeled data. Therefore, we propose to construct the distribution for the testing data y from another unlabeled dataset, denoted by U (U ∩ X = ∅). Specifically, we select the M data most similar to y from U to form the testing dataset Y = {y_j | y_j ∈ U, j = 1, …, M} (we set M = 4 in our experiment). Since the size of the testing dataset is very small (M ≪ N), it is hard to estimate the data density to characterize the distribution of Y. To avoid unreliable estimation of the data density, we resort to the Kernel Mean Matching (KMM) method [7], which is able to measure the distribution dissimilarity in a high-dimensional or even infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) H. Specifically, we define the distribution mismatch term D(X, Y, r) as a function of r by:

D(X, Y, r) = \Big\| \frac{1}{N}\sum_{i=1}^{N} r_i\,\phi(x_i) - \frac{1}{M}\sum_{j=1}^{M} \phi(y_j) \Big\|_2^{2} \qquad (2)

where φ is a non-linear mapping from the image feature space to the RKHS. The intuition behind Eq. (2) is to adjust the distribution of the training dataset X based on the relevance degrees such that the weighted distribution of the training dataset (first term in Eq. (2)) can fit the distribution of the testing dataset Y, i.e., X and Y are comparable in the RKHS. To make Eq. (2) solvable, the "kernel trick" is used to compute the pairwise kernel distances in H rather than the exact values φ(x_i). Thus, we use P = [p_{ij}]_{N×N} (i, j = 1, …, N) to denote the N × N kernel matrix where each element p_{ij} = k_D(x_i, x_j) measures the kernel distance between training data x_i and x_j in the RKHS H. Similarly, we use h = [h_i]_{i=1,…,N} to denote an N-length column vector whose i-th element h_i = \frac{N}{M}\sum_{j=1}^{M} k_D(x_i, y_j) measures the average distance of the training data x_i to all testing data. It is worth noting that the kernel function k_D used here might be different from the kernel function k_C in the kernel SVM (Eq. (1)). After that, we turn Eq. (2) into a quadratic term (a small numerical sketch of this term is given after the list below):

E_D(X, Y, r) = \tfrac{1}{2}\, r^{T} P\, r - h^{T} r \qquad (3)

3. Since the training error term E_C(u, r) and the distribution mismatch term E_D(X, Y, r) are both influenced by the relevance degrees r, we can leverage r to jointly minimize the classification error and the distribution difference. By doing so, we can guarantee that the relevance degrees are optimized towards the eventual goal of improving the classification accuracy for the new testing data.
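The distribution-mismatch term of Eqs. (2)-(3) only needs the kernel matrix P among training subjects and the vector h of average kernel distances to the small testing set. A NumPy sketch with an RBF kernel and placeholder data follows; the bandwidth and array sizes are illustrative.

```python
# Quadratic distribution-mismatch term E_D(X, Y, r) = 0.5 r^T P r - h^T r
# (Eqs. (2)-(3)), written with an RBF kernel k_D; data are placeholders.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(200, 50)                 # training subjects (features)
Y = rng.randn(4, 50)                   # M = 4 augmented testing samples
N, M = X.shape[0], Y.shape[0]

P = rbf_kernel(X, X, gamma=0.02)                       # p_ij = k_D(x_i, x_j)
h = (float(N) / M) * rbf_kernel(X, Y, gamma=0.02).sum(axis=1)

def mismatch(r):
    return 0.5 * r @ P @ r - h @ r

r = np.ones(N)                          # uniform relevance degrees
print(mismatch(r))
```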


Overall energy function. By integrating the above three modifications, the overall energy function for personalized AD diagnosis can be defined as:

\arg\min_{a, r} \; \tfrac{1}{2}\, a^{T} K^{C} a + l\, E_C(u, r) + k\, E_D(X, Y, r) \qquad (4)

where k is the scalar used to control the strength of distribution matching. Since the clinical label of the testing data is not yet known, the estimation of the weights for the training data is driven by the distribution mismatch term E_D(X, Y, r) in an unsupervised manner. In addition, the relevance degrees r are jointly driven by minimizing the classification error and the distribution mismatch. Optimization. Equation (4) is a bi-convex quadratic problem [8], i.e., E(a) is convex if r is fixed and E(r) is convex if a is fixed. Under these conditions, an alternating gradient search approach is guaranteed to monotonically decrease the objective function. Hence, we alternately optimize Eq. (4) w.r.t. r and a until convergence [9]. Discussion. Conventional subject selection approaches are usually performed separately, prior to training the classifier, resulting in a sequential two-step strategy. Therefore, the training data selected in the two-step strategy might not be optimal for classification, since there is no chance to refine the subject selection procedure. Since the personalized classifier is free of the less relevant training samples, the customized mapping function in our personalized classifier is more robust to the new testing subject than the general classifier learned from the entire training data, as demonstrated in Fig. 1(c).

2.3 Advanced Personalized AD Diagnosis Model

In many medical applications, clinical scores such as Mini-Mental State Examination (MMSE) and Clinical Dementia Rating (CDR) scores are widely used to quantify memory loss and behavioral abnormality and facilitate clinical AD diagnosis. Since the clinical scores have higher correlations with diagnosis than the imaging features, there is an increasing trend to integrate classification (for the binary diagnosis labels) and regression (for the continuous clinical scores). Hence, we go one step further and present the advanced personalized AD diagnosis model. Suppose each training data x_i has clinical scores c_i, which form the set of clinical scores C = {c_i | i = 1, …, N}. For consistency, we use the kernel Support Vector Regression (SVR) model to learn another non-linear mapping function w(y) = \sum_{i=1}^{N} b_i k_R(x_i, y) to determine the scores for the new testing data y based on the weighted average of kernel distances k_R(x_i, y) to all training data. The regression coefficients b = {b_i | i = 1, …, N} can be optimized by:

\arg\min_{b} \; \tfrac{1}{2}\, b^{T} K^{R} b + g\, E_R(w) \qquad (5)

where g is the scalar balancing the regularization term and the regression error term E_R(w) = \sum_{i=1}^{N} \|w(x_i) - c_i\|_2^{2}. K^R is the N × N kernel matrix with each element k_R(x_i, x_j) measuring the kernel distance between x_i and x_j in the regression problem. To personalize the regression, we turn the regression error term E_R(w) into the weighted average across all training data, E_R(w, r) = \sum_{i=1}^{N} r_i \|w(x_i) - c_i\|_2^{2}. Thus, the energy function of personalized regression can be derived as:

\arg\min_{b, r} \; \tfrac{1}{2}\, b^{T} K^{R} b + g\, E_R(w, r) + k\, E_D(X, Y, r) \qquad (6)

Furthermore, we integrate personalized classification and regression and derive the overall energy function of the advanced personalized AD diagnosis model as:

\arg\min_{a, b, r} \; \tfrac{1}{2}\, a^{T} K^{C} a + \tfrac{1}{2}\, b^{T} K^{R} b + l\, E_C(u, r) + g\, E_R(w, r) + k\, E_D(X, Y, r) \qquad (7)

Equation (7) can be solved similarly by alternately updating a, b, and r until convergence.
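The alternating scheme can be sketched at a high level: with the relevance degrees r fixed, the classifier is simply a kernel SVM trained with per-sample weights (and the regressor an analogously weighted SVR); with the classifier fixed, r is re-estimated. The loop below uses scikit-learn's sample weights for the first step and a crude normalized-similarity rule for the second, which is only a stand-in for the bi-convex problem of Eqs. (4) and (7), not the authors' solver.

```python
# Simplified alternation for the personalized classifier: (i) fit an SVM with
# per-sample weights r, (ii) update r from a stand-in relevance rule.
# The r-update below is NOT the QP of Eq. (4); it is a crude placeholder.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X, labels = rng.randn(200, 50), rng.choice([-1, 1], 200)
Y = rng.randn(4, 50)                                   # augmented testing set

r = np.ones(len(X))
for it in range(5):
    clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, labels, sample_weight=r)
    sim = rbf_kernel(X, Y, gamma=0.02).mean(axis=1)    # similarity to testing set
    correct = (clf.predict(X) == labels).astype(float) # down-weight misfit samples
    r = sim * (0.5 + 0.5 * correct)
    r *= len(X) / r.sum()                              # keep average weight at 1

print(clf.predict(Y))                                  # personalized prediction
```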

3 Experiments

We evaluate our proposed personalized AD diagnosis model on 150 MCI and 150 AD subjects selected from the ADNI database, each with clinical scores such as ADAS-Cog (Alzheimer’s Disease Assessment Scale-Cognitive Subscale) and MMSE. For each subject, we first segment the image into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). Then, we register the AAL template with 90 manually labeled ROIs (regions of interest) to the underlying subject image. We concatenate the tissue percentiles across the 90 ROIs as the morphological feature for each subject. The accuracy of classification and regression is evaluated with a leave-one-out strategy. For each leave-one-out case, we divide the remaining subjects into five folds. One fold is used as the validation dataset for parameter tuning, one fold is used as the candidate dataset for augmenting the testing data, and the remaining three folds are used as the training dataset. The optimal parameters are learned by an exhaustive strategy on the validation dataset. The search range for the parameters is set to [10^{-4}, 10^{4}]. Three statistical measures are used to evaluate classification: accuracy (ACC), sensitivity (SEN) and specificity (SPEC). Root Mean Square Error (RMSE) and Correlation Coefficients (CC) are used to evaluate the regression performance for two popular clinical measurements, i.e., ADAS-Cog and MMSE. Kernel SVM/SVR (the generalized models) are the baseline methods in the comparison. In the following experiments, we evaluate classification and regression separately. We use the RBF kernel, and the regularization parameters are tuned using five-fold inner cross-validation. In classification, we compare our personalized classifier (Eq. 4) with kernel SVM and with conventional subject selection (which estimates the weights based on feature similarity) followed by kernel SVM, called SS+SVM. Similarly, we evaluate the regression performance for the baseline kernel SVR, SS+SVR, and our personalized regressor. Our advanced personalized AD diagnosis model (called Personalized SVM+SVR)


Fig. 2. Classification performance in identifying MCI/AD and NC/MCI/AD subjects, and regression performance in estimating MMSE and ADAS-Cog scores. (CC denotes the correlation coefficient and RMSE the root mean square error. Best viewed in color)

is evaluated on both the classification and regression tasks, in order to show the benefit of joint classification and regression. Evaluation of classification performance. The ACC, SEN, and SPEC results in identifying MCI/AD and NC/MCI/AD subjects by kernel SVM, SS+SVM, personalized SVM, and personalized SVM+SVR are shown in Fig. 2(a). In general, all personalized approaches achieve higher accuracy than the baseline kernel SVM method, which makes no adjustment to the testing subject. Since our personalized SVM method can jointly select the most relevant subjects and train the classifier, it achieves an overall 1.3% improvement in ACC over the naïve SS+SVM approach, which selects training data and trains the classifier separately. Furthermore, our personalized SVM+SVR obtains an additional 3.1% improvement in ACC over the personalized SVM method, which shows the substantial benefit of joint classification and regression. It is worth noting that our advanced personalized model (personalized SVM+SVR) achieves an 8.3% improvement in ACC compared to the generalized model (kernel SVM). Evaluation of regression performance. Since the clinical scores of each testing subject are known, we calculate the RMSE and CC values between the ground truth and the scores estimated by the four competing methods. We show the RMSE and CC results of estimating the MMSE and ADAS-Cog measurements by kernel SVR, SS+SVR, personalized SVR, and personalized SVM+SVR in Fig. 2(b) and (c), respectively. It is apparent that (1) all personalized methods beat the generalized regression method (kernel SVR); (2) our personalized SVR outperforms the naïve SS+SVR method due to the advantage of jointly weighting the training data and performing the regression; (3) personalized SVM+SVR has the smallest RMSE and the largest CC between the ground truth and the estimated scores, indicating the advantage of allowing clinical labels to guide the regression of clinical scores. Evaluation of the personalized diagnosis model with respect to data heterogeneity. Since the main objective of our study is to address the heterogeneity of imaging data by using a personalized model, we specifically evaluate the performance of the personalized model with respect to the heterogeneity in the observed imaging data. Here we assume that the data heterogeneity increases proportionally as the imaging dataset grows. Therefore, we examine the performance of


Fig. 3. The performance of personalized model vs. general model with respect to different number of training data. (Best viewed in color)

classification and regression w.r.t. different numbers of training data, as shown in Fig. 3. We run a two-sample t-test on the results in Fig. 3 and find the improvement to be significant with p < 0.05. One can observe that (1) both the general and personalized models perform with similar accuracy when the size of the training data is small; (2) as the number of training subjects increases, the personalized model achieves much higher accuracy than the general model on both the classification and regression tasks, although the accuracy of all methods increases consistently; (3) the improvement of the personalized model over the generalized model becomes more prominent as the number of subjects increases. These results show that the personalized model we propose is superior to the general model when applied to a large heterogeneous dataset and thus has potential for clinical practice.

4 Conclusion

To address the heterogeneity issue in image-based diagnosis of AD, we construct a personalized diagnosis model in this work. In this model, we establish a subject-specific AD classifier by re-weighting the training data to reveal the latent distribution for each testing subject while simultaneously refining the classifier. We further improve the diagnosis performance at the individual level by establishing a joint classification and regression scenario. Finally, we evaluate our method on the ADNI dataset for both clinical label and clinical score estimation, compare it to state-of-the-art counterpart methods, and demonstrate the potential of our personalized model for translating computer-assisted diagnosis methods into routine clinical practice.

References
1. Viola, K., et al.: Towards non-invasive diagnostic imaging of early-stage Alzheimer’s disease. Nat. Nanotechnol. 10, 91–98 (2015)
2. Thompson, P.M., Hayashi, K.M., Dutton, R.A., Chiang, M.-C., Leow, A.D., Sowell, E.R., et al.: Tracking Alzheimer’s disease. In: Annals of the New York Academy of Sciences, vol. 1097, pp. 198–214 (2007)


3. Zhu, Y., Zhu, X., Kim, M., Shen, D., Wu, G.: Early diagnosis of Alzheimer’s disease by joint feature selection and classification on temporally structured support vector machine. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 264–272. Springer, Cham (2016). doi:10.1007/978-3-319-46720-7_31
4. Wang, Z., Zhu, X., Adeli, E., Zhu, Y., Zu, C., Nie, F., Shen, D., Wu, G.: Progressive graph-based transductive learning for multi-modal classification of brain disorder disease. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 291–299. Springer, Cham (2016). doi:10.1007/978-3-319-46720-7_34
5. Lindberg, O., et al.: Hippocampal shape analysis in Alzheimer’s disease and frontotemporal lobar degeneration subtypes. J. Alzheimers Dis. 30, 355–365 (2012)
6. Pettigrew, C., et al.: Cortical thickness in relation to clinical symptom onset in preclinical AD. NeuroImage: Clinical 15, 116–122 (2016)
7. Gretton, A., et al.: Covariate shift by kernel mean matching. In: Dataset Shift in Machine Learning, pp. 123–135 (2009)
8. Boyd, S., et al.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3, 1–122 (2011)
9. Zhu, Y., Lucey, S.: Convolutional sparse coding for trajectory reconstruction. TPAMI 37, 529–540 (2015)

GP-Unet: Lesion Detection from Weak Labels with a 3D Regression Network

Florian Dubost1,2,3(B), Gerda Bortsova1,2,3, Hieab Adams3,4, Arfan Ikram3,4,5, Wiro J. Niessen1,2,3,6, Meike Vernooij3,4, and Marleen De Bruijne1,2,3,7

1 Biomedical Imaging Group Rotterdam, Erasmus MC, Rotterdam, The Netherlands [email protected]
2 Department of Medical Informatics, Erasmus MC, Rotterdam, The Netherlands
3 Department of Radiology, Erasmus MC, Rotterdam, The Netherlands
4 Department of Epidemiology, Erasmus MC, Rotterdam, The Netherlands
5 Department of Neurology, Erasmus MC, Rotterdam, The Netherlands
6 Imaging Physics, Faculty of Applied Sciences, TU Delft, Delft, The Netherlands
7 Department of Computer Science, University of Copenhagen, Copenhagen, Denmark

Abstract. We propose a novel convolutional neural network for lesion detection from weak labels. Only a single, global label per image - the lesion count - is needed for training. We train a regression network with a fully convolutional architecture combined with a global pooling layer to aggregate the 3D output into a scalar indicating the lesion count. When testing on unseen images, we first run the network to estimate the number of lesions. Then we remove the global pooling layer to compute localization maps of the size of the input image. We evaluate the proposed network on the detection of enlarged perivascular spaces in the basal ganglia in MRI. Our method achieves a sensitivity of 62% with on average 1.5 false positives per image. Compared with four other approaches based on intensity thresholding, saliency and class maps, our method has a 20% higher sensitivity.

1 Introduction

This paper addresses the problem of the detection of small structures in 3D images. We aim at developing a machine learning method requiring the least possible annotation for training. Several deep learning techniques [1–3] have recently been proposed for 3D segmentation. These methods use fully convolutional networks (FCN) [4] with a downsampling and upsampling path and can therefore detect small regions and fine details. Although efforts have been made to reduce the amount of annotation required, e.g., with sparse annotations [1], these techniques still need pixel-wise annotations for training. Acquiring those is very time-consuming and often not feasible for large datasets. In [5] the authors propose 2D localization networks requiring only global image labels for training. They aggregate the last feature maps of their network into scalars with a global pooling layer and can therefore compute


a classification. Heatmaps can be obtained as a weighted sum of these feature maps. These heatmaps indicate which regions of the image contributed the most to the classification results. Because of the downsampling introduced by the pooling operations in the network, the heatmaps are several times smaller than the original input image, which makes it impossible to detect very small structures. In [6] a similar technique is applied to 3D CT lung data. This network splits into two branches, one with fully connected layers for classification, and the other with a global pooling layer for localization. The loss is computed as a weighted sum of these two terms. However, as in [5], this network is not suitable for the detection of small structures. In this paper we propose a method to learn fine pixel-wise detection from image-level labels. By combining global pooling with a fully convolutional architecture including upsampling steps, as in the popular U-Net [7], we compute a heatmap of the size of the input. This heatmap reveals the presence of the targeted structures. During training, unlike [5] or [6] where the authors use a classification network, the weights of the network are optimized to solve a regression task, where the objective is to predict the number of lesions. We evaluate our method on the detection of enlarged perivascular spaces (EPVS) in the basal ganglia in MRI. EPVS is an emerging biomarker for cerebral small vessel disease. The detection of EPVS is a challenging task, even for a human observer, because of their very small size and variable shape. In [8–10] the authors propose different EPVS segmentation methods. However, the detection of individual EPVS has not been addressed much in the literature. Only [9] proposes an automatic method for this, but that method requires pixel-wise annotated scans for training.

2 Methods

Our method takes as input a full 3D MRI brain scan, computes a smaller, smoothed region of interest (ROI) and inputs it to an FCN which computes a heatmap revealing the presence of the targeted brain lesions. The FCN is trained with weak labels. Its architecture changes between training and testing, but the optimized weights stay the same.

2.1 Preprocessing

Scans are first registered to MNI space. A binary mask segmenting a region of interest (ROI) in the brain is computed with standard algorithms [11]. This mask is dilated and its borders are smoothed with a Gaussian kernel. After applying the mask, we crop the image around the ROI, trim its highest values and rescale its intensities such that the input to the network is S ∈ [0, 1]^{h×w×d}.
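The preprocessing chain (mask dilation, Gaussian smoothing of the mask borders, cropping, trimming the highest intensities and rescaling to [0, 1]) can be sketched with NumPy/SciPy as below. The registration to MNI space and the ROI segmentation are assumed to be done by external tools (Elastix, FreeSurfer) and are not shown; the arrays and parameter values are illustrative.

```python
# Sketch of the ROI preprocessing: dilate and smooth the binary mask, apply
# it, crop around the ROI, trim the top intensities and rescale to [0, 1].
# `scan` and `roi_mask` are placeholders for the registered image and mask.
import numpy as np
from scipy.ndimage import binary_dilation, gaussian_filter

rng = np.random.RandomState(0)
scan = rng.rand(160, 120, 80) * 1000.0
roi_mask = np.zeros_like(scan, dtype=bool)
roi_mask[60:100, 40:80, 30:60] = True

dilated = binary_dilation(roi_mask, iterations=3)
soft_mask = gaussian_filter(dilated.astype(float), sigma=2.0)
masked = scan * soft_mask

idx = np.argwhere(dilated)
lo, hi = idx.min(axis=0), idx.max(axis=0) + 1
S = masked[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]      # crop around the ROI

S = np.clip(S, 0, np.percentile(S, 99))                # trim highest values
S = S / (S.max() + 1e-8)                               # rescale to [0, 1]
print(S.shape, S.min(), S.max())
```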

2.2 3D Regression Fully Convolutional Network

Figure 1 shows the architecture of our network. It is similar to that of 3D U-Net [1] but is less deep and has, during training, a global pooling layer [12] before the last layer.


Fig. 1. 3D Regression FCN Architecture. Top-left: network architecture (see Sect. 2.2). Bottom-right: during training, the network is built to solve a regression problem and outputs yˆ ∈ R computed according to Eq. (1). During testing, the global pooling layer is removed. Using Eq. (3), the network computes a heatmap M ∈ Rh×w×d of the same size as the network input volume S.

The use of this layer is detailed in the next section. Our network has 3 × 3 × 3 convolutional layers. In the encoding path, these layers are alternated with 2 × 2 × 2 pooling layers to downsample the image and increase the size of the receptive field. In the decoding path we use 2 × 2 × 2 upsampling layers. Skip connections connect the encoding and decoding paths. We do not use padding. After each convolutional layer, except for the last one, we compute a rectified linear unit activation. The depth and number of feature maps of the network are set to fit the available GPU memory. See Fig. 1 for the architecture details. We change the last layers of our network between training and testing. The testing step is performed with a standard fully convolutional architecture. The output is a heatmap of the size of the input. During training, we use only global image labels. To compute the loss function we therefore need to collapse the image output into a scalar. For this purpose, during training only, we introduce a global pooling layer before the last layer. In other words, we train a regression network and transform it into a segmentation network for testing (see Fig. 1). We detail this below. Training - Regression Network. After the last convolutional layer, instead of stacking the voxels of each feature map and feeding them to a fully connected layer, we use a global pooling layer, which computes one pooling operation per feature map. The resulting output of such a layer is a vector of scalars x ∈ R^n, with n ∈ N the number of feature maps of the preceding layer.


Let f_i ∈ R^{h×w×d}, for i ∈ {1, 2, .., n}, be the i-th feature map received by the global pooling layer. We can write the global pooling operation as g(f_i) = x_i, with x_i ∈ R. Let G be the underlying mapping of the global pooling layer and f ∈ R^{n×h×w×d} the vector containing the n feature maps f_i. We have G(f) ∈ R^n and in the case of global max pooling we can write (G(f))_i = max_{x,y,z} f_{i,x,y,z}.

A convolution layer with a single output feature map ŷ and a filter size of 1 × 1 × 1 can be written as

ŷ = Σ_i w_i h_i + b,    (1)

with w_i ∈ R the weights, b the bias and h_i the i-th feature map of the preceding layer. Considering the layer following the global pooling layer, we can write Eq. (1) as

ŷ = Σ_i w_i (G(f))_i + b = w · G(f) + b,    (2)

with · denoting the scalar product, ŷ ∈ R, b ∈ R and w ∈ R^n. The output of the network ŷ is the estimated lesion count. There is no activation function applied after this last convolutional layer. To optimize the weights, we compute a root mean square loss l = sqrt((1/m) Σ_{t=1}^{m} (ŷ_t − y_t)^2), where ŷ_t ∈ R is the predicted count, y_t ∈ N the ground truth for volume S_t and m the number of samples.

Testing - Detection Network. During testing we remove the global pooling layer, but we do not change the weights of the network. The h_i of Eq. (1) are the feature maps f_i defined earlier and the bias b can be omitted. We can thus rewrite Eq. (2) as

M = Σ_i w_i f_i,    (3)

with M ∈ R^{h×w×d} the new output of the network. M is a heatmap indicating the location of the targeted lesions. Note that the computation of M is mathematically equivalent to the class map computation proposed in [5]. However M has the same size as the input S and the weights w_i are optimized for a regression problem instead of classification. After computing the heatmap, we need to select an appropriate threshold before we can compute a segmentation and retrieve the location of the targeted lesions. This can efficiently be done by using the estimate of the lesion count ŷ provided by the neural network: we only need to keep the global pooling layer while testing. The threshold is then selected so that the number of connected components in the thresholded heatmap is equal to the lesion count.
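To make the training/testing switch concrete, the sketch below mirrors Eqs. (2) and (3) and the count-based threshold selection in plain NumPy/SciPy. It is an illustrative re-implementation under assumed variable names (feature_maps, w, b), not the authors' Keras/Theano code.

```python
import numpy as np
from scipy import ndimage

def training_output(feature_maps, w, b):
    """Training (Eq. (2)): global max pooling followed by the 1x1x1 convolution
    collapses the n feature maps into a single scalar, the predicted lesion count."""
    g = feature_maps.reshape(feature_maps.shape[0], -1).max(axis=1)  # G(f) in R^n
    return float(np.dot(w, g) + b)                                   # y_hat

def testing_heatmap(feature_maps, w):
    """Testing (Eq. (3)): with the global pooling removed, the same weights give
    a full-resolution heatmap M of shape (h, w, d)."""
    return np.tensordot(w, feature_maps, axes=1)

def threshold_by_count(heatmap, lesion_count, n_steps=200):
    """Select the threshold whose number of connected components best matches
    the lesion count predicted by the regression branch."""
    best_t, best_diff = heatmap.max(), np.inf
    for t in np.linspace(heatmap.min(), heatmap.max(), n_steps):
        _, n_cc = ndimage.label(heatmap > t)
        if abs(n_cc - lesion_count) < best_diff:
            best_t, best_diff = t, abs(n_cc - lesion_count)
    return heatmap > best_t
```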

3 Experiments

We evaluate our method on a testing set of 30 MRI scans and compute the sensitivity, false discovery rate and average number of false positives per image (Fig. 2).


Fig. 2. Examples of EPVS detection in the basal ganglia. Left: Ground truth on the intensities - the lesions are circled in green. Center: Heatmap from the neural network. True positives in green, false positives in blue and false negatives in red. Right: Segmentation of the heatmap.

Data. Our scans were acquired with a 1.5 T scanner. Our dataset is a subset of the Rotterdam Scan Study [13] and contains 1642 3D PD-weighted MRI scans. Note that in our dataset PD-weighted images have a signal intensity similar to T2-weighted images, the modality commonly used to detect EPVS. The voxel resolution is 0.49 × 0.49 × 0.8 mm³. Each scan of the dataset is annotated with a single, global image label: a trained observer counted the number of EPVS in the slice of the basal ganglia showing the anterior commissure. In a random subset of 30 scans, they marked the center of all EPVS visible in this same slice. These 30 scans are kept as the testing set. The remaining dataset is randomly split into the following subsets: 1289 scans for training and 323 for validation.

Experimental Settings. The initial ROI segmentation of the BG is computed with the subcortical segmentation of FreeSurfer [11]. For registration to MNI space we use the rigid registration implemented in Elastix [14] with mutual information as similarity measure and default settings. The Gaussian blurring kernel has a standard deviation σ = 2. The preprocessed image S has a size of 168 × 128 × 84 voxels. We trim its 1% highest values. We initialize the weights of the CNN by sampling from a Gaussian distribution, use Adadelta [15] for optimization and augment the training data with randomly transformed samples. A lesion is counted as detected if the center of a connected component in the thresholded heatmap is located within a radius of x pixels of the annotation. In our experiments we set x = 3, which corresponds to the average radius of the largest lesions. We implemented our algorithms in Python with the Keras and Theano libraries and ran the experiments on a Nvidia GeForce GTX 1070 GPU.

Baseline Methods. We compare our method with four conventional approaches, using the same preprocessing (Sect. 2.1). The first method (a) thresholds the input image based on its intensities: EPVS are among the most hyperintense structures in the basal ganglia. Both the second and third methods compute saliency maps [16] using a neural network. Saliency maps are obtained by computing the gradient of the regression score with respect to the input image. These maps can be easily computed given any classification or regression


Table 1. Results of our method in comparison to the baselines. TPR stands for true positive rate (i.e. sensitivity), FPav is the average number of false positives per image and FDR stands for false discovery rate. The compared methods are described in Sect. 3.

Method                      TPR   FPav  FDR
Intensities (a)             40.6  2.3   59.8
Saliency (b)                39.8  2.7   54.2
Saliency FCN (c)            18.7  3.3   70.2
Regression (d)              19.6  3.2   70.5
Regression FCN (e)          54.8  1.9   37.7
Intensities + Reg FCN (f)   62.0  1.5   31.4

network architecture. The third method (c) uses the same network architecture as ours, the second one (b) uses a regression network without the upsampling path. Finally we compared with a method similar to [5], where we replace the original classification objective with a regression. This network is the same as ours but without the upsampling part. The thresholds are chosen based on the lesion count estimated by each network, as explained in Sect. 2.

Results. In Table 1 we report sensitivity, false discovery rate and average number of false positives per image (FPav). Our method (e) detects 54.8% of the lesions with on average 1.9 false positives per image. We found that the performance could be improved further by thresholding the weighted sum of the heatmap and the original voxel intensities of the image (f). This improves the sensitivity to 62.0% with 1.5 FPav. Considering the sensitivity, our method outperforms the other baseline methods by more than 20% (Table 1). By modifying the heatmap threshold to overestimate the lesion count, we can increase the sensitivity further - to 71.1% - at the price of more false positives - 4.4 on average and a false discovery rate of 60% (Fig. 3).
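For reference, the matching criterion behind these numbers (a lesion counts as detected if a predicted connected component lies within 3 voxels of the annotated centre, as described in the experimental settings) can be sketched as follows. The function, variable names and the exact matching strategy are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np
from scipy import ndimage

def detection_metrics(binary_heatmap, annotation_points, radius=3.0):
    # binary_heatmap: thresholded heatmap (3D bool); annotation_points: (k, 3) annotated centres
    labels, n_cc = ndimage.label(binary_heatmap)
    centers = np.array(ndimage.center_of_mass(binary_heatmap, labels,
                                              range(1, n_cc + 1))).reshape(-1, 3)
    ann = np.asarray(annotation_points, dtype=float).reshape(-1, 3)

    # a ground-truth lesion is detected if some predicted centre lies within `radius`
    detected = np.array([np.any(np.linalg.norm(centers - a, axis=1) <= radius)
                         for a in ann]) if len(centers) else np.zeros(len(ann), bool)
    # a predicted component with no nearby annotation is a false positive
    fp = np.array([np.all(np.linalg.norm(ann - c, axis=1) > radius)
                   for c in centers]) if len(ann) else np.ones(len(centers), bool)

    tpr = detected.mean() if len(ann) else float("nan")
    fdr = fp.mean() if len(centers) else 0.0
    return tpr, int(fp.sum()), fdr   # sensitivity, false positives, false discovery rate
```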

4 Discussion

Our experiments show that using a fully convolutional network and optimizing its weights for a regression problem gives the best results on our dataset (Fig. 3 and Table 1). The methods in [5,6] produce coarse localization heatmaps on which it is impossible to locate small structures. In our early experiments we trained similar networks, optimized for classification. The resulting heatmaps highlight the whole ROI, which does not help to solve the detection problem. Using a regression network, the more precise output allows us to extract finer-scale information, but without an upsampling path, the results are still imprecise (see Fig. 3, method (d)). In our experiments, saliency maps [16] perform similarly to voxel intensities. By construction these maps are noisy (Fig. 3, method (b)).


Fig. 3. Comparison with baselines. Left: Heatmaps of the different methods. EPVS are circled in green on the original image (a). The saliency maps (b, c(FCN)) are very noisy. Our method (e) has finer details than a network without upsampling path (d) similar to [5]. Combining the intensities with our heatmap (f) offers the best results. Right: Free-ROC curves for each method, showing the FPav on the x-axis and the sensitivity on the y-axis.

Considering other works in automatic detection of EPVS, in [9] the authors compute an EPVS segmentation using a random forest classifier. They train this classifier with pixelwise ground truth annotations. They also use a 7 T scanner which produces high resolution images and eases small lesion detection. They report a sensitivity of 78% and a false discovery rate of 55%. This is better than our 71% sensitivity and 60% false discovery rate. However contrary to [9] which requires pixelwise annotation, our method uses only weak labels for training. In [8,10], the authors also compute segmentations, but the evaluation is limited to a correlation with a five-category visual score. Overall we consider that our detection results are satisfactory, considering that, during training, our method did not use any ground truth information on the location or shape of the lesions, but only their count within the whole ROI.

5 Conclusion

We presented a novel 3D neural network to detect small structures from global image-level labels. Our method combines a fully convolutional architecture with global pooling. During training the weights are optimized to solve a regression problem, while during testing the network computes lesion heatmaps of the size of the input. We demonstrated the potential of our method to detect enlarged perivascular spaces in brain MRI. We obtained a sensitivity of 62% and on average 1.5 false positives per scan. Our method outperforms four other approaches with an increase of more than 20% in sensitivity. We believe this method can be applied to the detection of other brain lesions. Acknowledgments. This research was funded by The Netherlands Organisation for Health Research and Development (ZonMw) Project 104003005.


References ¨ Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: 1. C ¸ i¸cek, O., learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 49 2. Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A.: VoxResNet: deep voxelwise residual networks for volumetric brain segmentation. NeuroImage (2017) 3. Bortsova, G., van Tulder, G., Dubost, F., Peng, T., Navab, N., van der Lugt, A., Bos, D., de Bruijne, M.: Segmentation of intracranial arterial calcification with deeply supervised residual dropout networks. In: Descoteaux, M., et al. (eds.) MICCAI 2017, Part III. LNCS, vol. 10435, pp. 359–367. Springer, Cham (2017). doi:10. 1007/978-3-319-66179-7 41 4. Long, J., Shelhamer, E. and Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015) 5. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016) 6. Hwang, S., Kim, H.-E.: Self-transfer learning for weakly supervised lesion localization. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 239–246. Springer, Cham (2016). doi:10.1007/ 978-3-319-46723-8 28 7. Ronneberger, O., Fischer, P., Brox, T.: Convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241 (2015) 8. Ramirez, J., Berezuk, C., McNeely, A.A., Scott, C.J., Gao, F., Black, S.E.: Visible Virchow-Robin spaces on magnetic resonance imaging of Alzheimer’s disease patients and normal elderly from the Sunnybrook Dementia Study. J. Alzheimers Dis. 43(2), 415–424 (2015) 9. Park, S.H., Zong, X., Gao, Y., Lin, W., Shen, D.: Segmentation of perivascular spaces in 7T MR image using auto-context model with orientation-normalized features. NeuroImage 134, 223–235 (2016) 10. Ballerini, L., Lovreglio, R., Hernandez, M., del C. Vald´es Hern´ andez, M., Maniega, S.M., Pellegrini, E., Wardlaw, J.M.: Application of the ordered logit model to optimising frangi filter parameters for segmentation of perivascular spaces. Procedia Comput. Sci. 90, 6167 (2016) 11. Desikan, R.S., Sgonne, F., Fischl, B., Quinn, B.T., Dickerson, B.C., Blacker, D., Buckner, R.L., Dale, A.M., Maguire, R.P., Hyman, B.T., Albert, M.S.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31(3), 968–980 (2006) 12. Lin, M., Chen, Q., Yan, S.: Network in network. In: ICLR (2014) 13. Hofman, A., Brusselle, G.G., Murad, S.D., van Duijn, C.M., Franco, O.H., Goedegebure, A., Ikram, M.A., Klaver, C.C., Nijsten, T.E., Peeters, R.P., Stricker, B.H.C.: The Rotterdam Study: 2016 objectives and design update. Eur. J. Epidemiol. 30(8), 661–708 (2015) 14. Klein, S., Staring, M., Murphy, K., Viergever, M.A., Pluim, J.P.W.: Elastix: a toolbox for intensity based medical image registration. TMI 29(1), 196–205 (2010) 15. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv preprint. arxiv:1212.5701 16. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. In: CVPR (2014)

Deep Supervision for Pancreatic Cyst Segmentation in Abdominal CT Scans Yuyin Zhou1 , Lingxi Xie1(B) , Elliot K. Fishman2 , and Alan L. Yuille1 1

2

The Johns Hopkins University, Baltimore, MD 21218, USA [email protected], [email protected], [email protected] The Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA [email protected] http://ml.cs.tsinghua.edu.cn/~lingxi/Projects/PanCystSeg.html

Abstract. Automatic segmentation of an organ and its cystic region is a prerequisite of computer-aided diagnosis. In this paper, we focus on pancreatic cyst segmentation in abdominal CT scan. This task is important and very useful in clinical practice yet challenging due to the low contrast in boundary, the variability in location, shape and the different stages of the pancreatic cancer. Inspired by the high relevance between the location of a pancreas and its cystic region, we introduce extra deep supervision into the segmentation network, so that cyst segmentation can be improved with the help of relatively easier pancreas segmentation. Under a reasonable transformation function, our approach can be factorized into two stages, and each stage can be efficiently optimized via gradient back-propagation throughout the deep networks. We collect a new dataset with 131 pathological samples, which, to the best of our knowledge, is the largest set for pancreatic cyst segmentation. Without human assistance, our approach reports a 63.44% average accuracy, measured by the Dice-Sørensen coefficient (DSC), which is higher than the number (60.46%) without deep supervision.

1 Introduction

In 2012, pancreatic cancers of all types were the 7th most common cause of cancer deaths, resulting in 330,000 deaths globally [15]. By the time of diagnosis, pancreatic cancer has often spread to other parts of the body. Therefore, it is very important to use medical imaging analysis to assist identifying malignant cysts in the early stages of pancreatic cancer to increase the survival chance of a patient [3]. The emergence of deep learning has largely advanced the field of computer-aided diagnosis (CAD). With the help of the state-of-the-art deep convolutional neural networks [7,14], such as the fully-convolutional networks (FCN) [10] for semantic segmentation, researchers have achieved accurate segmentation on many abdominal organs. There are often different frameworks for segmenting different organs [1,13]. Meanwhile, it is of great interest to find the lesion area in an organ [2,5,16], which, frequently, is even more challenging due to the tiny volume and variable properties of these parts.


This paper focuses on segmenting the pancreatic cyst from abdominal CT scans. The pancreas is one of the abdominal organs that are very difficult to segment even in healthy cases [12,13,17], mainly due to the low contrast at the boundary and the high variability in its geometric properties. In the pathological cases, the difference in the pancreatic cancer stage also impacts both the morphology of the pancreas and the cyst [4,8]. Despite the importance of pancreatic cyst segmentation, this topic is less studied: some of the existing methods are based on old-fashioned models [6], and a state-of-the-art approach [3] requires a bounding box of the cyst to be annotated beforehand, as well as a lot of interactive operations throughout the segmentation process to annotate some voxels on or off the target. These requirements are often impractical when the user is not knowledgeable in medicine (e.g., a common patient). This paper presents the first system which produces reasonable pancreatic cyst segmentation without human assistance at the testing stage. Intuitively, the pancreatic cyst is often closely related to the pancreas, and thus segmenting the pancreas (relatively easier) may assist the localization and segmentation of the cyst. To this end, we introduce deep supervision [9] into the original segmentation network, leading to a joint objective function taking both the pancreas and the cyst into consideration. Using a reasonable transformation function, the optimization process can be factorized into two stages, in which we first find the pancreas, and then localize and segment the cyst based on the predicted pancreas mask. Our approach works efficiently based on a recently published coarse-to-fine segmentation approach [17]. We perform experiments on a newly collected dataset with 131 pathological samples from CT scans. Without human assistance at the testing stage, our approach achieves an average Dice-Sørensen coefficient (DSC) of 63.44%, which is practical for clinical applications.

2 Approach

2.1 Formulation

Let a CT-scanned image be a 3D volume X. Each volume is annotated with ground-truth pancreas segmentation P⋆ and cyst segmentation C⋆, and both of them are of the same dimensionality as X. P⋆_i = 1 and C⋆_i = 1 indicate a foreground voxel of pancreas and cyst, respectively. Denote a cyst segmentation model as M : C = f(X; Θ), where Θ denotes the model parameters. The loss function can be written as L(C, C⋆). In a regular deep neural network such as our baseline, the fully-convolutional network (FCN) [10], we optimize L with respect to the network weights Θ via gradient back-propagation. To deal with small targets, we follow [11] to compute the DSC loss function: L(C, C⋆) = 2 × Σ_i C_i C⋆_i / (Σ_i C_i + Σ_i C⋆_i). The gradient ∂L(C, C⋆)/∂C can be easily computed. The pancreas is a small organ in the human body, which typically occupies less than 1% of the voxels in a CT volume. In comparison, the pancreatic cyst is even smaller. In our newly collected dataset, the fraction of the cyst, relative to the entire volume, is often much smaller than 0.1%. In a very challenging case, the cyst only


Fig. 1. A relatively difficult case in pancreatic cyst segmentation and the results produced by different input regions, namely using the entire image and the region around the ground-truth pancreas mask (best viewed in color). The cystic, predicted and overlapping regions are marked by red, green and yellow, respectively. For better visualization, the right two figures are zoomed in w.r.t. the red frame.

occupies 0.0015% of the volume, or around 1.5% of the pancreas. This largely increases the difficulty of segmentation or even localization. Figure 1 shows a representative example where cyst segmentation fails completely when we take the entire 2D slice as the input. To deal with this problem, we note that the location of the pancreatic cyst is highly relevant to the pancreas. Denote the set of voxels of the pancreas as P⋆ = {i | P⋆_i = 1}, and similarly, the set of cyst voxels as C⋆ = {i | C⋆_i = 1}. Frequently, a large fraction of C⋆ falls within P⋆ (e.g., |P⋆ ∩ C⋆| / |C⋆| > 95% in 121 out of 131 cases in our dataset). Starting from the pancreas mask increases the chance of accurately segmenting the cyst. Figure 1 shows an example of using the ground-truth pancreas mask to recover the failure case of cyst segmentation. This inspires us to perform cyst segmentation based on the pancreas region, which is relatively easy to detect. To this end, we introduce the pancreas mask P as an explicit variable of our approach, and append another term to the loss function to jointly optimize both pancreas and cyst segmentation networks. Mathematically, let the pancreas segmentation model be M_P : P = f_P(X; Θ_P), and the corresponding loss term be L_P(P, P⋆). Based on P, we create a smaller input region by applying a transformation X′ = σ[X, P], and feed X′ to the next stage. Thus, the cyst segmentation model can be written as M_C : C = f_C(X′; Θ_C), and we have the corresponding loss term L_C(C, C⋆). To optimize both Θ_P and Θ_C, we consider the following loss function:

L(P, P⋆, C, C⋆) = λ L_P(P, P⋆) + (1 − λ) L_C(C, C⋆),    (1)

where λ is the balancing parameter defining the weight between the two terms.

2.2 Optimization

We use gradient descent for optimization, which involves computing the gradients of L over Θ_P and Θ_C. Among these, ∂L/∂Θ_C = ∂L_C/∂Θ_C, and thus we can compute it via


Fig. 2. The framework of our approach (best viewed in color). Two deep segmentation networks are stacked, and two loss functions are computed. The predicted pancreas mask is used in transforming the input image for cyst segmentation.

standard back-propagation in a deep neural network. On the other hand, Θ_P is involved in both loss terms, and applying the chain rule yields:

∂L/∂Θ_P = ∂L_P/∂Θ_P + ∂L_C/∂X′ · ∂X′/∂P · ∂P/∂Θ_P.    (2)

The second term on the right-hand side depends on the definition of X′ = σ[X, P]. In practice, we define a simple transformation to simplify the computation. The intensity value (directly related to the Hounsfield units in CT scan) of each voxel is either preserved or set to 0, and the criterion is whether there exists a nearby voxel which is likely to fall within the pancreas region:

X′_i = X_i × I{∃j | P_j > 0.5 ∧ |i − j| < t},    (3)

where t is the threshold which is the farthest distance from a cyst voxel to the pancreas volume. We set t = 15 in practice, and our approach is not sensitive to this parameter. With this formulation, ∂X′_i/∂P_j = 0 almost everywhere. Thus, we have ∂X′/∂P = 0 and ∂L/∂Θ_P = ∂L_P/∂Θ_P. This allows us to factorize the optimization into two stages in both training and testing. Since ∂L/∂Θ_P and ∂L/∂Θ_C are individually optimized, the balancing parameter λ in Eq. (1) can be ignored. The overall framework is illustrated in Fig. 2. In training, we directly set X′ = σ[X, P⋆], so that the cyst segmentation model M_C receives more reliable supervision. In testing, starting from X, we compute P, X′ and C in order. Dealing with the two stages individually reduces the computational overheads. It is also possible to formulate the second stage as multi-label segmentation. The implementation details follow our recent work [17], which achieves the state-of-the-art performance in the NIH pancreas segmentation dataset [13]. Due to the limited amount of training data, instead of applying 3D networks directly, we cut each 3D volume into a series of 2D pieces, and feed them into a fully-convolutional network (FCN) [10]. This operation is performed along three directions, namely the coronal, sagittal and axial views. At the testing stage, the cues


from three views are fused, and each voxel is considered to fall on the foreground if at least two views predict so. In the pathological dataset, we observe a large variance in the morphology of both pancreas and cyst, which increases the difficulty for the deep network to converge in the training process. Consequently, using one single model may result in less stable segmentation results. In practice, we use the FCN-8s model [10] initialized with the pre-trained weights on the PascalVOC dataset. This model is based on a 16-layer VGGNet [14], and we believe a deeper network may lead to better results. We fine-tune it through 60 K iterations with a learning rate of 10^−5. Nine models, namely the snapshots after {20 K, 25 K, . . . , 60 K} iterations, are used to test each volume, and the final result is obtained by computing the union of the nine predicted foreground voxel sets. Regarding other technical details, we simply follow [17], including using the DSC loss layer instead of the voxel-wise loss layer to prevent the bias towards background [11], and applying a flexible convergence condition for fine-tuning at the testing stage.
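The transformation of Eq. (3) and the three-view fusion described above can be rendered schematically as follows. This is a hedged NumPy/SciPy sketch under assumed names, not the authors' implementation; in particular the distance transform ignores anisotropic voxel spacing.

```python
import numpy as np
from scipy import ndimage

def crop_transform(X, P, t=15):
    """Eq. (3): keep the intensity of voxel i iff some voxel j with P_j > 0.5 lies
    within distance t, implemented via a Euclidean distance transform."""
    dist_to_pancreas = ndimage.distance_transform_edt(P <= 0.5)
    return X * (dist_to_pancreas < t)

def fuse_views(prob_coronal, prob_sagittal, prob_axial, thr=0.5):
    """A voxel is foreground if at least two of the three views predict so."""
    votes = sum((p > thr).astype(np.uint8)
                for p in (prob_coronal, prob_sagittal, prob_axial))
    return votes >= 2
```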

3 Experiments

3.1 Dataset and Evaluation

We evaluate our approach on a dataset collected by the radiologists in our team. This dataset contains 131 contrast-enhanced abdominal CT volumes, and each of them is manually labeled with both pancreas and pancreatic cyst masks. The resolution of each CT scan is 512 × 512 × L, where L ∈ [358, 1121] is the number of sampling slices along the long axis of the body. The slice thickness varies from 0.5 mm–1.0 mm. We split the dataset into 4 fixed folds, and each of them contains approximately the same number of samples. We apply cross validation, i.e., training our approach on 3 out of 4 folds and testing it on the remaining one. We measure the segmentation accuracy by computing the Dice-Sørensen Coefficient (DSC) for each 3D volume. This is a similarity metric between the prediction voxel set A and the ground-truth set G; its mathematical form is DSC(A, G) = 2 × |A ∩ G| / (|A| + |G|). We report the average DSC score together with other statistics over all 131 testing cases from 4 testing folds.
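The per-volume metric is straightforward to compute; a minimal sketch, assuming the prediction and ground truth are boolean 3D arrays:

```python
import numpy as np

def dsc(A, G):
    """Dice-Sørensen coefficient between prediction A and ground truth G."""
    A, G = np.asarray(A, bool), np.asarray(G, bool)
    return 2.0 * np.logical_and(A, G).sum() / float(A.sum() + G.sum())
```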

3.2 Results

Cystic Pancreas Segmentation. We first investigate pathological pancreas segmentation which serves as the first stage of our approach. With the baseline approach described in [17], we obtain an average DSC of 79.23 ± 9.72%. Please note that this number is lower than 82.37 ± 5.68%, which was reported by the same approach in the NIH pancreas segmentation dataset with 82 healthy samples. Meanwhile, we report 34.65% DSC in the worst pathological case, while this number is 62.43% in the NIH dataset [17]. Therefore, we can conclude that a cystic pancreas is more difficult to segment than a normal case (Table 1).


Table 1. Pancreas and cyst segmentation accuracy, measured by DSC (%), produced by different approaches. Bold fonts indicate the results where oracle information (ground-truth bounding box) is not used. With deep supervision, the average accuracy of cyst segmentation is improved, and the standard deviation is decreased.

Method                                       Mean DSC        Max/Min DSC
Pancreas Segmentation, w/GT Pancreas B-Box   83.99 ± 4.33    93.82/69.54
Pancreas Segmentation                        79.23 ± 9.72    92.95/34.65
Cyst Segmentation, w/GT Cyst B-Box           77.92 ± 12.83   96.14/24.69
Cyst Segmentation, w/o Deep Supervision      60.46 ± 31.37   95.67/0.00
Cyst Segmentation, w/Deep Supervision        63.44 ± 27.71   95.55/0.00

Pancreatic Cyst Segmentation. Based on the predicted pancreas mask, we now study the pancreatic cyst segmentation which is the second stage in our approach. Over 131 testing cases, our approach reports an average DSC of 63.44 ± 27.71%, obtaining a 2.98% absolute or 4.92% relative accuracy gain over the baseline. The high standard deviation (27.71%) indicates the significant variance in the difficulty of cyst segmentation. On the one hand, our approach can report rather high accuracy (e.g., >95% DSC) in some easy cases. On the other hand, in some challenging cases, if the oracle cyst bounding box is unavailable, both approaches (with or without deep supervision) can come to a complete failure (i.e., DSC is 0%). In comparison, our approach with deep supervision misses 8 cyst cases, while the version without deep supervision misses 16. To the best of our knowledge, pancreatic cyst segmentation has rarely been studied previously. A competitor is [3], published in 2016, which combines random walk and region growth for segmentation. However, it requires the user to annotate the region-of-interest (ROI) beforehand, and provide interactive annotations on foreground/background voxels throughout the segmentation process. In comparison, when the bounding box is provided or not, our approach achieves 77.92% and 63.44% average accuracies, respectively. Being cheap or free of extra annotation, our approach can be widely applied to automatic diagnosis, especially for common users without professional knowledge in medicine.


Fig. 3. Sample pancreas and pancreatic cyst segmentation results (best viewed in color). From left to right: input image (in which pancreas and cyst are marked in red and green, respectively), pancreas segmentation result, and cyst segmentation results when we apply deep supervision (denoted by +) or not (−). The figures in the right three columns are zoomed in w.r.t. the red frames. In the last example, pancreas segmentation fails in this slice, resulting in a complete failure in cyst segmentation.

that cyst segmentation can sometimes help pancreas segmentation, and this topic is left for future research.

4 Conclusions

This paper presents the first system for pancreatic cyst segmentation which can work without human assistance on the testing stage. Motivated by the high relevance of a cystic pancreas and a pancreatic cyst, we formulate pancreas segmentation as an explicit variable in the formulation, and introduce deep supervision to assist the network training process. The joint optimization can be factorized into two stages, making our approach very easy to implement. We collect a dataset with 131 pathological cases. Based on a coarse-to-fine segmentation algorithm, our approach produces reasonable cyst segmentation results. It is worth emphasizing that our approach does not require any extra human annotations on the testing stage, which is especially practical in assisting common patients in cheap and periodic clinical applications. This work teaches us that a lesion can be detected more effectively by considering its highly related organ(s). This knowledge, being simple and straightforward, is useful in the future work in pathological organ or lesion segmentation.


Acknowledgements. This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research. We thank Dr. Seyoun Park for enormous help.

References 1. Al-Ayyoub, M., Alawad, D., Al-Darabsah, K., Aljarrah, I.: Automatic detection and classification of brain hemorrhages. WSEAS Trans. Comput. 12(10), 395–405 (2013) 2. Christ, P., Ettlinger, F., Gr¨ un, F., Elshaera, M., Lipkova, J., Schlecht, S., Ahmaddy, F., Tatavarty, S., Bickel, M., Bilic, P., et al.: Automatic Liver and Tumor Segmentation of CT and MRI Volumes using Cascaded Fully Convolutional Neural Networks. arXiv preprint arXiv:1702.05970 (2017) 3. Dmitriev, K., Gutenko, I., Nadeem, S., Kaufman, A.: Pancreas and Cyst Segmentation. SPIE Medical, Imaging, p. 97842C (2016) ´ 4. Gintowt, A., Hac, S., Dobrowolski, S., Sledzi´ nski, Z.: An unusual presentation of pancreatic pseudocyst mimicking cystic neoplasm of the pancreas: a case report. Cases J. 2(1), 9138 (2009) 5. Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P., Larochelle, H.: Brain Tumor Segmentation with Deep Neural Networks. Medical Image Analysis (2017) 6. Klauß, M., Sch¨ obinger, M., Wolf, I., Werner, J., Meinzer, H., Kauczor, H., Grenacher, L.: Value of three-dimensional reconstructions in pancreatic carcinoma using multidetector CT: initial results. World J. Gastroenterol. 15(46), 5827–5832 (2009) 7. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012) 8. Lasboo, A., Rezai, P., Yaghmai, V.: Morphological analysis of pancreatic cystic masses. Acad. Radiol. 17(3), 348–351 (2010) 9. Lee, C., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-Supervised Nets. In: International Conference on Artificial Intelligence and Statistics (2015) 10. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Computer Vision and Pattern Recognition (2015) 11. Milletari, F., Navab, N., Ahmadi, S.: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. arXiv preprint arXiv:1606.04797 (2016) 12. Roth, H.R., Lu, L., Farag, A., Sohn, A., Summers, R.M.: Spatial aggregation of holistically-nested networks for automated pancreas segmentation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 451–459. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 52 13. Roth, H.R., Lu, L., Farag, A., Shin, H.-C., Liu, J., Turkbey, E.B., Summers, R.M.: DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 556–564. Springer, Cham (2015). doi:10.1007/ 978-3-319-24553-9 68


14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015) 15. Stewart, B., Wild, C., et al.: World Cancer Report 2014 (2014) 16. Wang, D., Khosla, A., Gargeya, R., Irshad, H., Beck, A.: Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718 (2016) 17. Zhou, Y., Xie, L., Shen, W., Fishman, E., Yuille, A.: Pancreas Segmentation in Abdominal CT Scan: A Coarse-to-Fine Approach. arXiv preprint arXiv:1612.08230 (2016)

Error Corrective Boosting for Learning Fully Convolutional Networks with Limited Data Abhijit Guha Roy1,2,3(B) , Sailesh Conjeti2 , Debdoot Sheet3 , Amin Katouzian4 , Nassir Navab2,5 , and Christian Wachinger1 1

5

Artificial Intelligence in Medical Imaging (AI-Med), KJP, LMU M¨ unchen, Munich, Germany 2 Computer Aided Medical Procedures, Technische Universit¨ at M¨ unchen, Munich, Germany [email protected] 3 Indian Institute of Technology Kharagpur, Kharagpur, WB, India 4 IBM Almaden Research Center, Almaden, USA Computer Aided Medical Procedures, Johns Hopkins University, Baltimore, USA

Abstract. Training deep fully convolutional neural networks (F-CNNs) for semantic image segmentation requires access to abundant labeled data. While large datasets of unlabeled image data are available in medical applications, access to manually labeled data is very limited. We propose to automatically create auxiliary labels on initially unlabeled data with existing tools and to use them for pre-training. For the subsequent fine-tuning of the network with manually labeled data, we introduce error corrective boosting (ECB), which emphasizes parameter updates on classes with lower accuracy. Furthermore, we introduce SkipDeconvNet (SD-Net), a new F-CNN architecture for brain segmentation that combines skip connections with the unpooling strategy for upsampling. The SD-Net addresses challenges of severe class imbalance and errors along boundaries. With application to whole-brain MRI T1 scan segmentation, we generate auxiliary labels on a large dataset with FreeSurfer and fine-tune on two datasets with manual annotations. Our results show that the inclusion of auxiliary labels and ECB yields significant improvements. SD-Net segments a 3D scan in 7 s in comparison to 30 h for the closest multi-atlas segmentation method, while reaching similar performance. It also outperforms the latest state-of-the-art F-CNN models.

1 Introduction

Fully convolutional neural networks (F-CNNs) have gained high popularity for image segmentation in computer vision [1–3] and biomedical imaging [4,5]. They directly produce a segmentation for all image pixels in an end-to-end fashion without the need of splitting the image into patches. F-CNNs can therefore fully exploit the image context avoiding artificial partitioning of an image, Electronic supplementary material The online version of this chapter (doi:10. 1007/978-3-319-66179-7 27) contains supplementary material, which is available to authorized users. c Springer International Publishing AG 2017  M. Descoteaux et al. (Eds.): MICCAI 2017, Part III, LNCS 10435, pp. 231–239, 2017. DOI: 10.1007/978-3-319-66179-7 27


Fig. 1. Illustration of the different steps involved in training of F-CNNs with surplus auxiliary labeled data and limited manually labeled data.

which also results in an enormous speed-up. Yet, training F-CNNs is challenging because each image serves as a single training sample and consequently much larger datasets with manual labels are required in comparison to patch-based approaches, where each image provides multiple patches. While the amount of unlabeled data rapidly grows, the access to labeled data is still limited due to the labour intense process of manual annotations. At the same time, the success of deep learning is mainly driven by supervised learning, while unsupervised approaches are still an active field of research. Data augmentation [4] artificially increases the training dataset by simulating different variations of the same data, but it cannot encompass all possible morphological variations. We propose to process unlabeled data with existing automated software tools to create auxiliary labels. These auxiliary labels may not be comparable to manual expert annotations, however, they allow us to efficiently leverage the vast amount of initially unlabeled data for supervised pre-training of the network. We also propose to fine-tune such a pre-trained network using error corrective boosting (ECB), that selectively focuses on classes with erroneous segmentations. In this work, we focus on whole-brain segmentation of MRI T1 scans. To this end, we introduce a new F-CNN architecture for segmentation, termed SkipDeconv-Net (SD-Net). It combines skip connections from the U-net [4] with the passing of indices for unpooling similar to DeconvNet [2]. This architecture provides rich context information while facilitating the segmentation of small structures. To counter the severe class imbalance problem in whole-brain segmentation, we use median frequency balancing [3] together with placing emphasis on the correct segmentation along anatomical boundaries. For the creation of auxiliary labels, we segment brain scans with FreeSurfer [6], a standard tool for automated labeling in neuroimaging. Figure 1 shows the steps involved in the training process. First, we train SD-Net on a large amount of data with corresponding auxiliary labels, in effect creating a network that imitates FreeSurfer, referred as FS-Net. Second, we fine-tune FS-Net with limited manually labeled data with ECB, to improve the segmentation incorrectly represented by FS-Net. Related work: F-CNN models have recently attracted much attention in segmentation. The FCN model [1] up-samples the intermediate pooled feature maps


with bilinear interpolation, while the DeconvNet [2] up-samples with indices from the pooling layers, to reach final segmentation. For medical images, U-net was proposed consisting of an encoder-decoder network with skip connections [4]. For MRI T1, eight sub-cortical structures were segmented using an F-CNN model, with slices in [7] and with patches in [8]. Whole-brain segmentation with CNN using 3D patches was presented in [9,10]. To the best of our knowledge, this work is the first F-CNN model for whole-brain segmentation. To address the challenge of training a deep network with limited annotations, previous works fine-tune models pre-trained for classification on natural images [11,12]. In fine-tuning, the training data is replaced by data from the target application with additional task specific layers and except for varying the learning rate, the same training procedure is used. With ECB, we change the class specific penalty in the loss function to focus on regions with high inaccuracies. Furthermore, instead of relying on pre-training on natural images that exhibit substantially different image statistics and are composed of three color channels, we propose using auxiliary labels to directly pre-train an F-CNN, tailored for segmenting T1 scans.

2 Method

2.1 SD-Net for Image Segmentation

We describe the architecture, loss function, and model learning of the proposed SD-Net for image segmentation in the following section: Architecture: The SD-Net has an encoder-decoder based F-CNN architecture consisting of three encoder and three decoder blocks followed by a classifier with softmax to estimate probability maps for each of the classes. It combines skip connections from U-net [4] and the passing of indices for unpooling from DeconvNet [2], hence the name SkipDeconv-Net (SD-Net). We use skip connections between the encoder and decoder as they provide rich contextual information for segmentation and also a path for the gradients to flow from the shallower decoder to the deeper encoder part during training. In contrast to U-net where upsampling is done by convolution, we use unpooling, which offers advantages for segmenting small structures by placing the activation at the proper spatial location. Figure 2 illustrates the network architecture for segmenting a 2D image. Each encoder block consists of a 7 × 7 convolutional layer, followed by a batch normalization layer and a ReLU (Rectifier Linear Unit) activation function. Appropriate padding is provided before every convolution to ensure similar spatial dimensions of input and output. With the 7 × 7 kernels, we have an effective receptive field at the lowest encoder block that almost captures the entire brain mask. It therefore presents a good trade-off between model complexity and the capability of learning long-range connections. Each encoder block is followed by a max pooling layer, reducing the spatial dimension of feature maps by half. Each decoder block consists of an unpooling layer, a concatenation by skip connection, a 7×7 convolutional layer, batch normalization and ReLU function. The unpooling layer upsamples the spatial dimension of the input feature


Fig. 2. Illustration of the proposed SkipDeconv-Net (SD-Net) architecture.

map by using the saved indices with maximum activation during max pooling of the corresponding encoder block. The remaining locations are filled with zeros. Unpooling does not require estimating parameters, in contrast to the up-convolution in U-net. The unpooled feature maps are concatenated with the feature maps of the encoder part that have the same spatial dimension. The following convolution layer densifies the sparse unpooled feature maps for smooth prediction. The classifier consists of a 1 × 1 convolutional layer to transfer the 64-dimensional feature map to a dimension corresponding to the number of classes (N), followed by a softmax layer.

Loss Function: SD-Net is trained by optimizing two loss functions: (i) weighted multi-class logistic loss and (ii) Dice loss. The logistic loss provides a probabilistic measure of similarity between the prediction and ground truth. The Dice loss is inspired by the Dice overlap ratio and yields a true positive count based estimate of similarity [5]. Given the estimated probability p_l(x) at pixel x to belong to the class l and the ground truth probability g_l(x), the loss function is

L = − Σ_x ω(x) g_l(x) log(p_l(x))  −  2 Σ_x p_l(x) g_l(x) / (Σ_x g_l²(x) + Σ_x p_l²(x)),    (1)

where the first term is the logistic loss and the second term the Dice loss.

We introduce weights ω(x) to tailor the loss function to challenges that we have encountered in image segmentation: the class imbalance and the segmentation errors along anatomical boundaries. Given the frequency f_l of class l in the training data, i.e., the class probability, the indicator function I, the training segmentation S, and the 2D gradient operator ∇, the weights are defined as

ω(x) = Σ_l I(S(x) == l) · median(f) / f_l + ω_0 · I(|∇S(x)| > 0)    (2)


with the vector of all frequencies f = [f_1, . . . , f_N]. The first term models median frequency balancing [3] and compensates for the class imbalance problem by highlighting classes with low probability. The second term puts higher weight on anatomical boundary regions to emphasize the correct segmentation of contours. ω_0 balances the two terms.

Model Learning: We learn the SD-Net with stochastic gradient descent. The learning rate is initially set to 0.1 and reduced by one order after every 20 epochs till convergence. The weight decay is set to 0.0001. Mini batches of size 10 images are used, constrained by the 12 GB RAM of the Tesla K40 GPU. A high momentum of 0.9 is set to compensate for this small batch size.
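A per-pixel weight map following Eq. (2) can be sketched as below. The label map S, the class-frequency vector f and ω_0 correspond to the symbols above (the default value 5 follows the setting reported in the results), while the function and argument names are assumptions.

```python
import numpy as np

def weight_map(S, class_freqs, w0=5.0):
    """Per-pixel weights of Eq. (2): median frequency balancing plus an extra
    weight w0 on boundary pixels (non-zero gradient of the label map S)."""
    f = np.asarray(class_freqs, dtype=float)     # f[l] = frequency of class l
    w = np.median(f) / f[S]                      # median frequency balancing term
    gy, gx = np.gradient(S.astype(float))
    boundary = (np.abs(gx) + np.abs(gy)) > 0     # |grad S(x)| > 0
    return w + w0 * boundary                     # emphasize anatomical contours
```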

2.2 Fine-Tuning with Error Corrective Boosting

Since the SD-Net directly predicts the segmentation of the entire 2D slice, each 3D scan only provides a limited number of slices for training. Due to this limited availability of manually labeled brain scans and challenges of unsupervised training, we propose to use large scale auxiliary labels for assisting in training the network. The auxiliary labels are created with FreeSurfer [6]. Although these labels cannot replace extensive manual annotations, they can be automatically computed on a large dataset and be used to train FS-Net, which is essentially an F-CNN mimicking FreeSurfer. To the best of our knowledge, this work is the first application of auxiliary, computer-generated labels for training neural networks for image segmentation. Pre-training provides a strong initialization of the network and we want to use the manually labeled data to improve on brain structures that are poorly represented by the auxiliary labels. To this end, we introduce error corrective boosting (ECB) for fine-tuning, which boosts the learning process for classes with high segmentation inaccuracy. ECB iteratively updates the weights in the logistic loss function in Eq. (1) during fine-tuning. We start the fine-tuning with the standard weights as described in Eq. (2). At epoch t > 1, we iteratively evaluate the accuracy a_l^t of class l on the validation set. The weights are updated for each epoch, following an approach that could be considered as median accuracy balancing, as shown in Eq. (3):

ω^(t+1)(x) = Σ_l I(S(x) == l) · (median(a^t) − m^t) / (a_l^t − m^t)    (3)

with the vector of accuracies a^t = [a^t_1, . . . , a^t_N] and the margin m^t = min(a^t) − q that normalizes the accuracies with respect to the least performing class. The constant q is set to 0.05, i.e. 5%, to avoid numerical instability. Error corrective boosting sets high weights for classes with low accuracy to selectively correct for errors in the auxiliary labels, which is particularly helpful for whole-brain segmentation with a large number of classes.
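The ECB update of Eq. (3) reduces to a small per-class computation on the validation accuracies; a hedged sketch (names are illustrative):

```python
import numpy as np

def ecb_class_weights(val_accuracies, q=0.05):
    """Eq. (3): class weights from validation accuracies a^t, boosting classes
    that are currently segmented poorly."""
    a = np.asarray(val_accuracies, dtype=float)
    m = a.min() - q                      # margin m^t
    return (np.median(a) - m) / (a - m)  # weight for each class l
```

The per-pixel map ω^(t+1)(x) is then obtained by indexing these class weights with the training segmentation, e.g. ecb_class_weights(acc)[S].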


Fig. 3. Boxplot of Dice scores for all structures on the left hemisphere. Comparison of different training strategies of the SD-Net with PICSL. Class probabilities are reported to indicate the severe class imbalance, about 88% are background.

3 Results

Datasets: We pre-train the networks with FreeSurfer labels using 581 MRI-T1 volumes from the IXI dataset1 . These volumes were acquired from 3 different hospitals with different MRI protocols. For the fine-tuning and validation, we use two datasets with manual labels: (i) 30 volumes from the MICCAI Multi-Atlas Labeling challenge [13] and (ii) 20 volumes from the MindBoggle dataset [14]. Both datasets are part of OASIS [15]. In the challenge dataset, 15 volumes were used for training, 5 for validation and 10 for testing. In the MindBoggle dataset, 10 volumes were used for training, 5 for validation and 5 for testing. We segment the major 26 cortical and sub-cortical structures on the challenge data and 24 on MindBoggle, as left/right white matter are not annotated. Baselines: We evaluate our two main contributions, the SD-Net architecture for segmentation and the inclusion of auxiliary labels in combination with ECB. We compare SD-Net to two state-of-the-art networks for semantic segmentation, U-net [4] and FCN [1], and also to a variant of SD-Net without Dice loss. For the auxiliary labels, we report results for (i) directly deploying the IXI pretrained network (IXI FS-Net), (ii) training only on the manually annotated data, (iii) normal fine-tuning, and (iv) ECB-based fine-tuning. We use data augmentation with small spatial translations and rotations in all models during training. We also compare to PICSL [16] (winner) and spatial STAPLE [17] (top 5) for the challenge data whose results were available. Results: Table 1 lists the mean Dice scores on the test data of both datasets for all methods. We first compare the different F-CNN architectures, columns in the table. U-net outperforms FCN on all training scenarios, where the accuracy of FCN is particularly poor on the IXI FS-Net. The SD-Net shows the best performance with an average increase of 2% mean Dice score over U-Net, significant with p < 0.01. The SD-Net without the Dice loss in Eq. (1) does not 1

http://brain-development.org/ixi-dataset/.


Table 1. Mean and standard deviation of the Dice scores for the different F-CNN models and training procedures on both datasets.

perform as well as the combined loss. We also retrained SD-Net with only limited manual annotated data with ω0 = {3, 4, 5, 6, 7}, resulting in the respective mean dice scores {0.85, 0.83, 0.85, 0.84, 0.85}. These results show that there is a limited sensitivity to ω0 and we set it to 5 for the remaining experiments. Next, we compare the results for the different training setups, presented as rows in the table. Training on the FreeSurfer segmentations of the IXI data yields the worst performance, as it only includes the auxiliary labels. Importantly, finetuning the FS-Net with the manually labeled data yields a substantial improvement over directly training from the manual labels. This confirms the advantage of initializing the network with auxiliary labels. Moreover, ECB fine-tuning leads to further improvement of the Dice score in comparison to normal fine-tuning. On the challenge dataset, this improvement is statistically significant with p < 0.01. Finally, SD-Net with ECB results in significantly higher Dice scores (p = 0.02) than spatial STAPLE and the same Dice score as PICSL. Figure 3 presents a structure-wise comparison of the different training strategies for the SD-Net together with PICSL. The class probability for each of these structures are also presented to indicate the severe class imbalance problem. There is a consistent increase in Dice scores for all the structures, from training with manually annotated data over normal fine-tuning to ECB. The increase

Fig. 4. Comparison of training the SD-Net with only manual labels, normal fine-tuning and ECB together with the ground truth segmentation. A zoomed view of the white box is presented below, where the hippocampus (blue) is indicated by a red arrow.


is strongest for structures that are improperly segmented like the hippocampus and amygdala, as they are assigned the highest weights in ECB. Figure 4 illustrates the ground-truth segmentation together with results from the variations of training the SD-Net. Zoomed in regions are presented for the hippocampus, to highlight the effect of the fine-tuning. The hippocampus with class probability 0.16% is under-segmented when trained with only limited manual data. The segmentation improves after normal fine-tuning, with the best results for ECB. Segmenting all 2D slices in a 3D volume with SD-Net takes 7 s on the GPU. This is orders of magnitude faster than multi-atlas approaches, e.g., PICSL and STAPLE, that require about 30 h with 2 h per pair-wise registration. SD-Net is also much faster than the 2–3 min reported for the segmentation of eight structures by the patch-based technique in [8].

4 Conclusion

We introduced SD-Net, an F-CNN, encoder-decoder architecture with unpooling that jointly optimizes logistic and Dice loss. We proposed a training strategy with limited labeled data, where we generated auxiliary segmentations from unlabeled data and fine-tuned the pre-trained network with ECB. We demonstrated that (i) SD-Net outperforms U-net and FCN, (ii) using auxiliary labels improves the accuracy and (iii) ECB exploits the manually labeled data better than normal fine-tuning. Our approach achieves state-of-the-art performance for whole-brain segmentation while being orders of magnitude faster. Acknowledgement. This work was supported in part by the Faculty of Medicine at LMU (F¨ oFoLe), the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B), the NVIDIA corporation and DAAD (German Academic Exchange Service). The authors would also like to thank Magdalini Paschali for proof reading and feedback.

References 1. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR 2015, pp. 3431–3440. IEEE (2015) 2. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV 2015, pp. 1520–1528. IEEE (2015) 3. Badrinarayanan, V., Kendall, A., Segnet, C.R.: A deep convolutional encoderdecoder architecture for image segmentation. arXiv preprint: 1511.00561 (2015) 4. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10. 1007/978-3-319-24574-4 28 5. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 3DV, pp. 565–571. IEEE (2016) 6. Fischl, B., Salat, D.H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., Van Der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33(3), 341–55 (2002)


7. Shakeri, M., Tsogkas, S., Ferrante, E., Lippe, S., Kadoury, S., Paragios, N., Kokkinos, I.: Sub-cortical brain structure segmentation using F-CNNs. In: ISBI 2016, pp. 269–272 (2016) 8. Dolz, J., Desrosiers, C., Ayed, I.B.: 3D fully convolutional networks for subcortical segmentation in MRI: a large-scale study (2016). arXiv preprint:1612.03925 9. Brebisson, A., Montana, G.: Deep neural networks for anatomical brain segmentation. In: CVPR Workshops, pp. 20–28 (2015) 10. Wachinger, C., Reuter, M., Klein, T.: DeepNAT: deep convolutional neural network for segmenting neuroanatomy. Neuroimage (2017) 11. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. TMI 35(5), 1285– 98 (2016) 12. Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Hurst, R.T., Kendall, C.B., Gotway, M.B., Liang, J.: Convolutional neural networks for medical image analysis: full training or fine tuning? TMI 35(5), 1299–1312 (2016) 13. Landman, B, Warfield, S.: MICCAI workshop on multiatlas labeling. In: MICCAI Grand Challenge (2012) 14. Klein, A., Tourville, J.: 101 labeled brain images and a consistent human cortical labeling protocol. Front. Neurosci. 6, 171 (2012) 15. Marcus, D.S., Fotenos, A.F., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies: longitudinal MRI data in nondemented and demented older adults. J. Cog. Neuroscience. 12, 2677–84 (2010) 16. Wang, H., Yushkevich, P.: Multi-atlas segmentation with joint label fusion and corrective learning-an open source implementation. Front. Neuroinform. 7, 27 (2013) 17. Asman, A.J., Landman, B.A.: Formulating spatially varying performance in the statistical fusion framework. TMI. 6, 1326–36 (2012)

Direct Detection of Pixel-Level Myocardial Infarction Areas via a Deep-Learning Algorithm

Chenchu Xu1, Lei Xu3, Zhifan Gao2, Shen Zhao2, Heye Zhang2(B), Yanping Zhang1(B), Xiuquan Du1, Shu Zhao1, Dhanjoo Ghista4, and Shuo Li5

1 Anhui University, Hefei, China
  [email protected]
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  [email protected]
3 Beijing AnZhen Hospital, Beijing, China
4 University 2020 Foundation, Framingham, MA, USA
5 University of Western Ontario, London, ON, Canada

C. Xu and L. Xu contributed equally to this work.

Abstract. Accurate detection of the myocardial infarction (MI) area is crucial for early diagnosis planning and follow-up management. In this study, we propose an end-to-end deep-learning algorithm framework (OF-RNN) to accurately detect the MI area at the pixel level. Our OFRNN consists of three different function layers: the heart localization layers, which can accurately and automatically crop the region-of-interest (ROI) sequences, including the left ventricle, using the whole cardiac magnetic resonance image sequences; the motion statistical layers, which are used to build a time-series architecture to capture two types of motion features (at the pixel-level) by integrating the local motion features generated by long short-term memory-recurrent neural networks and the global motion features generated by deep optical flows from the whole ROI sequence, which can effectively characterize myocardial physiologic function; and the fully connected discriminate layers, which use stacked auto-encoders to further learn these features, and they use a softmax classifier to build the correspondences from the motion features to the tissue identities (infarction or not) for each pixel. Through the seamless connection of each layer, our OF-RNN can obtain the area, position, and shape of the MI for each patient. Our proposed framework yielded an overall classification accuracy of 94.35% at the pixel level, from 114 clinical subjects. These results indicate the potential of our proposed method in aiding standardized MI assessments.

1 Introduction

There is a great demand for detecting the accurate location of a myocardial ischemia area for better myocardial infarction (MI) diagnosis. The use of magnetic resonance contrast agents based on gadolinium chelates for visualizing the
position and size of scarred myocardium has become 'the gold standard' for evaluating the area of the MI [1]. However, the contrast agents are not only expensive but also nephrotoxic and neurotoxic and, hence, could damage the health of humans [2]. In routine clinical procedures, and especially for early screening and postoperative assessment, visual assessment is one popular method, but it is subject to high inter-observer variability and is both subjective and non-reproducible. Furthermore, the estimation of the time course of the wall motion remains difficult even for experienced radiologists. Therefore, computer-aided detection systems have been attempted in recent years to automatically analyze the left ventricle (LV) myocardial function quantitatively. This computerized vision can serve to simulate a trained physician's intuitive clinical judgment in a medical setting.

Previous MI detection methods have been mainly based on information theoretic measures and Kalman filter approaches [3], Bayesian probability models [4], pattern recognition techniques [5,6], and biomechanical approaches [7]. However, all of these existing methods still fail to directly and accurately identify the position and size of the MI area. More specifically, these methods have not been able to capture sufficient information to establish integrated correspondences between the myocardial motion field and the MI area.

More recently, unsupervised deep learning feature selection techniques have been successfully used to solve many difficult computer vision problems. The general concept behind deep learning is to learn hierarchical feature representations by first inferring simple representations and then progressively building up more complex representations from the previous level. This method has been successfully applied to the recognition and prediction of prostate cancer, Alzheimer's disease, and vertebrae and neural foramina stenosis [8].

In this study, an end-to-end deep-learning framework has been developed for accurate and direct detection of infarction size at the pixel level using cardiac magnetic resonance (CMR) images. Our method's contributions and advantages are as follows: (1) for the first time, we propose an MI area detection framework at the pixel level that can give the physician the explicit position, size and shape of the infarcted areas; (2) a feature extraction architecture is used to establish solid correspondences between the myocardial motion field and the MI area, which can help in understanding the complex cardiac structure and the periodic nature of heart motion; and (3) a unified deep-learning framework can seamlessly fuse different methods and layers to better learn hierarchical feature representations and feature selection. Therefore, our framework has great potential for improving the efficiency of the clinical diagnosis of MI.

2 Methodology

As shown in Fig. 1, there are three function layers inside the OF-RNN. The heart localization layers can automatically detect the ROI, including the LV, and the motion statistical layers can generate motion features that accurately characterize myocardial physiologic and physical function, followed by the fully


Fig. 1. The architecture of OF-RNN: heart localization layers, motion statistical layers, and fully connected discriminate layers.

connected discriminate layers that use stacked auto-encoders and softmax classifiers to detect the MI area from the motion features.

Heart localization layers. One Fast R-CNN [9] is used here for the automatic detection of a region of interest (ROI) around the LV, to reduce the computational complexity and improve the accuracy. In this study, the first process of the heart localization layers is to generate category-independent region proposals. Afterward, a typical convolutional neural network model is used to produce a convolutional feature map from the input images. Then, for each object proposal, an ROI pooling layer extracts a fixed-length feature vector from the feature map. The ROI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W, where H and W are layer hyper-parameters that are independent of any particular ROI. Finally, each feature vector is fed into a sequence of fully connected layers that branch into two sibling output layers, thereby generating a 64 × 64 bounding box for cropping the ROI image sequences, including the LV, from the CMR sequences.

Motion statistical layers. The motion statistical feature layers are used to extract time-series image motion features through the ROI image sequences to understand the periodic nature of the heart motion. The local motion features are generated by the LSTM-RNN, and the global motion features are generated by deep optical flow. Thus, in the first step, we attempt to compute the local motion

Direct Detection of Pixel-Level Myocardial Infarction Areas

243

features that are extracted from the ROI image sequence. For each ROI sequence, the input is the image sequence I = (I_1, I_2, ..., I_J), J = 25, of size 64 × 64, and I(p) denotes the value at pixel coordinate p = [x, y] of image I. A window of size 11 × 11 is constructed over the overlapping I[x, y] neighborhoods, whose intensity values represent the feature of each p on image I_J. This results in the J image-sequence features being unrolled as a vector P_l(p) ∈ R^{11·11·J} for each pixel as input. Then, four RNN layers [10] with LSTM cells are used to learn the input. Given the input layer x_t at time t, where each time step corresponds to one frame (t = J), so that x_t = P_l(p) at frame J, and given the hidden state h_{t-1} of the previous time step, the hidden and output layers for the current time step are computed as follows:

h_t = \phi(W_{xh}[h_{t-1}, x_t]), \quad p_t = \mathrm{softmax}(W_{hy} h_t), \quad \hat{y}_t = \arg\max p_t \quad (1)

where x_t, h_t and y_t are the input, hidden, and output layers at each time step t, respectively; W_{xh} and W_{hy} are the matrices of weights between the input and hidden layers and between the hidden and output layers, respectively; and \phi denotes the activation function. The LSTM cell [10] is designed to mitigate the vanishing gradient. In addition to the hidden layer vector h_t, the LSTMs maintain a memory vector c_t, an input gate i_t, a forget gate f_t, and an output gate o_t. These gates are computed as follows:

\begin{bmatrix} i_t \\ f_t \\ o_t \\ \tilde{c}_t \end{bmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W_t [D(x_t), h_{t-1}] \quad (2)

where W_t is the weight matrix and D is the dropout operator. The final memory cell and the final hidden state are given by

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t) \quad (3)

In the second step, we attempt to compute the global motion feature of the image sequence based on an optical flow algorithm [11] with a deep architecture. An optical flow describes a dense vector field, where a displacement vector is assigned to each pixel, pointing to where that pixel can be found in another image. Considering an adjacent pair of frames, a reference image I = I_{J-1} and a target image I' = I_J, the goal is to estimate the flow w = (u, v) that contains both horizontal and vertical components. We assume that the images are already smoothed using a Gaussian filter with a standard deviation of σ. The energy to be optimized is the weighted sum of a data term E_D, a smoothness term E_S, and a matching term E_M:

E(w) = \int_{\Omega} \left( E_D + \alpha E_S + \beta E_M \right) dx \quad (4)

Next, a procedure is developed to produce a pyramid of response maps, and we start from the optical flow constraint, assuming constant brightness. A basic way to build the data term and the smoothness term is the following:

E_D = \delta \Psi\left( \sum_{i=1}^{c} w^{\top} \bar{J}_0^i \, w \right) + \gamma \Psi\left( \sum_{i=1}^{c} w^{\top} \bar{J}_{xy}^i \, w \right) \quad (5)

E_S = \Psi\left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \quad (6)

where \Psi is a robust penalizer, \bar{J}_{xy}^i is the tensor for channel i, and \delta and \gamma are the two balancing weights. The matching term encourages the flow estimation to be similar to a precomputed vector field w', and a weighting term c(x) has been added:

E_M = c \, \Psi\left( \|w - w'\|^2 \right) \quad (7)

For any pixel p of I', C_{n,p}(p') is a measure of similarity between I_{n,p} and I'_{n,p'}, where I_{n,p} is a patch of size N × N (N ∈ {4, 8, 16}) from the first image centered at p. We start with the bottom-level correlation maps, which are iteratively aggregated to obtain the upper levels. This aggregation consists of max-pooling, sub-sampling, computing a shifted average and non-linear rectification. In the end, for each image I_{J-1}, a full motion field w_{J-1} = (u_{J-1}, v_{J-1}) is computed with reference to the next frame I_J.
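As a concrete illustration of the LSTM-RNN branch described above, the following Python sketch (not the authors' implementation) builds a small Keras model that consumes, for a single pixel, the J = 25 unrolled 11 × 11 patch vectors; the RMSProp optimizer matches the implementation details reported later, while the layer widths and names are assumptions.

# Minimal sketch of the per-pixel LSTM-RNN local-motion branch.
# Hidden-layer widths are illustrative assumptions, not the authors' exact setup.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

J, PATCH = 25, 11 * 11              # frames per cardiac cycle, unrolled 11x11 patch

model = Sequential()
# Four stacked LSTM layers over the J-step patch sequence, as described in the text.
model.add(LSTM(128, return_sequences=True, input_shape=(J, PATCH)))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64))                              # final hidden state h_t
model.add(Dense(2, activation='softmax'))        # Eq. (1): p_t = softmax(W_hy h_t)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# One pixel's input: J patch vectors stacked into a (1, J, 121) tensor.
x = np.random.rand(1, J, PATCH)
p_t = model.predict(x)                           # class posteriors for this pixel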

Fully connected discriminate layers. The fully connected discriminate layers are used to detect the MI area accurately from the local motion features and the global motion features. First, for each w_j, we use image patches, say 3 × 3, extracting the feature starting from a point p in the first frame and tracing p in the following frames. We thereby obtain P_g(p), which contains a 3 × 3 vector for the displacement and a 3 × 3 vector for the orientation of p for each frame. Second, we conduct a simple concatenation between the local image feature P_l(p) from the LSTM-RNN and the motion trajectory feature P_g(p) from the optical flow, to establish a whole feature vector P(p). Finally, an auto-encoder with three stacked layers is used to learn P(p), followed by a softmax layer, which is used to determine whether p belongs to the MI area or not.
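To make the discriminate layers concrete, here is a minimal sketch, under assumed feature dimensions, of concatenating P_l(p) and P_g(p) and classifying each pixel; plain dense layers stand in for the stacked auto-encoder, which in the paper would first be pre-trained in an unsupervised fashion.

# Sketch of the fully connected discriminate layers: concatenate the local LSTM
# feature P_l(p) and the optical-flow trajectory feature P_g(p), then classify.
# All widths are assumptions; the dense stack is a stand-in for the SAE.
from keras.models import Model
from keras.layers import Input, Dense, concatenate

p_local = Input(shape=(128,), name='P_l')        # assumed LSTM feature size
p_global = Input(shape=(24 * 18,), name='P_g')   # 24 flows x (3x3 displacement + 3x3 orientation)

x = concatenate([p_local, p_global])             # whole feature vector P(p)
for width in (256, 128, 64):                     # three stacked encoding layers
    x = Dense(width, activation='relu')(x)
out = Dense(2, activation='softmax')(x)          # infarcted vs. normal pixel

clf = Model([p_local, p_global], out)
clf.compile(optimizer='rmsprop', loss='categorical_crossentropy')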

3 Experimental Results

Fig. 2. (a, b) Our predicted MI area (the green zone) can be a good fit for the ground truth (the yellow arrow); (c) our predicted MI area (the green zone) can be a good fit for the ground truth (the yellow dotted line).

Data acquisition. We collected the short-axis image dataset and the corresponding
enhanced images using gadolinium agents from 114 subjects in this study on a 3T CMR scanner. Each subject's short-axis image dataset consisted of 25 2D images (a cardiac cycle), for a total of 43 apical, 37 mid-cavity and 34 basal short-axis image datasets for the 114 subjects. The temporal resolution is 45.1 ± 8.8 ms, and the short-axis planes are 8-mm thick. The delayed enhancement images were obtained approximately 20 min after intravenous injection of 0.2 mmol/kg gadolinium diethyltriaminepentaacetic acid. A cardiologist (with more than 10 years of experience) analyzed the delayed enhancement images and manually traced the MI area by the pattern of late gadolinium enhancement as the ground truth.

Implementation details. We implemented all of the code using Python and MATLAB R2015b on a Linux (Kylin 14.04) desktop computer with an Intel Xeon CPU E5-2650 and 32 GB DDR2 memory. The graphics card is an NVIDIA Quadro K600, and the deep learning libraries were implemented with Keras (Theano) with the RMSProp solver. The training time was 373 min, and the testing time was 191 s for each subject (25 images).

Performance evaluation criteria. We used three types of criteria to measure the performance of the classifier: (1) the receiver operating characteristic (ROC) curve; (2) the precision-recall (PR) curve; (3) for pixel-level accuracy, we assessed the classifier performance with a 10-fold cross-validation test, and for segment-level accuracy, we used 2/3 of the data for training and the remaining data for testing.

Automatic localization of the LV. The experimental results show that OF-RNN can obtain good localization of the LV. We achieve an overall classification accuracy of 96.49%, with a sensitivity of 94.39% and a specificity of 98.67%, in locating the LV in the heart localization layers. We used an architecture similar to the Zeiler and Fergus model to pre-train the network. Using selective search's quality mode, we sweep over 2k proposals per image. Our results for the ROI localization bounding box from 2.85k CMR images were compared to the ground truth marked by the expert cardiologist. The ROC and PR curves are shown in Fig. 3(a, b).

MI area detection. Our approach can also accurately detect the MI area, as shown in Fig. 2. The overall pixel classification accuracy is 94.35%, with a sensitivity of 91.23% and a specificity of 98.42%. We used the softmax classifier by fine-tuning the motion statistical layers to assess each pixel (as normal/abnormal). We also compared our results to 16 regional myocardial segments (depicted as normal/abnormal) by following the American Heart Association standards. The accuracy performance for the apical slices was an average of 99.2%; for the mid-cavity slices, it was an average of 98.1%; and for the basal slices, an average of 97.9%. The ROC and PR curves of the motion statistical layers are shown in Fig. 3(a, b).

Local and global motion statistical features. A combination of local and global motion statistical features has the potential to improve the results because the features influence one another through a shared representation. To evaluate the effect of motion features, we use local or global motion statistical features


Fig. 3. (a, b) ROCs and PRs show that our results have good classification performance. (c, d) ROCs and PRs for local motion features and global motion features. (e) The accuracy and time for various patch sizes.

separately along with both motion features in our framework. Table 1 and Fig. 3(c, d) show that the results that combine motion statistical features in our framework have better accuracy, sensitivity, and specificity in comparison to those that use only the local or global motion features, in another 10-fold cross-validation test.

Table 1. Combined motion statistical features effectively improve the overall accuracy of our method

Local motion feature     √               √
Global motion feature            √       √
Accuracy               92.6%   87.3%   94.3%
Sensitivity            86.5%   79.4%   91.2%
Specificity            97.9%   96.2%   98.4%

Size of patch. We use an N × N patch to extract the local motion features from the whole image sequence. Because the displacements of the LV wall between two consecutive images are small (approximately 1 or 2 pixels/frame), it is necessary to adjust the size of the patch to capture sufficient local motion information. Figure 3(e) shows the accuracy and computational time of our framework, using


Fig. 4. A pair of frames at the beginning of systole (a) and at the end of systole (b) were first displayed, followed by the visual results of our deep optical flow (c) and Horn and Schunck (HS) optical flow (d) at pixel precision.

from 3 × 3 to 17 × 17 patches in one 10-fold cross-validation test. We find that the 11 × 11 patch size in our framework can obtain better accuracy in a reasonable amount of time.

Performance of the LSTM-RNN. To evaluate the performance of the LSTM-RNN, we replaced the LSTM-RNN with SVMrbf, SAE-3, DBN-3, CNN and RNN in our deep learning framework, and we ran these different frameworks over 114 subjects using a 10-fold cross-validation test. Table 2 reports the classification performance of the five other learning strategies: the RNN, Deep Belief Networks (DBN), Convolutional Neural Network (CNN), SAE and Support Vector Machine with RBF kernel (SVMrbf). The LSTM-RNN shows the best accuracy and precision among all of the methods.

Table 2. LSTM-RNN works best in comparison with other models

            SVMrbf   SAE-3   DBN-3   CNN     RNN     LSTM-RNN
Accuracy    80.9%    83.5%   84.9%   83.7%   88.4%   94.3%
Precision   74.2%    75.5%   75.1%   76.5%   84.8%   91.3%

Performance of the optical flow. The purpose of the optical flow is to capture the global motion features. To evaluate the performance of our optical flow algorithm with a deep architecture, we used the average angular error (AAE) to compare our deep optical flow with other optical flow approaches. The other optical flow methods, including the Horn and Schunck method, the pyramid Horn and Schunck method, the intensity-based optical flow method, and the phase-based optical flow method, can be found in [12]. The comparison results are shown in Table 3, and visual examples are illustrated in Fig. 4.
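For reference, a small sketch of the AAE computation used in Table 3; this follows the standard optical-flow benchmark definition (angle between the homogeneous vectors (u, v, 1)), which we assume is the definition intended here.

# Sketch of the average angular error (AAE) between an estimated flow (u, v)
# and a ground-truth flow (u_gt, v_gt).
import numpy as np

def average_angular_error(u, v, u_gt, v_gt):
    """Mean angle (degrees) between the 3D vectors (u, v, 1) and (u_gt, v_gt, 1)."""
    num = u * u_gt + v * v_gt + 1.0
    den = np.sqrt(u**2 + v**2 + 1.0) * np.sqrt(u_gt**2 + v_gt**2 + 1.0)
    ang = np.arccos(np.clip(num / den, -1.0, 1.0))
    return np.degrees(ang).mean()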


Table 3. Deep optical flow (OF) can work better in comparison to other optical flow techniques in capturing global motion features

             Horn and Schunck (HS)   Pyramid HS    Deep OF       Intensity-based OF   Phase-based OF
OF density   100%                    100%          100%          55%                  13%
AAE          12.6° ± 9.2°            7.4° ± 3.4°   5.7° ± 2.3°   5.7° ± 4.1°          5.5° ± 3.9°

4 Conclusions

We have, for the first time, developed and presented an end-to-end deep-learning framework for the detection of infarction areas at the pixel level from CMR sequences. Our experimental analysis was conducted on 114 subjects, and it yielded an overall classification accuracy of 94.35% at the pixel level. All of these results demonstrate that our proposed method can aid the clinical diagnosis and assessment of MI.

Acknowledgment. This work was supported in part by the Shenzhen Research and Innovation Funding (JCYJ20151030151431727, SGLH20150213143207911), the National Key Research and Development Program of China (2016YFC1300302, 2016YFC1301700), the CAS President's International Fellowship for Visiting Scientists (2017VTA0011), the National Natural Science Foundation of China (No. 61673020), the Provincial Natural Science Research Program of Higher Education Institutions of Anhui province (KJ2016A016) and the Anhui Provincial Natural Science Foundation (1708085QF143).

References

1. Barkhausen, J., Ebert, W., Weinmann, H.J.: Imaging of myocardial infarction: comparison of Magnevist and gadophrin-3 in rabbits. J. Am. Coll. Cardiol. 39(8), 1392–1398 (2002)
2. Wagner, A., Mahrholdt, H., Holly, T.: Contrast enhanced MRI detects subendocardial myocardial infarcts that are missed by routine SPECT perfusion imaging. Lancet 361, 374–379 (2003)
3. Shi, P., Liu, H.: Stochastic finite element framework for simultaneous estimation of cardiac kinematic functions and material parameters. Med. Image Anal. 7(4), 445–464 (2003)
4. Wang, Z., Salah, M.B., Gu, B., Islam, A., Goela, A., Li, S.: Direct estimation of cardiac biventricular volumes with an adapted Bayesian formulation. IEEE Trans. Biomed. Eng. 61(4), 1251–1260 (2014)
5. Afshin, M., Ben Ayed, I., Punithakumar, K., Law, M.W.K., Islam, A., Goela, A., Ross, I., Peters, T., Li, S.: Assessment of regional myocardial function via statistical features in MR images. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011. LNCS, vol. 6893, pp. 107–114. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23626-6_14
6. Zhen, X., Islam, A., Bhaduri, M., Chan, I., Li, S.: Direct and simultaneous four-chamber volume estimation by multi-output regression. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 669–676. Springer, Cham (2015). doi:10.1007/978-3-319-24553-9_82
7. Wong, K.C.L., Tee, M., Chen, M., Bluemke, D.A., Summers, R.M., Yao, J.: Computer-aided infarction identification from cardiac CT images: a biomechanical approach with SVM. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9350, pp. 144–151. Springer, Cham (2015). doi:10.1007/978-3-319-24571-3_18
8. Cai, Y.: Multi-modal vertebrae recognition using transformed deep convolution network. Comput. Med. Imaging Graph. 51, 11–19 (2016)
9. Girshick, R.: Fast R-CNN. In: IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
10. Graves, A.: Supervised sequence labelling. In: Graves, A. (ed.) Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, vol. 385. Springer, Heidelberg (2012)
11. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: DeepMatching: hierarchical deformable dense matching. Int. J. Comput. Vis. 120(3), 300–323 (2016)
12. Fortun, D., Bouthemy, P., Kervrann, C.: Optical flow modeling and computation: a survey. Comput. Vis. Image Underst. 134, 1–21 (2015)

Skin Disease Recognition Using Deep Saliency Features and Multimodal Learning of Dermoscopy and Clinical Images

Zongyuan Ge1(B), Sergey Demyanov1, Rajib Chakravorty1, Adrian Bowling2, and Rahil Garnavi1

1 IBM Research, Melbourne, VIC, Australia
  [email protected]
2 MoleMap NZ Ltd., Auckland, New Zealand

Abstract. Skin cancer is the most common cancer worldwide; melanoma, the most fatal type, accounts for more than 10,000 deaths annually in Australia and the United States. The 5-year survival rate for melanoma can be increased to over 90% if it is detected in its early stage. However, the intrinsic visual similarity across various skin conditions makes the diagnosis challenging both for clinicians and for automated classification methods. Many automated skin cancer diagnostic systems have been proposed in the literature, all of which consider solely dermoscopy images in their analysis. In reality, however, clinicians consider two modalities of imaging: an initial screening using clinical photography images to capture a macro view of the mole, followed by dermoscopy imaging, which visualizes morphological structures within the skin lesion. Evidence shows that these two modalities provide complementary visual features that can empower the decision-making process. In this work, we propose a novel deep convolutional neural network (DCNN) architecture along with a saliency feature descriptor to capture discriminative features of the two modalities for skin lesion classification. The proposed DCNN accepts a pair of images, the clinical and dermoscopic views of a single lesion, and is capable of learning single-modality and cross-modality representations simultaneously. Using one of the largest collected skin lesion datasets, we demonstrate that the proposed multi-modality method significantly outperforms single-modality methods on three tasks: differentiation between 15 various skin diseases, distinguishing cancerous (3 cancer types including melanoma) from non-cancerous moles, and detecting melanoma from benign cases.

1 Introduction

Over 5 million skin cancer cases are diagnosed annually in America and Australia [13]. In Australia, the mean cost per patient for classification and staging of suspicious lesions (specialized surveillance and stage III in year 2) is over $3,000 [14]. Also, the availability of fully trained dermatologists worldwide is highly limited [4]. The shortage of experts and high costs make computer-aided diagnosis (CAD) necessary as a cost-effective and data-driven skin disease diagnosis tool to fight against the increasing mortality of skin cancers.


A skin lesion is visually examined in two steps: clinical screening followed by dermoscopic analysis. Dermoscopy images are highly standardized images obtained through a high-resolution magnifying imaging device in contact with the skin. Clinical images, on the other hand, are taken by a standard digital camera and present more variations in view, angle and lighting. The majority of automated skin disease classification methods [7] could exhibit limited generalization capability when both dermoscopic and clinical modalities are being used, because their domain-specific hand-crafted features are designed specifically for dermoscopy images [1]. Self-feature-learning schemes such as deep convolutional neural networks (DCNNs) trained on very large datasets [11] have shown impressive performance in visual tasks such as object recognition and detection [12]. More importantly, those learned networks can be easily adapted to other domain tasks such as medical image segmentation [2] and skin cancer feature detection [5], all of which only cater for the single image modality of dermoscopy.

To take advantage of the multi-modality information embedded within dermoscopy and clinical images of the skin lesion, we develop a jointly-learned multi-modality DCNN along with a saliency-based feature descriptor to address the challenging problem of skin disease classification. The contributions of this paper are the following: (i) We propose and analyze several strategies to optimize DCNN parameter learning over two image modalities. (ii) We propose a DCNN-based feature descriptor, Class Activation Mapping-Bilinear Pooling (CAM-BP), which is able to locate saliency areas of skin images. During inference, CAM-BP assists the decision-making process by producing probability maps, which improves the overall performance. (iii) We conduct comprehensive experiments and show the effectiveness of the proposed method on three diagnostic use cases: multi-class skin disease classification (across 15 disease categories), skin cancer recognition and melanoma detection.

2 Methods

In this work we explore the advantages of connecting two image modalities through a joint-learning DCNN framework, and propose a novel saliency feature descriptor for the multi-modality skin disease classification task. In Sect. 2.1, we first introduce two schemes for multi-modality learning (Sole-Net and Share-Net), then discuss our proposed framework Triple-Net. In Sect. 2.2, we introduce CAM-BP and explain how and why saliency information is important for discriminative feature pooling.

2.1 Cross-Modality DCNN Learning

Sole-Net: We first explore Sole-Net, which is a fairly intuitive DCNN method for combining information from two modalities. Each DCNN's parameters are learnt separately from one modality, and the final decision is obtained by averaging the outputs of the two trained models. The architecture of Sole-Net is illustrated in Fig. 1(a). We first denote by (xC, xD) the paired training set, where xC


Fig. 1. Comparison of several DCNN frameworks which accept multimodal inputs. (a) Sole-Net: features from the two modalities are learnt in a dissociated manner with two separate loss functions (network blocks in two different colors). (b) Triple-Net: to improve the cross-modality modelling ability, a new sub-network is trained on concatenated feature maps from middle layers.

and xD are the clinical and dermoscopy images from one lesion. Each of the two DCNNs C_C and C_D contains a single-modality learning sub-network with different parameters (shown in different colors, blue and yellow). The cost function of each modality sub-network can be computed as¹:

cost_C = \|p_C(x_C) - y_{C/D}\|_2^2 \quad (1)

cost_D = \|p_D(x_D) - y_{C/D}\|_2^2 \quad (2)

where cost_C represents the cost for clinical image inputs and cost_D denotes the cost for dermoscopy image inputs. p_C(x_C) and p_D(x_D) (p1 and p2 in the figure) are the outputs of each network. y_{C/D} is the shared one-hot disease label vector of the observed lesion.

Share-Net: We then explore the Share-Net, whose architecture is similar to Sole-Net except that C_C and C_D share identical parameters. The gross cost function of Share-Net can be defined as:

cost_S = \|p_S(x_C) - y_{C/D}\|_2^2 + \|p_S(x_D) - y_{C/D}\|_2^2 \quad (3)
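The difference between Sole-Net and Share-Net amounts to whether the two sub-networks share parameters. The toy Keras sketch below illustrates this; the tiny encoder is a hypothetical stand-in for the VGG-16 backbone actually used in the paper.

# Sketch contrasting Sole-Net and Share-Net.  Sole-Net trains two encoders with
# separate parameters; Share-Net applies one encoder instance (shared weights)
# to both modalities, so its weights receive gradients from both terms of Eq. (3).
from keras.models import Model, Sequential
from keras.layers import Input, Conv2D, GlobalAveragePooling2D, Dense

def small_encoder(name):
    # Tiny stand-in for the VGG-16-based sub-network; widths are assumptions.
    return Sequential([
        Conv2D(32, 3, activation='relu', input_shape=(224, 224, 3)),
        GlobalAveragePooling2D(),
        Dense(15, activation='softmax'),   # 15 skin conditions
    ], name=name)

x_clin = Input(shape=(224, 224, 3), name='clinical')
x_derm = Input(shape=(224, 224, 3), name='dermoscopy')

# Sole-Net: two independent sub-networks, two losses.
sole = Model([x_clin, x_derm],
             [small_encoder('C_C')(x_clin), small_encoder('C_D')(x_derm)])

# Share-Net: the same sub-network is applied to both modalities.
shared = small_encoder('C_S')
share = Model([x_clin, x_derm], [shared(x_clin), shared(x_derm)])

sole.compile('sgd', loss='mse')    # mean-square loss on one-hot labels, as in Eq. (1)-(3)
share.compile('sgd', loss='mse')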

During training, the Share-Net allows its parameters across the two sub-networks to be updated in a mirrored manner. The advantage of this architecture is that, with inputs of the same semantic meaning (i.e. both modalities belonging to the same lesion), sharing weights across sub-networks means fewer parameters to train, which in turn means that less data is required and the model is less prone to overfitting [3].

Triple-Net: Sole-Net is capable of capturing single-modality information. However, it lacks the ability to generalize to other modalities (see Sect. 3.1). Share-Net can obtain cross-modality knowledge to some extent, but it is limited by its capacity to learn discriminative cross-modality features because of the weight-sharing scheme. To exploit the merits of using cross-modality and single-modality information simultaneously, we propose Triple-Net. The proposed framework takes advantage of Sole-Net and Share-Net, but also contains an extra sub-network and loss to improve discriminative cross-modality feature learning. As illustrated in Fig. 1, our proposed DCNN framework consists of three sub-networks. The first two sub-networks are configured the same as the Share-Net. The third sub-network C_T takes in two corresponding convolutional feature maps R_{Cl} and R_{Dl} from a stage output (the l-th layer) of the two sub-networks C_{Cl} and C_{Dl}. The Triple-Net has multiple cost functions, and the cross-modality cost can be computed as:

cost_T = \|p_T(p_C^l(x_C), p_D^l(x_D)) - y_{C/D}\|_2^2 \quad (4)

where p_{C/D}^l denotes the l-th layer output of the network and p_T indicates the cross-representation sub-network output. With the costs computed from Eqs. (3) and (4), the overall Triple-Net cost is calculated as:

cost_{overall} = cost_S + \alpha \cdot cost_T \quad (5)

where α is a hyper-parameter that sets up the trade-off between the single-modality and cross-modality learning rates. During the prediction process, both single-modality and cross-modality outputs are used for decision making. The single-modality sub-networks take p_C(x_C) + p_D(x_D) as the indicator for class prediction, while the cross-modality sub-network takes p_T(p_C^l(x_C), p_D^l(x_D)) as the evidence for the decision. Triple-Net employs the combination of the two.

¹ In the experiment, we observed a minor overall performance difference between the mean-square loss and the cross-entropy loss.
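A minimal sketch of the overall Triple-Net objective in Eq. (5), written with plain NumPy and the mean-square loss mentioned in the footnote; the arrays are toy placeholders rather than real network outputs.

# Sketch of the Triple-Net cost in Eq. (5): shared single-modality cost plus an
# alpha-weighted cross-modality cost from the third sub-network.
import numpy as np

alpha = 1.5  # trade-off value used in the paper (Sect. 3)

def mse(p, y):
    return np.sum((p - y) ** 2)

def triple_net_cost(p_s_clin, p_s_derm, p_t_cross, y):
    """p_s_*: shared sub-network outputs for each modality;
    p_t_cross: cross-modality sub-network output; y: one-hot label."""
    cost_s = mse(p_s_clin, y) + mse(p_s_derm, y)   # Eq. (3)
    cost_t = mse(p_t_cross, y)                     # Eq. (4)
    return cost_s + alpha * cost_t                 # Eq. (5)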

2.2 Saliency Feature Learning

To take advantage of the fine-grained information contained in the appearance of skin lesions, feature pooling methods such as Bilinear Pooling (BP) [10] applied to DCNNs are good candidates to capture fine-grained details within the image [6]. In short, BP performs a pair-wise outer product between two sub-feature maps from two DCNNs to generate distinctive representations (more details in [6,10]). However, the major disadvantage of BP is that grid-based local points are equally weighted (see Fig. 2), which leads to an inability to capture saliency such as the lesion area of skin images. To deal with this issue, we propose to pool BP features with spatial weights dependent on a saliency map. A saliency map can be interpreted as the area that is most likely to belong to the foreground and to contain crucial information of the image. Class activation mapping (CAM) is a technique to generate class activation maps using global average pooling [15]. Each labeled category gets a class-based activation map which indicates the discriminative regions used by the CNN to identify that class. CAM provides evidence which can be used to measure the probability of being a foreground object. In our proposed CAM-BP, we apply CAM as a saliency map to weight BP features. An illustration of CAM-BP is shown in Fig. 2. It can be formulated as:

\frac{\sum_k w_k^c f_k(i, j)}{Z} \odot \mathrm{vec}\left( f(i, j) f(i, j)^{T} \right) \quad (6)


Fig. 2. Proposed saliency-based CAM-BP method: CAM activation map and BP are extracted separately. Then, the output of BP is spatially-weighted based on CAM to generate CAM-BP representation.

f_k(i, j), with f(i, j) ∈ R^d, denotes the activation of feature map k of one of the convolutional layers at location (i, j), and w_k^c indicates the importance of activation unit k at spatial location (i, j) in driving the final decision for class c. Z is a normalization term so that the weights sum up to 1. The left side of the element-wise product in Eq. (6) indicates how CAM is calculated, and the right side denotes BP. vec(·) is the vectorization operation applied to the outer product, thus vec(f(i, j) f(i, j)^T) ∈ R^{d²}. Average sum pooling is then calculated to produce the final feature representation.
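The following NumPy sketch illustrates Eq. (6) on a toy feature map: the bilinear descriptor at each location is weighted by the normalized CAM response and sum-pooled; the feature-map size and channel count are assumptions.

# Sketch of CAM-BP on a toy feature map (random stand-ins for real activations).
import numpy as np

H, W, d = 14, 14, 64                       # assumed conv feature map size
f = np.random.rand(H, W, d)                # activations f(i, j) in R^d
w_c = np.random.rand(d)                    # CAM weights for the predicted class c

cam = np.einsum('ijk,k->ij', f, w_c)       # sum_k w_k^c f_k(i, j)
cam /= cam.sum()                           # normalization Z: spatial weights sum to 1

# Bilinear pooling at each location: vec(f(i,j) f(i,j)^T) in R^{d^2},
# weighted by the CAM value and sum-pooled over all locations.
desc = np.zeros(d * d)
for i in range(H):
    for j in range(W):
        desc += cam[i, j] * np.outer(f[i, j], f[i, j]).ravel()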

3 Experiments

Dataset: The dataset used in this work is provided by MoleMap². The images are annotated by expert dermatologists with disease labels. To validate the effectiveness of our methods, we select a subset of 13,292 lesions which contains at least one image from each modality. We then randomly acquire two images from each lesion covering both modalities to prepare the dataset, resulting in 26,584 images from 15 skin conditions: 12 benign categories³ and 3 types of skin cancer including melanoma, basal cell carcinoma and squamous cell carcinoma. We randomly partition the dataset in the ratio 7:3 for training and testing.

Network and training: We use the VGG-16 CNN architecture [12], pre-trained to 92.6% top-five accuracy on the 2012 ImageNet Challenge, as the base model for our evaluated frameworks. The extra sub-network in Triple-Net takes network blocks starting from the last conv layer of VGG-16 and is trained from scratch with batch normalization. We then use fine-tuning to optimize the parameters of the DCNNs given the amount of available training data. All layers of the network are fine-tuned with a learning rate of 0.001 and a decay factor of 0.95 every epoch.

² http://molemap.co.nz.
³ Actinic Keratosis, Blue Naevus, Bowen's Disease, Compound Naevus, Dermal Naevus, Dermatofibroma, Hemangioma, Junctional Naevus, Keratotic Lesion, Seborrheic Keratosis, Sebaceous Hyperplasia and Solar Lentigo.

Skin Disease Recognition Using Deep Saliency Features

255

Stochastic gradient descent (SGD) with a momentum of 0.9 and a decay of 5e-5 is used to train the network. During training, images are augmented with random mirroring. α in Eq. (5) is fixed to 1.5 to ensure a relatively high update rate for the randomly initialized parameters of the extra sub-network. Following the training process in [15], GoogLeNet is used as the base network to generate CAM and is trained individually.
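A sketch of this training configuration in Keras, under the assumption that `model` stands for the fine-tuned VGG-16-based network; note that the 5e-5 weight decay would be realized as an L2 kernel regularizer on each layer, since Keras's SGD `decay` argument is a learning-rate decay rather than a weight decay.

# Sketch of the described training setup: SGD with momentum 0.9, initial learning
# rate 0.001 decayed by a factor of 0.95 every epoch, random mirroring augmentation.
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler
from keras.preprocessing.image import ImageDataGenerator

opt = SGD(lr=0.001, momentum=0.9)
schedule = LearningRateScheduler(lambda epoch: 0.001 * (0.95 ** epoch))
augment = ImageDataGenerator(horizontal_flip=True)   # random mirroring

# Assumed usage (model, x_train and y_train are placeholders):
# model.compile(optimizer=opt, loss='mse')
# model.fit_generator(augment.flow(x_train, y_train, batch_size=32),
#                     steps_per_epoch=len(x_train) // 32,
#                     epochs=50, callbacks=[schedule])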

3.1 Analysis of Cross-Modality Learning

First, we validate the importance of cross-modality learning on the three DCNN variants described in Sect. 2.1 using the 15-class skin disease classification task. The results are reported as overall accuracies. In this task, from the first two blocks of Table 1, we observe that: (1) Share-Net outperforms Sole-Net on both modalities, 54.1% vs 52.2% on clinical images and 55.0% vs 53.1% on dermoscopy images. (2) Cross-modality outputs boost the performance significantly. Compared with single-modality prediction, the cross-modality predictions of Sole-Net and Share-Net result in nearly 16% and 15% improvement, respectively. (3) Triple-Net outperforms Sole-Net and Share-Net, achieving 68.2% accuracy. Some classification samples of our proposed method are illustrated in Fig. 3.

The benefits of cross-modality learning can be further investigated by swapping the modality inputs. Ideally, the performance of a well-regularised DCNN should be robust to modality swapping, as the paired inputs represent the same semantic meaning (same lesion). From the experimental results, we observed that the performance drop is 7% less on Triple-Net compared to Sole-Net, which shows that Triple-Net is more tolerant of modality swapping.

3.2 Results with CAM-BP

To conclusively evaluate the proposed CAM-BP, we apply it to both multimodal approaches, Share-Net and Triple-Net, which reflects the generalization of this feature descriptor to various DCNNs. Figure 3 (bottom row) shows a few image samples demonstrating the effectiveness of CAM-BP in capturing complementary saliency areas from both modalities.


Fig. 3. The bottom row of the figure shows CAM-BP activation maps of two modalities clinical (left patch) vs. dermoscopy (right patch) for four different moles. The upper row shows samples where using only one modality has resulted in misclassification (marked in red block), but when both modalities are used in our proposed system the disease label is picked up correctly.

Table 1. Results on 15-disease classification

Methods               Modality               Accuracy
Sole-Net              Dermoscopic/Clinical   53.1%/52.2%
Share-Net             Dermoscopic/Clinical   55.0%/54.1%
Triple-Net            Dermoscopic/Clinical   60.1%/59.4%
Sole-Net              Cross                  61.2%
Share-Net             Cross                  62.9%
Triple-Net            Cross                  68.2%
Share-Net + CAM-BP    Dermoscopic/Clinical   57.4%/58.1%
Triple-Net + CAM-BP   Dermoscopic/Clinical   61.3%/61.2%
Share-Net + CAM-BP    Cross                  64.6%
Triple-Net + CAM-BP   Cross                  70.0%

Fig. 4. Figure on the left shows our proposed method performance on three different skin disease detection tasks.

This is important in clinical practice because visualizing the activation area provided by CAM-BP makes the model more interpretable. From the last block of Table 1, the improvements across different DCNNs vary, but the overall performance improvement is consistent, reaching 70% accuracy for 15-class skin disease classification.

3.3 Comparative Study and Other Detection Tasks

We have reproduced the results of two other related DCNN-based methods adapted to our image set: the residual network (ResNet), which achieved the state of the art on the ImageNet 2015 challenge [9], and the residual network with bilinear pooling (ResNet-BP) [8], which achieved the best performance on the ISBI 2016 skin classification challenge. Figure 4 (right) shows the comparison results of our proposed method with these previous competitive methods on 15-class skin disease classification using single and cross modalities. Although the pre-trained network (VGG-16) used in our method is smaller than ResNet in terms of number of


layers and parameters, we obtain a 6.7% relative performance gain against ResNet-BP on the 15-class disease classification task using multiple image modalities. Moreover, we have examined the performance of our method on another two use cases, detecting 3 cancer types and, more specifically, recognizing melanoma. In Fig. 4 (left), we observe that by combining two modalities, our proposed Triple-Net CAM-BP achieves impressive results on distinguishing between cancerous and non-cancerous moles with an accuracy of 82.0%, and detecting melanoma from benign lesions with 96.6% accuracy.

4 Conclusion

In this work, we demonstrate the effectiveness of cross-modality learning of DCNNs for skin classification with a method that accepts both dermoscopy and clinical inputs. The key advantage of our method resides in two parts: (i) the use of cross-modality learning, which extracts comprehensive features from sub-networks; (ii) the use of CAM-BP, which helps to locate the saliency area where the most important information can be retrieved, and produces discriminative features for inference.

References

1. Ballerini, L., Fisher, R.B., Aldridge, B., Rees, J.: A color and texture based hierarchical k-NN approach to the classification of non-melanoma skin lesions. In: Color Medical Image Analysis, pp. 63–86. IEEE (2013)
2. de Brebisson, A., Montana, G.: Deep neural networks for anatomical brain segmentation. In: CVPR Workshops (2015)
3. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR (2005)
4. Academic Grade Pay Commission: Productivity commission: Health workforce (2014)
5. Demyanov, S., Chakravorty, R., Abedini, M., Halpern, A., Garnavi, R.: Classification of dermoscopy patterns using deep convolutional neural networks. In: ISBI (2016)
6. Gao, Y., Beijbom, O., Zhang, N., Darrell, T.: Compact bilinear pooling. In: CVPR (2016)
7. Garnavi, R., Aldeen, M., Bailey, J.: Computer-aided diagnosis of melanoma using border- and wavelet-based texture analysis. IEEE Trans. Inf. Technol. Biomed. 16(6), 1239–1252 (2012)
8. Ge, Z., Demyanov, S., Bozorgtabar, B., Mani, A., Chakravorty, R., Adrian, B., Garnavi, R.: Exploiting local and generic features for accurate skin lesions classification using clinical and dermoscopy imaging. In: ISBI (2017)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
10. Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV (2015)
11. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2014)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
13. American Cancer Society: Cancer facts & figures 2016 (2016)
14. Watts, C.G., Cust, A.E., Menzies, S.W., Mann, G.J., Morton, R.L.: Cost-effectiveness of skin surveillance through a specialized clinic for patients at high risk of melanoma. J. Clin. Oncol. 35(1), 63–71 (2016)
15. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)

Boundary Regularized Convolutional Neural Network for Layer Parsing of Breast Anatomy in Automated Whole Breast Ultrasound

Cheng Bian1, Ran Lee1, Yi-Hong Chou2, and Jie-Zhi Cheng3(✉)

1 School of Biomedical Engineering, Shenzhen University, Shenzhen 518060, China
2 Department of Radiology, Taipei Veterans General Hospital, Taipei 11217, Taiwan
3 Department of Electrical Engineering, Chang Gung University, Taoyuan 33302, Taiwan
  [email protected]

Abstract. A boundary regularized deep convolutional encoder-decoder network (ConvEDNet) is developed in this study to address the difficult anatomical layer parsing problem in the noisy Automated Whole Breast Ultrasound (AWBUS) images. To achieve better network initialization, a two-stage adaptive domain transfer (2DT) is employed to land the VGG-16 encoder on the AWBUS domain with the bridge of network training for an AWBUS edge detector. The knowledge-transferred encoder is denoted as VGG-USEdge. To further augment the training of the ConvEDNet, a deep boundary supervision (DBS) strategy is introduced to regularize the feature learning for better robustness to speckle noise and the shadowing effect. We argue that simply counting on the image context cue, which can be learnt with the guidance of label maps, may not be sufficient to deal with the intrinsic noisy property of ultrasound images. With the regularization of the boundary cue, the segmentation learning can be boosted. The efficacy of the proposed 2DT-DBS ConvEDNet is corroborated with an extensive comparison to state-of-the-art deep learning segmentation methods. The segmentation results may assist clinical image reading, particularly for junior medical doctors and residents, and help to reduce false-positive findings from a computer-aided detection scheme.

Keywords: Segmentation · Ultrasound · Breast · Deep learning

1 Introduction

Automated whole breast ultrasound (AWBUS) is a new medical imaging modality approved by the FDA in 2012. The AWBUS technology can automatically depict the whole anatomy of the breast in a 3D volume, and hence enables the chance of offline thorough image reading. However, the advantage of volumetric imaging may also introduce more workload for radiologists. Even for a senior radiologist, the reading of an AWBUS volume can take tens of minutes to reach a confident diagnostic workup, due to the large amount of image information and the difficulty of interpreting ultrasound images. Accordingly, a manpower shortage of radiologists can possibly be expected with the popularization of this new imaging technology. To improve the reading efficiency, an automatic
segmentation method is proposed in this study to parse the AWBUS images into the breast anatomic layers of subcutaneous fat, breast parenchyma, pectoralis muscles and chest wall. The layer decomposition is shown in Fig. 1. The layer parsing of breast anatomy in the AWBUS images can assist clinical image reading for less-experienced radiologists and residents. Meanwhile, the breast density, which is an important biomarker for cancer risk [1, 13], can be easily computed with the parsing results. In the context of computer-aided detection, the layer parsing may also help to exclude false-positive detections [2].

Fig. 1. Illustration of anatomical layers in AWBUS. (A), (B), (C) and (D) indicate the layers of subcutaneous fat, breast parenchyma, muscle and chest wall, respectively. The green lines are the septa boundaries of the layers. The red dotted circle indicates a significant shadowing effect, whereas the blue dotted rectangles suggest regions that are difficult for layer differentiation.

Referring to Fig. 1, the segmentation task for layer parsing in AWBUS images can be very challenging. It must not only deal with the intrinsic low-quality properties of ultrasound images, such as speckle noise and the shadowing effect, but also tackle the overlapping distributions of echogenicity among the different breast anatomical layers. The low image quality and echogenicity overlapping problems may also lead to ill-defined septa boundaries in places between the consecutive layers, see Fig. 1, and hence render the layer parsing task more problematic. On the other hand, the appearance and morphology of the breast anatomic layers can vary significantly from person to person. For people with lower breast density, the fat layer can be thicker, whereas the AWBUS images that depict dense breasts may have larger breast parenchyma. Therefore, the issue of high inter-subject variability of the anatomical structures should also be well considered in the design of the layer parsing algorithm.

In the literature, most works focus on the segmentation of breast lesions in ultrasound images [14, 15]. To the best of our knowledge, there is little related work on breast anatomy parsing in AWBUS images. In this study, we propose to leverage the deep convolutional encoder-decoder network, ConvEDNet for short [3–6], with the two-stage domain transfer (2DT) and deep boundary supervision (DBS), i.e., deep supervision [7] with a boundary cue, techniques for layer parsing in the AWBUS images. The
ConvEDNet [3–6] can perform end-to-end training for semantic segmentation, and is typically constituted of convolutional (encoder) and deconvolutional (decoder) paths, which learn useful object features and restore the object geometry and morphology at the resolution of the input images, respectively. The learning of a ConvEDNet is mainly based on image context cues with the guidance of object labels [3–6]. However, as discussed earlier, simple usage of the image context cues may not be sufficient to address the issues of low image quality, ill-defined septa boundaries, etc. Accordingly, we further incorporate the boundary cue drawn by radiologists with the auxiliary learning techniques of 2DT and DBS to boost the training of the ConvEDNet. The cues of boundary and image context are complementary to each other, and can be synergized to achieve promising image analysis performance [8]. The details of 2DT and DBS will be elaborated later.

The proposed 2DT-DBS ConvEDNet is extensively compared with the state-of-the-art ConvEDNets, i.e., FCN [3], DeconvNet [4], SegNet [5] and U-Net [6]. We also perform ablation experiments to illustrate the effectiveness of the implementation of the 2DT and DBS techniques. One related deep learning method [8], which also fuses the image context and boundary cues for the training of an FCN in the multi-task paradigm, is implemented for comparison. Specifically, in [8] the object labeling and boundary delineation are treated as two tasks to co-train the FCN. Our formulation, on the other hand, adopts the deep supervision strategy to augment the feature learning in the encoder path with the auxiliary boundary cue that encodes the object geometry and morphology. With the extensive experimental comparison, it will be shown that the proposed 2DT-DBS ConvEDNet can outperform the other baseline methods for the layer parsing of breast anatomy in AWBUS.

2 Method

The architecture of our network is illustrated in Fig. 2. The mainstream architecture is a ConvEDNet. The segmentation is based on 2D AWBUS images in this study. The data annotation and the learning/testing of the layer parsing methods are performed independently on the sagittal and axial 2D views of the AWBUS images, and the final annotation and layer parsing results are reached by averaging the three septa boundaries of the four layers from the two corresponding boundaries of the two 2D views. Since the ultrasound data are relatively noisy, the segmentation capability of the encoder-decoder path in a ConvEDNet may not be sufficient to address the challenging issues in our problem. In this study, we employ the 2DT and DBS to augment the network training. As shown in Fig. 2, our network is equipped with five auxiliary side networks to impart the boundary knowledge to regularize the feature learning.

The computational breast anatomy decomposition in the AWBUS images is formulated as a pixel classification problem with the four classes of subcutaneous fat, breast parenchyma, pectoralis muscle, and chest wall. Given the annotated label map set, C, and the original 2D AWBUS image set, X, the training of the ConvEDNet seeks proper neural parameters, W_c, with the minimization of the loss function:


Fig. 2. Our boundary regularized ConvEDNet architecture. For the encoder layer size of each side network, N is the size of the connecting layer of mainstream encoder.

\mathcal{L}(C, X; W_c) = \mathcal{L}_c(C, X; W_c) + \|W_c\|_2 \quad (1)

where \mathcal{L}_c(\cdot) is the cross-entropy function [12], and \|\cdot\|_2 is the L2 norm for regularization. The minimum of the loss function (1) can be sought by stochastic gradient descent for the end-to-end learning of segmentation.

2.1 Our Mainstream ConvEDNet (MConvEDNet)

Similar to [4], the encoder of the MConvEDNet is composed of the VGG-16 [9] net with the last classification layer removed, see Fig. 2. We change the kernel size of conv6 and deconv6 to 5 to fit our data. The unpooling layers at the decoder are paired with the max-pooling layers of the encoder. The locations of the maximum activations at the max-pool layers are memorized with switch variables to assist the unpooling process.

2.2 Two-Stage Domain Transfer (2DT)

Since the cost for the collection and annotation of medical images is relatively expensive, the common approach to attain good performance with deep learning techniques is to initialize the network with parameters learnt from natural images [10]. However, considering that the domains of natural and AWBUS images are quite different, we propose to engage the knowledge transfer of model parameters in two stages. Specifically, the first stage of domain transfer employs the VGG-16 [9] as the encoder, followed by a decoder with a single deconvolutional layer, for anatomical edge detection in AWBUS images. The learning of the edge detector is guided by the boundary maps, where the three septa boundaries of the four layers are drawn. To boost the learning of edge detection, deep supervision with the boundary maps is also implemented with the same 5 auxiliary side networks shown in Fig. 2. This type of edge detector network is also called the Holistically-nested Edge Detector (HED) net [11]. The training for the AWBUS edge detector will land the VGG-16 encoder into the AWBUS domain to be


familiar with the presence of speckle noise and the shadowing effect. Similar to [11], the AWBUS edge detection is formulated as a 2-class differentiation with the edge label as 1 and the non-edge label as 0. The learnt encoder for the AWBUS edge detector is denoted as VGG-USEdge. The tasks of anatomic edge detection and layer parsing may relate to each other but remain different. Therefore, the VGG-USEdge encoder may provide more useful prior knowledge than VGG-16. Accordingly, VGG-USEdge is applied to initialize the encoder network of our MConvEDNet.

2.3 Deep Boundary Supervision (DBS)

As can be found in Fig. 2, the MConvEDNet is relatively deep and hence the gradient vanishing issue can possibly occur in the network training. Meanwhile, the learning process can also be thwarted by the difficult issues discussed earlier. To further boost the learning process, the deep supervision strategy is employed. Here, we introduce the cue of layer boundaries with the deep supervision strategy to improve the learning. To further illustrate the efficacy of the boundary cue, we implement two comparison options. The first option is deep supervision with the label map cue on MConvEDNet. The second is to perform the DBS on both the encoder and decoder, which in total has 10 auxiliary side networks. It will be shown that the pure DBS can boost the segmentation better than the other deep supervision strategies. The DBS is realized by adding auxiliary side networks to the endings of the 5 layers in the encoder of MConvEDNet. The auxiliary side networks are shallow and simply consist of coupled single convolutional and deconvolutional layers, see Fig. 2. Given the neural parameters, W_e^p, of an auxiliary side network p, 1 ≤ p ≤ Q, where Q is the total number of convolutional layers at the encoder, and the edge map set of layer boundaries, E, the learning of the end-to-end segmentation with the DBS can be realized by the minimization of the reformulated loss function

\mathcal{L}(C, X; W_c) = \mathcal{L}_c(C, X; W_c) + \|W_c\|_2 + \sum_{p=1}^{Q} \mathcal{L}_e(E, X; W_e^p) \quad (2)

where e (⋅) is the class-balanced cross entropy function for the auxiliary side networks that considers the non-balance issue between edge and non-edge classes [11]. With the minimization of cost function, the encoder network be equipped with the capability to drive the prediction masks of mainstream network as close to the manual label maps as possible, and keep the output edge maps of the side networks not deviating from the annotated edge maps significantly. For the comparing implementation of deep supervi‐ sion with label map cue, we can simply replace the training map set E with C. The deep supervision with both label and edge map cues need two parallel side networks which consider training map sets of C and E, respectively. 2.4 Implementation Details The learning rate of the mainstream network is initialized as 0.01, while the weight decay and momentums parameters are set as 0.0005 and 0.9, respectively. For the auxiliary


networks of deep supervision, the learning rates are 10^{-6}, and the weight decay and momentum parameters are the same as those of MConvEDNet. No dropout is implemented, but batch normalization is adopted. The architectures of the auxiliary side networks for the edge detector net and MConvEDNet are the same for simplicity, but with different random initialization of the network parameters. Our method is developed based on the Caffe environment [12].
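For concreteness, the sketch below shows one way the combined loss of Eq. (2) could be assembled, with an HED-style class-balanced cross-entropy for the auxiliary edge outputs. It is an illustrative PyTorch re-implementation under stated assumptions, not the authors' Caffe code; all function and variable names are ours.

```python
# Illustrative PyTorch-style sketch of the deep boundary supervision loss of
# Eq. (2); layer/variable names are hypothetical, not the authors' code.
import torch
import torch.nn.functional as F

def class_balanced_bce(edge_logits, edge_gt):
    """HED-style class-balanced cross entropy for edge vs. non-edge pixels."""
    pos = edge_gt.sum()
    neg = edge_gt.numel() - pos
    beta = neg / (pos + neg)                       # weight for the rare edge class
    weight = beta * edge_gt + (1.0 - beta) * (1.0 - edge_gt)
    return F.binary_cross_entropy_with_logits(edge_logits, edge_gt, weight=weight)

def dbs_loss(main_logits, label_map, side_logits_list, edge_map,
             main_params, weight_decay=5e-4):
    loss = F.cross_entropy(main_logits, label_map)            # mainstream term
    loss = loss + weight_decay * sum(p.pow(2).sum() for p in main_params)
    for side_logits in side_logits_list:                      # Q auxiliary nets
        loss = loss + class_balanced_bce(side_logits, edge_map)
    return loss
```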

3 Dataset and Annotation

The AWBUS data were collected from Taipei Veterans General Hospital, Taipei, Taiwan, with the approval of its institutional review board (IRB). 16 AWBUS volumes acquired from 16 subjects are involved in this study. The subject ages range from 30 to 62. The dark background regions outside the body are excluded from all AWBUS images, leaving image contents of size 160 × 160. The annotation of the boundaries of the breast anatomical layers in the AWBUS images was performed by a radiologist with 5 years of experience in breast ultrasound. The annotated data were further reviewed by a senior radiologist with more than 30 years of experience in medical ultrasound to ensure the correctness of the annotation. Each AWBUS volume contains around 170–200 2D images, and the overall number of 2D images is 3134.

4 Experiments and Results

The evaluation of the AWBUS image segmentation is based on leave-one-out cross validation (LOO-CV). The basic unit of LOO-CV is an AWBUS volume rather than a single 2D image. Two assessment metrics, intersection over union (IoU) [4] and curve distance (CD) [15], are adopted for the quantitative comparison between the computerized segmentation results and the manual annotations. The CD is the averaged absolute distance between two corresponding curves. The state-of-the-art ConvEDNets FCN, DeconvNet, SegNet and U-Net are also implemented as baseline methods for comparison. Meanwhile, the multi-task method [8], denoted as "Multitask", which fuses image context and boundary cues, is also implemented for comparison. The combinational options of 2DT and DBS are also implemented to show the effect of each technique on our problem. As discussed in Sect. 2.3, to illustrate the efficacy of DBS, the implementation of DBS on both encoder and decoder (FullyDBS) and deep supervision with the label map (DLS) are also performed. To show the effectiveness of 2DT, we also implement random parameter initialization (RandI) for the DBS+ConvEDNet. Table 1 reports the mean ± standard deviation statistics of the CD and IoU metrics for the segmentation results of each implementation over the LOO-CV scheme. Specifically, the segmentation performances w.r.t. the three septa boundaries between the layers (CD) and the four anatomical layers (IoU) are listed in the columns of Table 1. The layers A, B, C and D represent fat, breast parenchyma, pectoralis muscles and chest wall, respectively. Lines 1, 2 and 3 are the septa boundaries w.r.t. the layer pairs "A/B", "B/C", and "C/D". It is worth noting that our MConvEDNet is based on the


DeconvNet [4]. For visual comparison, the segmentation results of all methods involved in this study are shown in Fig. 3.

Table 1. Segmentation performances of different methods. "Main" represents our mainstream ConvEDNet (MConvEDNet). It is worth noting that the encoders of "DeconvNet" and "DBS+Main" are initialized with VGG-16.

Method         | CD (pixel)                           | IoU (%)
               | Line1      | Line2      | Line3      | A           | B           | C           | D
FCN [3]        | 8.0 ± 4.8  | 9.2 ± 4.9  | 10.6 ± 7.7 | 74.2 ± 11.5 | 50.2 ± 20.2 | 62.8 ± 13.7 | 72.5 ± 16.6
SegNet [5]     | 7.8 ± 7.0  | 11.6 ± 6.8 | 13.2 ± 8.9 | 75.3 ± 14.3 | 50.9 ± 18.6 | 54.7 ± 18.6 | 67.1 ± 17.7
U-Net [6]      | 6.36 ± 5.9 | 9.98 ± 6.3 | 11.9 ± 8.4 | 76.7 ± 13.2 | 53.8 ± 17.9 | 57.4 ± 17.5 | 68.3 ± 18.5
DeconvNet [4]  | 4.7 ± 4.1  | 6.6 ± 4.9  | 9.2 ± 7.9  | 82.8 ± 9.4  | 67.0 ± 16.8 | 69.3 ± 15.6 | 74.7 ± 16.6
DLS+Main       | 5.1 ± 4.3  | 6.7 ± 4.3  | 10.0 ± 7.5 | 81.5 ± 10.4 | 65.7 ± 15.9 | 67.8 ± 13.0 | 73.9 ± 16.5
FullyDBS+Main  | 5.9 ± 5.2  | 7.5 ± 4.6  | 10.4 ± 7.9 | 79.7 ± 10.7 | 61.8 ± 17.9 | 65.9 ± 14.0 | 72.3 ± 16.4
DBS+Main       | 4.2 ± 3.6  | 5.9 ± 4.1  | 9.1 ± 7.2  | 84.4 ± 8.3  | 69.3 ± 15.5 | 70.2 ± 12.9 | 75.5 ± 16.1
DBS+Main+2DT   | 3.9 ± 3.6  | 5.6 ± 3.9  | 8.3 ± 7.0  | 86.8 ± 7.9  | 72.2 ± 14.6 | 72.4 ± 12.6 | 76.1 ± 15.9
DBS+Main+RandI | 10.5 ± 7.6 | 13.7 ± 7.1 | 14.6 ± 9.7 | 69.1 ± 14.4 | 60.6 ± 17.5 | 50.8 ± 19.1 | 64.5 ± 18.7
Multitask [8]  | 4.9 ± 4.1  | 6.9 ± 4.3  | 9.6 ± 7.4  | 82.2 ± 9.2  | 64.9 ± 16.6 | 67.4 ± 13.5 | 74.4 ± 16.1
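The two metrics reported in Table 1 could be computed as in the following sketch; the curve-distance routine assumes that the two boundary curves are sampled at the same image columns, which is a simplification of the exact protocol of [15].

```python
# Sketch of the two evaluation metrics (IoU and CD); the CD computation is a
# simplified reading, not necessarily the exact protocol of [15].
import numpy as np

def iou(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0

def curve_distance(pred_curve, gt_curve):
    """Average absolute distance (in pixels) between two boundary curves,
    assumed to be sampled at the same image columns."""
    return np.mean(np.abs(np.asarray(pred_curve) - np.asarray(gt_curve)))
```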

Fig. 3. Visual comparison of the layer parsing results from different implementations (panels: original image, manual outlines, FCN, SegNet, U-Net, DeconvNet, DLS+Main, FullyDBS+Main, DBS+Main, DBS+Main+2DT, DBS+Main+RandI, Multitask). The layer boundaries of the computerized results are drawn in red, whereas the manual outlines of the radiologists are drawn in green.

5 Discussion and Conclusion

As can be observed from Fig. 3 and Table 1, the FCN segmentation results are relatively unstable; some regions are obviously mislabeled. The FCN may therefore be less robust to the low ultrasound image quality. On the other hand, the DeconvNet is relatively more suitable for our problem because of its deep decoding path. The SegNet results appear worse than those of FCN for the muscle layer (C) and the septa boundary between the muscle and chest wall layers. This suggests that fixing the feature map size may not help for our problem. The results of U-Net lie in between those of SegNet and FCN, even though the skip connection strategy is adopted in U-Net to alleviate the gradient vanishing problem. Therefore, the feature learning is relatively


difficult even with the skip connections between the encoder and decoder correspondences. Accordingly, the incorporation of the boundary cue may help to improve ultrasound image segmentation. It can be found in Table 1 that the best segmentation performance is achieved by our method "DBS+Main+2DT" for both the IoU and CD metrics. This suggests that our 2DT-DBS ConvEDNet has a better capability to withstand speckle noise, shadowing and the other challenges illustrated in the introduction. Based on the extensive comparisons with the other baseline implementations, the efficacy of the 2DT-DBS ConvEDNet for the layer parsing problem is corroborated.

Acknowledgement. This work was supported by the National Natural Science Funds of China (No. 61501305), the Shenzhen Basic Research Project (No. JCYJ20150525092940982), and the Natural Science Foundation of SZU (No. 2016089).

References
1. McCormack, V.A., dos Santos Silva, I.: Breast density and parenchymal patterns as markers of breast cancer risk: a meta-analysis. Cancer Epidemiol. Biomark. Prev. 15, 1159–1169 (2006)
2. Tan, T., et al.: Chest wall segmentation in automated 3D breast ultrasound scans. Med. Image Anal. 17, 1273–1281 (2013)
3. Long, J., et al.: Fully convolutional networks for semantic segmentation. In: CVPR 2015, pp. 3431–3440 (2015)
4. Noh, H., et al.: Learning deconvolution network for semantic segmentation. In: ICCV 2015, pp. 1520–1528 (2015)
5. Badrinarayanan, V.: SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293 (2015)
6. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10.1007/978-3-319-24574-4_28
7. Lee, C.-Y., et al.: Deeply-supervised nets. In: AISTATS 2015, June 2015
8. Chen, H., et al.: DCAN: deep contour-aware networks for accurate gland segmentation. In: CVPR 2016, pp. 2487–2496 (2016)
9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
10. Shin, H.-C., et al.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE TMI 35, 1285–1298 (2016)
11. Xie, S., Tu, Z.: Holistically-nested edge detection. In: ICCV 2015, pp. 1395–1403 (2015)
12. Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: ACM MM 2014, pp. 675–678 (2014)
13. Gubern-Mérida, A., et al.: Breast segmentation and density estimation in breast MRI: a fully automatic framework. IEEE JBHI 19, 349–357 (2015)
14. Huang, Q., et al.: Breast ultrasound image segmentation: a survey. IJCARS 12, 1–5 (2017)
15. Cheng, J.-Z., et al.: ACCOMP: augmented cell competition algorithm for breast lesion demarcation in sonography. Med. Phys. 37(12), 6240–6252 (2010)

Zoom-in-Net: Deep Mining Lesions for Diabetic Retinopathy Detection

Zhe Wang1(B), Yanxin Yin2, Jianping Shi3, Wei Fang4, Hongsheng Li1, and Xiaogang Wang1

1 The Chinese University of Hong Kong, Shatin, Hong Kong
[email protected]
2 Tsinghua University, Beijing, China
3 SenseTime Group Limited, Beijing, China
4 Sir Run Run Shaw Hospital, Hangzhou, China

Abstract. We propose a convolutional neural network based algorithm for simultaneously diagnosing diabetic retinopathy and highlighting suspicious regions. Our contributions are twofold: (1) a network termed Zoom-in-Net, which mimics the zoom-in process of a clinician examining retinal images. Trained with only image-level supervision, Zoom-in-Net can generate attention maps that highlight suspicious regions, and it predicts the disease level accurately based on both the whole image and its high-resolution suspicious patches. (2) Only four bounding boxes generated from the automatically learned attention maps are enough to cover 80% of the lesions labeled by an experienced ophthalmologist, which shows the good localization ability of the attention maps. By clustering features at high-response locations on the attention maps, we discover meaningful clusters which contain potential lesions of diabetic retinopathy. Experiments show that our algorithm outperforms the state-of-the-art methods on two datasets, EyePACS and Messidor.

1 Introduction

Identifying suspicious regions in medical images is of significant importance since it provides intuitive illustrations for physicians and patients of how the diagnosis is made. However, most previous works rely on strong supervision that requires lesion location information. This largely limits the size of the dataset, as annotations in medical imaging are expensive to acquire. Therefore, it is necessary to develop algorithms which can make use of large datasets with weak supervision for simultaneous classification and localization. In this work, we propose a general weakly supervised learning framework, Zoom-in-Net, based on convolutional neural networks (CNN). The proposed method is accurate in classification and, meanwhile, it can also automatically discover the


lesions in the images at a high recall with only several bounding boxes. This framework can be easily extended to various classification problems and provides convenient visual inspection for doctors. To verify the effectiveness of our method, we aim to solve the problem of diabetic retinopathy (DR) detection, as it is an important problem and a large-scale dataset [1] with image-level labels is publicly available online. DR is an eye disease caused by diabetes. Today, it is the leading cause of blindness in the working-age population of the developed world. Treatments can be applied to slow down or stop further vision loss once the disease is diagnosed. However, DR has no early warning sign, and the diagnosis is a time-consuming, manual process that requires an experienced clinician to examine the retinal image. Because of this delay, it is often too late to provide effective treatment. In order to alleviate the workload of human interpretation, various image analysis algorithms have been proposed over the last few decades. Early approaches [2,6] use hand-crafted features to represent the images, whose main bottleneck is the limited expressive power of the features. Recently, CNN-based methods [3,4,9] have dramatically improved the performance of DR detection. Most of them treat the CNN as a black box, which lacks intuitive explanation. Few previous works localize the lesions with image-level supervision, such as visualizing the evidence hotspots for spinal MRI classification [10], but they did not use the hotspots to further improve performance. The proposed Zoom-in-Net has an attention mechanism which generates attention maps using only image-level supervision. The attention map is a heatmap that indicates which pixels play more important roles in making the image-level decision. In addition, our Zoom-in-Net mimics the behavior of clinicians, who skim the DR images to identify suspicious regions and then zoom in to verify. Zoom-in-Net is validated on the EyePACS and Messidor datasets. It outperforms state-of-the-art methods as well as general physicians. Moreover, we also validated the attention localization accuracy on around 200 images labeled by an experienced ophthalmologist. Our attention localization reaches a recall of 0.82, which proves to be useful for doctors. The clustered regions at high-response locations of the attention maps show meaningful lesions of diabetic retinopathy.

2 Architecture of Zoom-in-Net

The proposed Zoom-in-Net learns from image-level supervision for DR detection, yet is equipped with the ability to both classify images and localize the suspicious regions. It mimics the zoom-in process of a clinician examining an image: highly suspicious regions are selected on a high-resolution image and a decision is made according to both the global image and the local patches. Our Zoom-in-Net consists of three parts, as shown in Fig. 1: a main network (M-Net) for DR classification; a sub-network, the Attention Network (A-Net), to generate attention maps; and another sub-network, the Crop-Network (C-Net), which takes high-resolution patches with the highest attention values as input and corrects the predictions from M-Net. Our illustration is based on the 5-level DR detection


Fig. 1. An overview of Zoom-in-Net. It consists of three sub-networks. M-Net and C-Net classify the image and the high-resolution suspicious patches, respectively, while A-Net generates the gated attention maps for localizing suspicious regions and mining lesions.

task, i.e., 0 - No DR; 1 - Mild; 2 - Moderate; 3 - Severe; and 4 - Proliferative DR. It can be easily adapted to different classification tasks.

Main Network (M-Net). The M-Net is a CNN which takes an image as input and processes it by stacks of linear operations, including convolutions and batch normalization, and non-linear operations such as pooling and Rectified Linear Units. We adopt the Inception-ResNet [15] as the architecture of M-Net. The intermediate feature maps produced by the layer inception resnet c 5 elt, i.e. M ∈ R^{1024×14×14}, separate the M-Net into two parts, as shown in Fig. 1. M is followed by a fully connected layer and mapped into a probability vector y_M ∈ R^5, which indicates the probability of the image belonging to each disease level. M is further used as input to the Attention Network. As the Kaggle challenge provides both the left and right eyes of a patient, we also utilize the correlation between the two eyes. Statistics show that more than 95% of the eye pairs have scores that differ by at most 1. Therefore, we concatenate the features of both eyes from M-Net together and train the network to take advantage of this in an end-to-end manner.

Attention Network (A-Net). The A-Net takes the feature maps M as input. It consists of two branches. The first branch, A-Net part I, is a convolution layer with 1×1 kernels. It can be regarded as a linear classifier applied to each pixel and produces score maps S ∈ R^{5×14×14} for the 5 disease levels. The second branch, A-Net part II, generates attention gate maps with three convolution layers, as shown in Fig. 2. In particular, it produces a separate attention gate map for each disease level. Each attention map is obtained by a spatial softmax operator. Intuitively, the spatial softmax forces the attention values to compete with each other and to concentrate only on the most informative regions. Therefore, by regarding


Fig. 2. Structure of A-Net part II. It takes in the feature maps M from M-Net and generates an attention map A. The kernel size and the number of channels are marked at the bottom of the convolution layers.

the attention map A ∈ R^{5×14×14} as a gate, the output of the A-Net is calculated as

G^l = S^l ⊗ A^l    (1)

where G^l, S^l and A^l are the gated feature for A-Net, the score map and the attention map of the l-th class, respectively, and ⊗ denotes element-wise multiplication. We can then calculate the final score vector y_A by sum pooling G globally, i.e., y_A^l = ∑_{i,j} G^l_{i,j}.

Crop Network (C-Net). We further improve the accuracy by zooming in on the suspicious attention regions. Specifically, given the gated attention maps G ∈ R^{5×14×14}, we first resize them to the same size as the input image. Then we use a greedy algorithm to sample the regions. At each iteration, we record the location of the top response on G, and then mask out the s × s region around it to avoid this region being selected again. We repeat this process until a total of N coordinates are recorded or the maximum attention response is reached. An example is shown in Fig. 3. With the recorded locations, we crop the corresponding patches from a higher-resolution image for C-Net. The C-Net has a structure similar to [16]. However, it differs from [16] in that it combines the features d̂_C of all patches at the layer "global pool". Since some patches contain no abnormalities, we apply an element-wise max on the features d̂_C to extract the most informative feature. This feature is then concatenated with the feature d_M from M-Net and classified by C-Net.
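A small numpy sketch of the gated attention of Eq. (1) and of the greedy region sampling described above is given below. The shapes follow the text, while the function names and the exact masking details are illustrative assumptions rather than the authors' implementation.

```python
# Numpy sketch of gated attention (Eq. 1) and greedy region selection; this is
# an illustration of the described procedure, not the authors' code.
import numpy as np

def gated_attention(score_maps, attn_logits):
    """score_maps, attn_logits: arrays of shape (5, 14, 14)."""
    flat = attn_logits.reshape(attn_logits.shape[0], -1)
    attn = np.exp(flat - flat.max(axis=1, keepdims=True))          # spatial softmax
    attn = (attn / attn.sum(axis=1, keepdims=True)).reshape(score_maps.shape)
    gated = score_maps * attn                                       # G^l = S^l (x) A^l
    class_scores = gated.reshape(gated.shape[0], -1).sum(axis=1)    # sum pooling
    return gated, class_scores

def greedy_regions(heatmap, n_regions=4, s=200):
    """Pick top-response locations on the resized gated map, masking out an
    s x s window around each pick so it is not selected twice."""
    h = heatmap.astype(float).copy()
    coords = []
    for _ in range(n_regions):
        y, x = np.unravel_index(np.argmax(h), h.shape)
        coords.append((y, x))
        y0, y1 = max(0, y - s // 2), min(h.shape[0], y + s // 2)
        x0, x1 = max(0, x - s // 2), min(h.shape[1], x + s // 2)
        h[y0:y1, x0:x1] = -np.inf
    return coords
```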

3 Attention Localization Evaluation and Understanding

Attention Localization Evaluation. To verify whether the high-response regions contain clues indicating the disease level of the images, we asked an experienced ophthalmologist to label the lesions of 182 retinal images from EyePACS. The ophthalmologist was asked to draw bounding boxes that tightly cover the lesions related to diabetic retinopathy. A total of 306 lesions were labeled. We calculate the Intersection over Minimum (IoM) between a ground truth box and a sampled box. The sampled boxes are exactly the same 4 boxes used in C-Net. If the IoM is above a threshold, we consider the sampled box to be correct. In this way, we plot two curves of the recall for person and for box vs. the threshold, respectively, in Table 1. The recall for person means that as long as one ground truth box of a person is retrieved by the sampled boxes, we treat this person as correct. Therefore, it is higher than the recall for box. Note


Fig. 3. From left to right: image, gated attention maps of level 1–4 and the selected regions of the image. The level 0 gated attention map has no information and is ignored.

that we achieve a recall of 0.76 and 0.83 at an IoM threshold of 0.3 for box and person, respectively. This indicates that our A-Net can localize the lesions accurately given only image-level supervision, and partly explains why the C-Net can help improve the predictions made by M-Net. This is remarkable given that our network is not trained on a single annotated box. We believe increasing the resolution of the attention maps (14×14) could further boost the localization precision.

Attention Visual Understanding. Furthermore, to better understand the network, we propose a clustering-based method to visualize the top-response locations on the gated attention maps. We partition the features at these locations on the feature maps M into clusters using the AP clustering algorithm [8], which is free of a pre-defined cluster number. We can retrieve their corresponding image regions, used as C-Net input, and visualize some of them in Fig. 4. Several clusters are discovered with meaningful lesions such as microaneurysms, blot/flame haemorrhages and hard/soft exudates. This may be very appealing, as doctors may find new lesions by examining the clustering results of our method.
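For reference, the Intersection over Minimum used above can be computed as in the following sketch, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates (the box format is our assumption).

```python
# Sketch of the Intersection over Minimum (IoM) between two boxes.
def iom(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    smaller = min(area_a, area_b)
    return inter / smaller if smaller > 0 else 0.0
```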

4 Quantitative Evaluation

Datasets and Evaluation Protocols. We have evaluated the effectiveness of our Zoom-in-Net on two datasets, EyePACS and Messidor. The Kaggle Diabetic Retinopathy Detection Challenge (EyePACS) is sponsored by the California Healthcare Foundation. It provides 35k/11k/43k images for the train/val/test sets, respectively, captured under various conditions and with various devices. A left and right field is provided for every subject, and a clinician rated the presence of diabetic retinopathy in each image on a scale of 0 to 4. For comparison, we adopt the same official protocol, the quadratic weighted kappa score, for evaluation. The Messidor dataset is a public dataset provided by the Messidor program partners [7]. It consists of 1200 retinal images, and for each image two grades, the retinopathy grade and the risk of macular edema, are provided. Only the retinopathy grades are used in the present work.

Implementation Details. The preprocessing includes cropping the images to remove the black borders, which contain no information. Data augmentation is performed on the training set of EyePACS by random rotations (0°/90°/180°/270°)


Fig. 4. Examples of automatically discovered suspicious regions obtained by clustering features at high-response locations. Some clusters are very meaningful, such as microaneurysms, blot haemorrhages, flame haemorrhages, soft exudates and hard exudates.

Table 1. AUC for normal/abnormal.

Table 2. Comparison to the top-3 entries of the Kaggle challenge.

Algorithms        | Val set | Test set
Min-pooling [1]   | 0.86    | 0.849
o_O               | 0.854   | 0.844
Reformed Gamblers | 0.851   | 0.839
M-Net             | 0.832   | 0.825
M-Net+A-Net       | 0.837   | 0.832
Zoom-in-Net       | 0.857   | 0.849
Ensembles         | 0.865   | 0.854

and random flips. The training of the proposed Zoom-in-Net includes three phases. We first train M-Net, which is pretrained on ImageNet [13], and then train A-Net while fixing the parameters of M-Net. The C-Net is trained last, together with the other two, to obtain the final Zoom-in-Net. During training, we adopt mini-batch stochastic gradient descent for optimization. We use a gradually decreasing learning rate starting from 10^{-5}, with a step size of 20k and a momentum of 0.9. The whole framework is trained with the Caffe library [11].

Experiment Results on the EyePACS Dataset. We thoroughly evaluate each component of Zoom-in-Net on the EyePACS dataset. As can be seen in Table 2, the M-Net alone achieves 0.832/0.825 on the val/test sets, respectively. Adding the branch of A-Net improves the score by only 0.5% on both sets. This is not surprising, as no additional information is added by the A-Net. Moreover, we use the gated attention maps generated by A-Net to extract suspicious regions and train C-Net. We observed that on an image resized to 492 × 492, the area of pathological regions is usually smaller than 200 × 200.
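The scores reported here and in Table 2 follow the challenge's quadratic weighted kappa; a standard formulation of this metric is sketched below (this is not the official evaluation script).

```python
# Standard formulation of the quadratic weighted kappa (sketch).
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    O = np.zeros((n_classes, n_classes))                 # observed agreement matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)   # expected agreement
    i, j = np.meshgrid(np.arange(n_classes), np.arange(n_classes), indexing="ij")
    W = (i - j) ** 2 / (n_classes - 1) ** 2              # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()
```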

Table 3. AUC for referable/non-referable.

Method                 | AUC   | Acc
Lesion-based [12]      | 0.760 | -
Fisher Vector [12]     | 0.863 | -
VNXK/LGI [18]          | 0.887 | 0.893
CKML Net/LGI [18]      | 0.891 | 0.897
Comprehensive CAD [14] | 0.91  | -
Expert A [14]          | 0.94  | -
Expert B [14]          | 0.92  | -
Zoom-in-Net            | 0.957 | 0.911

Table 4. AUC for normal/abnormal.

Method                 | AUC   | Acc
Splat feature/kNN [17] | 0.870 | -
VNXK/LGI [18]          | 0.870 | 0.871
CKML Net/LGI [18]      | 0.862 | 0.858
Comprehensive CAD [14] | 0.876 | -
Expert A [14]          | 0.922 | -
Expert B [14]          | 0.865 | -
Zoom-in-Net            | 0.921 | 0.905

Therefore, we set the region size s to 200 and the number of cropped regions N to 4 throughout the experiments. We cropped 384 × 384 patches from high-resolution images of size 1230 × 1230 as input to C-Net. During the training of the complete Zoom-in-Net, one mini-batch contains a pair of whole images and 4 high-resolution patches for each image. This almost reaches the memory limit of a Tesla K40 GPU card, so we let the network update its parameters after every 12 mini-batches to match the training of M-Net. Finally, the proposed Zoom-in-Net achieves kappa scores of 0.857 and 0.849 on the two sets, comparable to the first-ranked entry Min-pooling (0.86/0.849) [1]. With an ensemble of three models, our final results reach 0.865/0.854.

Experiment Results on the Messidor Dataset. To further evaluate the performance, the proposed Zoom-in-Net is applied to the independent Messidor dataset for DR screening. As Messidor has only 1200 images, which is too small to train CNNs, the authors of [18] suggested extracting features from a network trained on another dataset such as EyePACS and then building classifiers on top. Since Messidor and EyePACS employ different annotation scales (Messidor: 0 to 3, EyePACS: 0 to 4), we follow a protocol similar to [18] and conduct two binary classification tasks (referable vs. non-referable, normal vs. abnormal) to enable evaluation across datasets and prior studies. We extract five-dimensional probability feature vectors from the last layer of Zoom-in-Net, and use Support Vector Machines (SVM) with an RBF kernel, implemented by the LibSVM library in MATLAB [5], for binary classification. For referable/non-referable, Messidor Grades 0 and 1 are considered non-referable, while Grades 2 and 3 are defined as referable to specialists. 10-fold cross-validation on the entire Messidor set is used to be compatible with [12,18]. For normal/abnormal classification, the SVM is trained using features extracted from the training set of EyePACS and tested on the entire Messidor set. Only images graded 0 on EyePACS/Messidor are assigned as normal, otherwise as abnormal. The area under the receiver operating curve (AUC) is used to quantify the performance. Tables 3 and 4 show the results of our method compared with previous studies. To the best of our knowledge, we achieve the highest AUC for both


normal and referral classification on the Messidor dataset. Zoom-in-Net performs comparably to the two experts reported in [14]. At a specificity of 0.5, the sensitivity of Zoom-in-Net is 0.978 and 0.960 for the normal and referral tasks, respectively.
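The Messidor experiments use LibSVM in MATLAB; an equivalent scikit-learn sketch of the referable/non-referable protocol, with placeholder feature and label arrays standing in for the extracted probability vectors, is shown below.

```python
# Scikit-learn sketch of the SVM protocol (the paper uses LibSVM in MATLAB);
# feats/labels here are placeholder arrays, not real extracted features.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

feats = np.random.rand(1200, 5)            # 5-dim probability vectors per image
labels = np.random.randint(0, 2, size=1200)  # 1 if Messidor grade >= 2 (referable)

clf = SVC(kernel="rbf")
auc = cross_val_score(clf, feats, labels, cv=10, scoring="roc_auc")
print("10-fold CV AUC: %.3f +/- %.3f" % (auc.mean(), auc.std()))
```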

5 Conclusions

In this work, we proposed a novel framework, Zoom-in-Net, which achieves state-of-the-art performance on two datasets. Trained with only image-level supervision, Zoom-in-Net can generate attention maps which highlight the suspicious regions. The localization ability of the gated attention maps is validated and found to be promising. Further experiments show that the high-response regions on the gated attention maps correspond to potential lesions of DR, and thus can be used to further boost classification performance.

References
1. https://www.kaggle.com/c/diabetic-retinopathy-detection/leaderboard (2016)
2. Abràmoff, M.D., Reinhardt, J.M., Russell, S.R., Folk, J.C., Mahajan, V.B., Niemeijer, M., Quellec, G.: Automated early detection of diabetic retinopathy. Ophthalmology 117(6), 1147–1154 (2010)
3. Abràmoff, M.D., Lou, Y., Erginay, A., Clarida, W., Amelon, R., Folk, J.C., Niemeijer, M.: Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. IOVS 57(13), 5200–5206 (2016)
4. Chandrakumar, T., Kathirvel, R.: Classifying diabetic retinopathy using deep learning architecture. In: IJERT (2016)
5. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM TIST (2011). http://www.csie.ntu.edu.tw/~cjlin/libsvm
6. Chaum, E., Karnowski, T.P., Govindasamy, V.P., Abdelrahman, M., Tobin, K.W.: Retina (2008)
7. Decencière, E., Zhang, X., Cazuguel, G., Laÿ, B., Cochener, B., Trone, C., Gain, P., Ordonez, R., Massin, P., Erginay, A., et al.: Feedback on a publicly distributed image database: the Messidor database. Image Anal. Stereology 33(3), 231–234 (2014)
8. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007)
9. Gulshan, V., Peng, L., Coram, M., Stumpe, M.C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al.: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316(22), 2402–2410 (2016)
10. Jamaludin, A., Kadir, T., Zisserman, A.: SpineNet: automatically pinpointing classification evidence in spinal MRIs. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 166–175. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_20
11. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: ACM (2014)


12. Pires, R., Avila, S., Jelinek, H., Wainer, J., Valle, E., Rocha, A.: Beyond lesion-based diabetic retinopathy: a direct approach for referral. JBHI 21(1), 193–200 (2015)
13. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)
14. Sànchez, C.I., Niemeijer, M., Dumitrescu, A.V., Suttorp-Schulten, M.S., Abràmoff, M.D., Van, G.B.: Evaluation of a computer-aided diagnosis system for diabetic retinopathy screening on public data. IOVS 52(7), 4866–4871 (2011)
15. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261 (2016)
16. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
17. Tang, L., Niemeijer, M., Reinhardt, J.M., Garvin, M.K., Abràmoff, M.D.: Splat feature classification with application to retinal hemorrhage detection in fundus images. TMI 32(2), 364–375 (2013)
18. Vo, H.H., Verma, A.: New deep neural nets for fine-grained diabetic retinopathy recognition on hybrid color space. In: ISM. IEEE (2016)

Full Quantification of Left Ventricle via Deep Multitask Learning Network Respecting Intra- and Inter-Task Relatedness

Wufeng Xue1,2, Andrea Lum1,2, Ashley Mercado1,2, Mark Landis1,2, James Warrington1,2, and Shuo Li1,2(B)

1 Department of Medical Imaging, Western University, London, ON, Canada
[email protected]
2 Digital Imaging Group, London, ON, Canada

Abstract. Cardiac left ventricle (LV) quantification is among the most clinically important tasks for identification and diagnosis of cardiac diseases, yet it remains a challenge due to the high variability of cardiac structure and the complexity of temporal dynamics. Full quantification, i.e., simultaneously quantifying all LV indices including two areas (cavity and myocardium), six regional wall thicknesses (RWT), three LV dimensions, and one cardiac phase, is even more challenging, since the uncertain relatedness within and between the different types of indices may hinder the learning procedure from better convergence and generalization. In this paper, we propose a newly-designed multitask learning network (FullLVNet), which is constituted by a deep convolutional neural network (CNN) for expressive feature embedding of cardiac structure, two parallel recurrent neural network (RNN) modules that follow for temporal dynamic modeling, and four linear models for the final estimation. During the final estimation, both intra- and inter-task relatedness are modeled to enforce improved generalization: (1) respecting intra-task relatedness, group lasso is applied to each of the regression tasks for sparse and common feature selection and consistent prediction; (2) respecting inter-task relatedness, three phase-guided constraints are proposed to penalize violation of the temporal behavior of the obtained LV indices. Experiments on MR sequences of 145 subjects show that FullLVNet achieves highly accurate prediction with our intra- and inter-task relatedness, leading to MAEs of 190 mm², 1.41 mm and 2.68 mm for average areas, RWT and dimensions, and an error rate of 10.4% for phase classification. This endows our method with great potential for comprehensive clinical assessment of global, regional and dynamic cardiac function.

Keywords: Left ventricle quantification · Recurrent neural network · Multi-task learning · Task relatedness

1 Introduction

Quantification of the left ventricle (LV) from cardiac imaging is among the most clinically important and most frequently demanded tasks for identification and


Fig. 1. Illustration of LV indices to be quantified for short-axis view cardiac images. (a) Cavity (blue) and myocardium (orange) areas. (b) Directional dimensions of the cavity (red arrows). (c) Regional wall thicknesses (red arrows). A: anterior; AS: anteroseptal; IS: inferoseptal; I: inferior; IL: inferolateral; AL: anterolateral. (d) Phase (systole or diastole).

diagnosis of cardiac disease [6], yet it is still a challenging task due to the high variability of cardiac structure across subjects and the complicated global/regional temporal dynamics. Full quantification, i.e., simultaneously quantifying all LV indices including two areas, six regional wall thicknesses (RWT), three LV dimensions, and one phase (as shown in Fig. 1), provides more detailed information for comprehensive cardiac function assessment and is even more challenging, since the uncertain relatedness within and between the types of indices may hinder the learning procedure from better convergence and generalization. In this work, we propose a newly-designed deep multitask learning network, FullLVNet, for full quantification of the LV respecting both intra- and inter-task relatedness. In clinical practice, obtaining reliable quantification relies on measurements of the segmented myocardium, which is usually obtained by manually contouring the borders of the myocardium [13] or by manual correction of contours [3,7] generated by LV segmentation algorithms [9]. However, manual contouring is time-consuming, has high inter-observer variability, and is typically limited to the end-diastolic (ED) and end-systolic (ES) frames, which makes it insufficient for dynamic function analysis. LV segmentation, despite recent advances, is still a difficult problem due to the lack of edge information and the presence of shape variability. Most existing segmentation methods for cardiac MR images [4,9,10] require strong prior information and user interaction to obtain reliable results, which may prevent them from efficient clinical application. In recent years, direct methods without segmentation have grown in popularity for cardiac volume estimation [1,2,14,17–20]. Although these methods achieved effective performance by leveraging state-of-the-art machine learning techniques, they suffer from the following limitations. (1) Lack of powerful task-aware representation: vulnerable hand-crafted or task-unaware features are not capable of capturing sufficient task-relevant cardiac structures. (2) Lack of temporal modeling: independently handling each frame without assistance from its neighbors cannot guarantee consistency and accuracy. (3) Not end-to-end learning: the separately learned representation and regression models cannot be optimal for each other. (4) Not full quantification: cardiac volume alone is not sufficient for comprehensive global, regional and dynamic function assessment.


In this paper, we propose a newly-designed multitask learning network (FullLVNet), which is constituted by a specially tailored deep CNN for expressive feature embedding, two parallel RNN modules that follow for temporal dynamic modeling, and four linear models for the final estimation. During the final estimation, FullLVNet is capable of improving generalization by (1) modeling intra-task relatedness through group lasso regularization within each regression task, and (2) modeling inter-task relatedness with three phase-guided constraints that penalize violation of the temporal behavior of the LV indices. After being trained with a two-step strategy, FullLVNet is capable of delivering accurate results for all the considered indices of the cardiac LV.

2 Multitask Learning for Full Quantification of Cardiac LV

The proposed FullLVNet models full quantification of the cardiac LV as a multitask learning problem. Three regression tasks {y_area^{s,f}, y_dim^{s,f}, y_rwt^{s,f}} and one classification task y_phase^{s,f} are simultaneously learned to predict frame-wise values of the above-mentioned LV indices from cardiac MR sequences X = {X^{s,f}}, where s = 1···S indexes the subject and f = 1···F indexes the frame. The objective of FullLVNet is:

W_optimal = min_W (1/(S×F)) ∑_t ∑_{s,f} L_t(ŷ_t^{s,f}(X^{s,f}|W), y_t^{s,f}) + λ R(W)    (1)

where t ∈ {area, dim, rwt, phase} denotes a specific task, ŷ_t is the estimated result for task t, L_t is the loss function of task t, and R(W) denotes the regularization of the parameters of the network.
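A minimal PyTorch-style sketch of the objective in Eq. (1) is given below; the per-task losses follow Eq. (3) introduced later, and the dictionary-based interface, function names and tensors are our own illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of the multitask objective in Eq. (1).
import torch
import torch.nn.functional as F

def fulllvnet_objective(preds, targets, reg_term, lam=1.0):
    """preds/targets: dicts keyed by task with frame-wise predictions/labels."""
    loss = 0.0
    for t in ("area", "dim", "rwt"):                              # regression tasks
        loss = loss + 0.5 * F.mse_loss(preds[t], targets[t], reduction="mean")
    loss = loss + F.cross_entropy(preds["phase"], targets["phase"])  # phase task
    return loss + lam * reg_term                                  # lam ~ lambda in Eq. (1)
```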

Fig. 2. Overview of FullLVNet, which combines a deep CNN (details shown on the left) for feature embedding, two RNN modules for temporal dynamic modeling, and four linear models for the final estimation. Intra- and inter-task relatedness are modeled in the final estimation to improve generalization.

2.1 Architectures of FullLVNet

Figure 2 shows an overview of FullLVNet. A deep CNN is first designed to extract expressive and task-aware features from the cardiac images, which are then fed to the RNN modules for temporal dynamic modeling. The final estimations are given by four linear models that take the outputs of the RNN modules as input. To improve the generalization of FullLVNet, both intra- and inter-task relatedness are carefully modeled through group lasso and phase-guided constraints for the linear models.

CNN for deep feature embedding. To obtain expressive and task-aware features, we design a specially tailored deep CNN for cardiac images, as shown in the left of Fig. 2. Powerful representations can be obtained by transfer learning [12] from well-known deep architectures in computer vision for applications with limited labeled data. However, transfer learning may incur (1) a restriction of the network architecture, resulting in an incompatible or redundant model; and (2) a restriction of the input channels and dimensions, requiring image resizing and channel expansion. We reduce the number of filters of each layer to avoid model redundancy. As for the kernel size of convolution and pooling, 5 × 5, instead of the frequently used 3 × 3, is deployed to introduce more shift invariance. Dropout and batch normalization are adopted to ease the training procedure. As can be seen in our experiments, our CNN is very effective for cardiac images even without transfer learning. As a feature embedding network, our CNN maps each cardiac image X^{s,f} into a fixed-length, low-dimensional vector e^{s,f} = f_cnn(X^{s,f}|w_cnn) ∈ R^100.

RNNs for temporal dynamic modeling. Accurate modeling of cardiac temporal dynamics assists the quantification of the current frame with information from its neighbors. RNNs, especially when LSTM units [5] are deployed, are specialized in temporal dynamic modeling and have been employed for cardiac image segmentation [11] and key frame recognition [8] in cardiac sequences. In this work, two RNN modules, shown as the green and yellow blocks in Fig. 2, are deployed for the regression tasks and the classification task. For the three regression tasks, the indices to be estimated are mainly related to the spatial structure of the cardiac LV in each frame. For the classification task, the cardiac phase is mainly related to the structural difference between successive frames. Therefore, the two RNN modules are designed to capture these two kinds of dependencies. The outputs of the RNN modules are {h_m^{s,1}, ..., h_m^{s,F}} = f_rnn([e^{s,1}, ..., e^{s,F}]|w_m), m ∈ {rnn1, rnn2}.

Final estimation. With the outputs of the RNN modules, all the LV indices can be estimated with a linear regression/classification model:

ŷ_t^{s,f} = w_t h_rnn1^{s,f} + b_t,  where t ∈ {area, dim, rwt}
p(ŷ_t^{s,f} = 0) = 1 / (1 + exp(w_t h_rnn2^{s,f} + b_t)),  t = phase    (2)

where w_t and b_t are the weight and bias terms of the linear model for task t, ŷ_phase^{s,f} = 0 and 1 denote the two cardiac phases, Diastole and Systole, and p(ŷ_phase^{s,f} = 1) = 1 − p(ŷ_phase^{s,f} = 0). For the loss functions in (1), the Euclidean distance and


cross-entropy are employed for the regression tasks and the classification task, respectively:

L_t = (1/2) ‖ŷ_t^{s,f} − y_t^{s,f}‖_2^2,  for t ∈ {area, dim, rwt}
L_t = −log p(ŷ_t^{s,f} = y_t^{s,f}),  for t = phase    (3)

2.2 Intra-task and Inter-task Relatedness

Significant correlations exist between the multiple outputs of each task and those of different tasks; these are referred to as intra- and inter-task relatedness. Intra-task relatedness can be effectively modeled by the well-known group lasso regularization, while inter-task relatedness is modeled by three phase-guided constraints. Improved generalization can be achieved when both of them are fully leveraged in our FullLVNet.

Intra-task relatedness based on group lasso. Group lasso, also known as L1/L2 regularization, can effectively model relatedness within groups of outputs, i.e., within each of the three regression tasks. It enforces common feature selection across related outputs with the L2 norm, and encourages sparse selection of the most related features for each task with the L1 norm. In this way, the relevant features of the different tasks can be well disentangled. To leverage this advantage, group lasso is applied to the weight parameters of the three regression models in (2):

R_intra = ∑_t ∑_i ‖w_t(i)‖_2,  for t ∈ {area, dim, rwt}    (4)

where w_t(i) denotes the i-th column of w_t.

Inter-task relatedness based on phase-guided constraints. Three phase-guided constraints are proposed to model inter-task relatedness, i.e., the relatedness between the cardiac phase and the other LV indices. The cardiac phase indicates the temporal dynamics of the LV myocardium in a cardiac cycle, and the other LV indices change accordingly with the cardiac phase: (1) cavity area and LV dimensions increase in the diastole phase and decrease in the systole phase; (2) myocardium area and RWT decrease in the diastole phase and increase in the systole phase. Effectively modeling such intrinsic phase-guided relatedness ensures that the estimated LV indices are consistent with the temporal dynamics of the LV. To penalize violation of these inter-task relatednesses, three phase-guided constraints are applied to the predicted results of areas, dimensions and RWT:

R_inter^area = (1/(2S×F)) ∑_{s,f} [1(y_phase^{s,f} = 0)(max(−z_area^{s,f,1}, 0) + max(z_area^{s,f,2}, 0)) + 1(y_phase^{s,f} = 1)(max(z_area^{s,f,1}, 0) + max(−z_area^{s,f,2}, 0))]    (5)

R_inter^dim = (1/(S×F)) ∑_{s,f} [1(y_phase^{s,f} = 0) max(−z̄_dim^{s,f}, 0) + 1(y_phase^{s,f} = 1) max(z̄_dim^{s,f}, 0)]    (6)

R_inter^rwt = (1/(S×F)) ∑_{s,f} [1(y_phase^{s,f} = 0) max(z̄_rwt^{s,f}, 0) + 1(y_phase^{s,f} = 1) max(−z̄_rwt^{s,f}, 0)]    (7)

where 1(·) is the indicator function, z_t^{s,f} = ŷ_t^{s,f} − ŷ_t^{s,f−1} for t ∈ {area, dim, rwt}, z_t^{s,f,i} denotes the i-th output of z_t, and z̄_t denotes the average value of z_t across its multiple outputs. In total, our regularization term becomes

R(W) = λ_1 R_intra + λ_2 (R_inter^area + R_inter^dim + R_inter^rwt)    (8)
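The regularizers of Eqs. (4)-(8) could be implemented as in the following sketch; the tensor layout, the handling of one index sequence at a time, and all names are our assumptions for illustration.

```python
# Sketch of the group-lasso and phase-guided regularizers (Eqs. 4-8).
import torch

def group_lasso(weights):                                     # Eq. (4)
    """weights: dict of weight matrices for the area/dim/rwt heads."""
    return sum(w.norm(p=2, dim=0).sum() for w in weights.values())

def phase_guided(z, phase, increases_in_diastole=True):       # Eqs. (5)-(7), simplified
    """Penalize index changes that contradict the labeled cardiac phase.
    z: (frames,) differences y_hat[f] - y_hat[f-1]; phase: 0=diastole, 1=systole."""
    diastole = (phase == 0).float()
    systole = (phase == 1).float()
    if increases_in_diastole:        # cavity area, LV dimensions
        viol = diastole * torch.clamp(-z, min=0) + systole * torch.clamp(z, min=0)
    else:                            # myocardium area, RWT
        viol = diastole * torch.clamp(z, min=0) + systole * torch.clamp(-z, min=0)
    return viol.mean()
```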

3 Dataset and Configurations

Our FullLVNet is validated with short-axis cardiac MR images of 145 subjects. The temporal resolution is 20 frames per cardiac cycle, resulting in a total of 2900 images in the dataset. The pixel spacings range from 0.6836 mm/pixel to 2.0833 mm/pixel, with a mode of 1.5625 mm/pixel. The ground truth values are computed from manually obtained contours of the LV myocardium. Within each subject, frames are labeled as either the diastole or the systole phase, according to the obtained values of the cavity area. In our experiments, two landmarks, i.e., the junctions of the right ventricular wall with the left ventricle, are manually marked for each image to provide a reference for ROI cropping and for the division of the LV myocardial segments. The cropped images are resized to 80 × 80. The network is implemented with Caffe using the SGD solver. Five-fold cross validation is employed for performance evaluation and comparison. Data augmentation is conducted by randomly cropping images of size 75 × 75 from the resized image.

Two-step training strategy. We apply a two-step strategy for training our network to alleviate the difficulties caused by the different learning rates and loss functions in multitask learning [15,16]. First, the CNN embedding, the first RNN module and the three regression models are learned together with no back-propagation from the classification task, to obtain accurate predictions for the regression tasks; with the obtained CNN embedding, the second RNN module and the linear classification model are then learned while the rest of the network is kept frozen. As shown in the experiments, such a strategy delivers excellent performance for all the considered tasks.
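A minimal sketch of the two-step strategy in a PyTorch-like setting is given below; the modules, their sizes and the optimizer hyper-parameters are placeholders (the real architecture and training settings are those described above).

```python
# Sketch of the two-step training strategy: freeze the already-trained parts
# and train only the phase branch in step 2. All modules are placeholders.
import torch
import torch.nn as nn

def freeze(module):
    for p in module.parameters():
        p.requires_grad = False

# Placeholder stand-ins for the CNN embedding, the two RNNs and the heads.
cnn, rnn1, rnn2 = nn.Linear(80 * 80, 100), nn.LSTM(100, 100), nn.LSTM(100, 100)
reg_heads, phase_head = nn.Linear(100, 11), nn.Linear(100, 2)

# Step 1: train cnn + rnn1 + reg_heads (no gradients from the phase task).
# Step 2: freeze them, then train rnn2 + phase_head only.
for m in (cnn, rnn1, reg_heads):
    freeze(m)
trainable = list(rnn2.parameters()) + list(phase_head.parameters())
optimizer = torch.optim.SGD(trainable, lr=0.01, momentum=0.9)  # placeholder values
```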

4 Results and Analysis

FullLVNet is extensively validated under different configurations in Table 1. From the last column, we can see that FullLVNet successfully delivers accurate predictions for all the considered indices, with average Mean Absolute Errors (MAE) of 1.41 ± 0.72 mm, 2.68 ± 1.64 mm and 190 ± 128 mm² for RWT, dimensions and areas, respectively. For reference, the maximums of these indices in our dataset are 24.4 mm, 81.0 mm and 4936 mm². The error rate (1 − accuracy) for phase identification is 10.4%. Besides, the effectiveness of intra- and inter-task relatedness is also demonstrated by the results in the third and fourth columns: intra-task


Table 1. Performance of FullLVNet under different configurations (e.g., intra/N means only intra-task relatedness is included) and of its competitor for LV quantification. Mean Absolute Error (MAE) is used for the three regression tasks and the prediction error rate is used for the phase identification task.

Metric         |            | Multi-features [18] | FullLVNet (N/N) | FullLVNet (intra/N) | FullLVNet (intra/inter)
RWT (mm)       | IS         | 1.70 ± 1.47 | 1.42 ± 1.21 | 1.39 ± 1.10 | 1.32 ± 1.09
               | I          | 1.71 ± 1.34 | 1.53 ± 1.25 | 1.48 ± 1.16 | 1.38 ± 1.10
               | IL         | 1.97 ± 1.54 | 1.74 ± 1.43 | 1.61 ± 1.29 | 1.57 ± 1.35
               | AL         | 1.82 ± 1.41 | 1.59 ± 1.31 | 1.53 ± 1.06 | 1.60 ± 1.36
               | A          | 1.55 ± 1.33 | 1.36 ± 1.17 | 1.32 ± 1.06 | 1.34 ± 1.11
               | AS         | 1.68 ± 1.43 | 1.43 ± 1.24 | 1.37 ± 1.10 | 1.26 ± 1.10
               | Average    | 1.73 ± 0.97 | 1.51 ± 0.81 | 1.45 ± 0.69 | 1.41 ± 0.72
Dimension (mm) | dim1       | 3.53 ± 2.77 | 2.87 ± 2.23 | 2.69 ± 2.05 | 2.62 ± 2.09
               | dim2       | 3.49 ± 2.87 | 2.96 ± 2.35 | 2.67 ± 2.15 | 2.64 ± 2.12
               | dim3       | 3.91 ± 3.23 | 2.92 ± 2.48 | 2.70 ± 2.22 | 2.77 ± 2.22
               | Average    | 3.64 ± 2.61 | 2.92 ± 1.89 | 2.69 ± 1.67 | 2.68 ± 1.64
Area (mm²)     | cavity     | 231 ± 193   | 205 ± 182   | 182 ± 152   | 181 ± 155
               | myocardium | 291 ± 246   | 204 ± 195   | 205 ± 168   | 199 ± 174
               | Average    | 261 ± 165   | 205 ± 145   | 194 ± 125   | 190 ± 128
Phase (%)      | phase      | 22.2        | 13.0        | 11.4        | 10.4

relatedness brings clear improvements for all the tasks, while inter-task relatedness further brings a moderate improvement. Compared to the recent direct multi-feature based method [18], which we adapt to our full quantification task, FullLVNet shows remarkable advantages even without intra- and inter-task relatedness.

5 Conclusions

We propose a multitask learning network, FullLVNet, for full quantification of the LV, which includes three regression tasks and one classification task. By taking advantage of expressive feature embeddings from a deep CNN and effective temporal dynamic modeling from RNNs, and by leveraging intra- and inter-task relatedness with group lasso regularization and phase-guided constraints, FullLVNet is capable of delivering state-of-the-art accuracy for all the considered tasks.


References
1. Afshin, M., Ayed, I.B., Islam, A., Goela, A., Peters, T.M., Li, S.: Global assessment of cardiac function using image statistics in MRI. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7511, pp. 535–543. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33418-4_66
2. Afshin, M., Ben Ayed, I., Punithakumar, K., Law, M., Islam, A., Goela, A., Peters, T.M., Li, S.: Regional assessment of cardiac left ventricular myocardial function via MRI statistical features. IEEE TMI 33(2), 481–494 (2014)
3. Attili, A.K., Schuster, A., Nagel, E., Reiber, J.H., van der Geest, R.J.: Quantification in cardiac MRI: advances in image acquisition and processing. Int. J. Cardiovasc. Imaging 26(1), 27–40 (2010)
4. Ayed, I.B., Chen, H.M., Punithakumar, K., Ross, I., Li, S.: Max-flow segmentation of the left ventricle by recovering subject-specific distributions via a bound of the Bhattacharyya measure. Med. Image Anal. 16(1), 87–100 (2012)
5. Graves, A.: Supervised sequence labelling. In: Supervised Sequence Labelling with Recurrent Neural Networks. SCI, vol. 385, pp. 5–13. Springer, Heidelberg (2012)
6. Karamitsos, T.D., Francis, J.M., Myerson, S., Selvanayagam, J.B., Neubauer, S.: The role of cardiovascular magnetic resonance imaging in heart failure. J. Am. Coll. Cardiol. 54(15), 1407–1424 (2009)
7. Kawel-Boehm, N., Maceira, A., Valsangiacomo-Buechel, E.R., Vogel-Claussen, J., Turkbey, E.B., Williams, R., Plein, S., Tee, M., Eng, J., Bluemke, D.A.: Normal values for cardiovascular magnetic resonance in adults and children. J. Cardiovasc. Magn. Reson. 17(1), 29 (2015)
8. Kong, B., Zhan, Y., Shin, M., Denny, T., Zhang, S.: Recognizing end-diastole and end-systole frames via deep temporal regression network. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9902, pp. 264–272. Springer, Cham (2016). doi:10.1007/978-3-319-46726-9_31
9. Peng, P., Lekadir, K., Gooya, A., Shao, L., Petersen, S.E., Frangi, A.F.: A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging. Magn. Reson. Mater. Phys., Biol. Med. 29(2), 155–195 (2016)
10. Petitjean, C., Dacher, J.N.: A review of segmentation methods in short axis cardiac MR images. Med. Image Anal. 15(2), 169–184 (2011)
11. Poudel, R.P., Lamata, P., Montana, G.: Recurrent fully convolutional neural networks for multi-slice MRI cardiac segmentation. arXiv:1608.03974 (2016)
12. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE TMI 35(5), 1285–1298 (2016)
13. Suinesiaputra, A., Bluemke, D.A., Cowan, B.R., Friedrich, M.G., Kramer, C.M., Kwong, R., Plein, S., Schulz-Menger, J., Westenberg, J.J., Young, A.A., et al.: Quantification of LV function and mass by cardiovascular magnetic resonance: multi-center variability and consensus contours. J. Cardiovasc. Magn. Reson. 17(1), 63 (2015)
14. Wang, Z., Ben Salah, M., Gu, B., Islam, A., Goela, A., Li, S.: Direct estimation of cardiac biventricular volumes with an adapted Bayesian formulation. IEEE TBE 61(4), 1251–1260 (2014)
15. Zhang, Y., Yeung, D.Y.: A convex formulation for learning task relationships in multi-task learning. In: UAI, pp. 733–742 (2010)


16. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 94–108. Springer, Cham (2014). doi:10.1007/978-3-319-10599-4_7
17. Zhen, X., Islam, A., Bhaduri, M., Chan, I., Li, S.: Direct and simultaneous four-chamber volume estimation by multi-output regression. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 669–676. Springer, Cham (2015). doi:10.1007/978-3-319-24553-9_82
18. Zhen, X., Wang, Z., Islam, A., Bhaduri, M., Chan, I., Li, S.: Direct estimation of cardiac bi-ventricular volumes with regression forests. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8674, pp. 586–593. Springer, Cham (2014). doi:10.1007/978-3-319-10470-6_73
19. Zhen, X., Wang, Z., Islam, A., Bhaduri, M., Chan, I., Li, S.: Multi-scale deep networks and regression forests for direct bi-ventricular volume estimation. Med. Image Anal. 30, 120–129 (2016)
20. Zhen, X., Zhang, H., Islam, A., Bhaduri, M., Chan, I., Li, S.: Direct and simultaneous estimation of cardiac four chamber volumes by multioutput sparse regression. Med. Image Anal. 36, 184–196 (2017)

Scalable Multimodal Convolutional Networks for Brain Tumour Segmentation

Lucas Fidon1(B), Wenqi Li1, Luis C. Garcia-Peraza-Herrera1, Jinendra Ekanayake2,3, Neil Kitchen2, Sebastien Ourselin1,3, and Tom Vercauteren1,3

1 TIG, CMIC, University College London, London, UK
[email protected]
2 NHNN, University College London Hospitals, London, UK
3 Wellcome/EPSRC Centre for Surgical and Interventional Science, UCL, London, UK

Abstract. Brain tumour segmentation plays a key role in computer-assisted surgery. Deep neural networks have increased the accuracy of automatic segmentation significantly; however, these models tend to generalise poorly to imaging modalities different from those for which they have been designed, thereby limiting their applications. For example, a network architecture initially designed for brain parcellation of monomodal T1 MRI cannot be easily translated into an efficient tumour segmentation network that jointly utilises T1, T1c, Flair and T2 MRI. To tackle this, we propose a novel scalable multimodal deep learning architecture using new nested structures that explicitly leverage deep features within or across modalities. This aims at making the early layers of the architecture structured and sparse so that the final architecture becomes scalable to the number of modalities. We evaluate the scalable architecture for brain tumour segmentation and give evidence of its regularisation effect compared to the conventional concatenation approach.

1 Introduction

Gliomas make up 80% of all malignant brain tumours. Tumour-related tissue changes can be captured by various MR modalities, including T1, T1-contrast, T2, and Fluid Attenuation Inversion Recovery (FLAIR). Automatic segmentation of gliomas from MR images is an active field of research that promises to speed up diagnosis, surgery planning, and follow-up evaluations. Deep Convolutional Neural Networks (CNNs) have recently achieved state-of-the-art results on this task [1,2,6,12]. Their success is partly attributed to their ability to automatically learn hierarchical visual features, as opposed to conventional hand-crafted feature extraction. Most of the existing multimodal network architectures handle imaging modalities by concatenating the intensities as an input. The multimodal information is implicitly fused by training the network discriminatively. Experiments show that relying on multiple MR modalities consistently


is key to achieving highly accurate segmentations [3,9]. However, using classical modality concatenation to turn a given monomodal architecture into a multimodal CNN does not scale well, because it either requires dramatically augmenting the number of hidden channels and network parameters, or imposes a bottleneck on at least one of the network layers. This lack of scalability requires the design of dedicated multimodal architectures and makes it difficult and time-consuming to adapt state-of-the-art network architectures. Recently, Havaei et al. [3] proposed a hetero-modal network architecture (HeMIS) that learns to embed the different modalities into a common latent space. Their work suggests that it is possible to impose more structure on the network. HeMIS separates the CNN into a backend that encodes modality-specific features up to the common latent space, and a frontend that uses high-level modality-agnostic feature abstractions. HeMIS is able to deal with missing modalities and shows promising segmentation results. However, the authors do not study the adaptation of existing networks to additional imaging modalities and do not demonstrate an optimal fusion of information across modalities. We propose a scalable network framework (ScaleNets) that enables efficient refinement of an existing architecture to adapt it to an arbitrary number of MR modalities, instead of building a new architecture from scratch. ScaleNets are CNNs split into a backend and a frontend, with across-modality information flowing through the backend, thereby alleviating the need for a one-shot latent space merging. The proposed scalable backend takes advantage of a factorisation of the feature space into imaging modalities (M-space) and modality-conditioned features (F-space). By explicitly using this factorisation, we impose sparsity on the network structure with demonstrated improved generalisation. We evaluate our framework by starting from a high-resolution network initially designed for brain parcellation from T1 MRI [8] and readily adapting it to brain tumour segmentation from T1, T1c, Flair and T2 MRI. Finally, we explore the design of the modality-dependent backend by comparing several important factors, including the number of modality-dependent layers, the merging function, and the convolutional kernel sizes. Our experiments show that the proposed networks are more efficient and scalable than the conventional CNNs and achieve competitive segmentation results on the BraTS 2013 challenge dataset.

2 Structural Transformations Across Features/Modalities

Concatenating multimodal images as input is the simplest and most common approach in CNN-based segmentation [2,6]. We emphasise that the complete feature space FM can be factorised into an M-feature space M derived from the imaging modalities, and an F-feature space F derived from the scan intensities. However, the concatenation strategy does not take advantage of this factorisation. We propose to impose structural constraints that make it explicit. Let V ⊂ R^3 be a discrete volume domain, and F (resp. M) be a finite F-feature (resp. M-feature) domain; the set of feature maps associated to
(V, F, M) is defined as G(V × F × M) = {x : V × F × M → R}. This factorisation allows us to introduce new scalable layers that perform the transformation f̃ of the joint FM feature space in two steps, as in (1), where f (resp. g) typically uses convolutions across F-features (resp. across M-features):

f : G(V × F × M) → G(V × F′ × M),   g : G(V × F′ × M) → G(V × F′ × M′),   f̃ = g ∘ f   (1)

The proposed layer architecture, illustrated in Fig. 1, offers several advantages compared to classic ones: (1) cross F-feature layers remain to some extent independent of the number of modalities; (2) cross M-feature layers allow the different modality branches to share complementary information; (3) the total number of parameters is reduced. The HeMIS architecture [3], where one branch per modality is maintained until averaging merges the branches, is a special case of our framework in which the cross M-feature transformations g are identity mappings.

Fig. 1. (a) The proposed scalable multimodal layer. (b) A classic CNN layer with multimodal images concatenated as input. Volumes are represented as slices; the colours correspond to the F-features (F_1, ..., F_p) and (M_1, ..., M_n) correspond to the M-features. In (a), transformations across F-features f and across M-features g are explicitly separated (as illustrated by the rooted structure), while in (b) they are implicitly both applied in f̂. The ratio of the number of parameters in (a) compared to (b) is (p + n)/(p × n).

Another important component of the proposed framework is the merging layer. It aims at recombining the F-feature space and the M-feature space, either by concatenating them or by applying a downsampling/pooling operation (averaging, maxout) on the M-feature space to reduce its dimension to one:

concat : G(V × F × M) → G(V × FM),        pooling : G(V × F × M) → G(V × F × {1})

As opposed to concatenation, relying on averaging or maxout for the merging layer at the interface between a backend and frontend makes the frontend
structurally independent of the number of modalities and more generally of the entire backend. The proposed ScaleNets rely on such merging strategies to offer scalability in the network design.
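As a rough illustration of the factorised layer and merging strategy described above, the following sketch applies a cross F-feature convolution to each modality branch (f), mixes the branches with a cross M-feature transformation (g), and merges by averaging. It is written with TensorFlow/Keras purely for illustration and is not the authors' NiftyNet implementation; tensor shapes, filter counts and the choice of per-branch (non-shared) convolution weights are assumptions.

```python
import tensorflow as tf

def scalable_layer(branches, n_f_out, n_m_out):
    """branches: list of M tensors of shape (batch, x, y, z, F)."""
    # f: cross F-feature 3x3x3 convolution, applied to each modality branch
    # separately (the rooted structure of Fig. 1a); each branch gets its own weights here.
    branches = [tf.keras.layers.Conv3D(n_f_out, 3, padding='same',
                                       activation='relu')(b) for b in branches]
    # g: cross M-feature transformation, a per-voxel, per-F-feature linear mixing
    # of the modality branches (equivalent to a 1x1x1 convolution over the M axis).
    stacked = tf.stack(branches, axis=-1)               # (batch, x, y, z, F', M)
    mixed = tf.keras.layers.Dense(n_m_out, use_bias=False)(stacked)
    return tf.unstack(mixed, axis=-1)                   # list of M' branches

def average_merge(branches):
    # Averaging merge: the frontend no longer depends on the number of branches.
    return tf.add_n(branches) / float(len(branches))
```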

3 ScaleNets Implementation

The modularity of the proposed feature factorisation raises several questions: (1) Is the representative power of scalable F/M-structured multimodal CNNs the same as that of classic ones? (2) What are the important parameters in the trade-off between accuracy and complexity? (3) How can this modularity help readily transform existing architectures into scalable multimodal ones?

To demonstrate that our scalable framework can provide a deep network with the flexibility of being efficiently reused for different sets of image modalities, we adapt a model originally built for brain parcellation from T1 MRI [8]. As illustrated in Fig. 2, the proposed ScaleNets split the network into two parts: (i) a backend and (ii) a frontend. In the following experiments, we explore different backend architectures that allow scaling the monomodal network into a multimodal network. We also add a merging operation that allows plugging any backend into the frontend and makes the frontend independent of the number of modalities used. As a result, the frontend is the same for all our architectures. To readily adapt the backend from the monomodal network architecture [8], we duplicate the layers to obtain the cross F-feature transformations (one branch per M-feature) and add a cross M-feature transformation after each of them (one branch per F-feature), as shown in Fig. 2.

Fig. 2. Scalable and Classic CNN architectures. Numbers in bold indicate the number of branches; the others correspond to the number of features.


In the frontend, only the number of outputs of the last layer is changed to match the number of classes of the new task. The proposed scalable models (SN31Ave1, SN31Ave2, SN31Ave3, SN33Ave2, SN31Max2) are named consistently. For example, SN31Ave2 stands for "ScaleNet with 2 cross M-feature residual blocks with 3^3 convolutions and 1^3 convolutions before averaging" and corresponds to model (a) of Fig. 2.

Baseline Monomodal Architecture. The baseline architecture used for our experiments is a high-resolution, compact network designed for volumetric image segmentation [8]. It has been shown to reach state-of-the-art results for brain parcellation of T1 scans. This fully convolutional neural network provides an end-to-end mapping from a monomodal image volume to a voxel-level segmentation map, mainly with convolutional blocks and residual connections. It also takes advantage of dilated convolutions to incorporate image features at multiple scales while maintaining the spatial resolution of the input images. The maximum receptive field is 87 × 87 × 87 voxels and the network is therefore able to capture multi-scale information in a single path. By learning the variation between successive feature maps, the residual connections allow the cross M-feature transformations to be initialised close to identity mappings. This encourages information sharing across the modalities without changing their nature.

Brain Tumour Segmentation. We compare the different models on the task of brain tumour segmentation using the BraTS'15 training set, which is composed of 274 multimodal images (T1, T1c, T2 and FLAIR). We divide it into 80% for training, 10% for validation and 10% for testing. Additionally, we evaluate one of our scalable network models on the BraTS'13 challenge dataset, for which an online evaluation platform is available (https://www.virtualskeleton.ch/BraTS/), to compare it to the state-of-the-art (all models were nevertheless trained on BraTS'15).

Implementation Details. We maximise the soft Dice score as proposed by [10]. We train all the networks with the Adam optimization method [7] with a learning rate lr = 0.01, β1 = 0.9 and β2 = 0.999. We also use early stopping on the validation set. Rotations by small random angles in the range [−10°, 10°] are applied along each axis during training. All the scans of the BraTS dataset are available after skull stripping, resampling to a 1 mm isotropic grid and co-registration of all the modalities to the T1-weighted images for each patient. Additionally, we applied the histogram-based standardisation method of [11]. The experiments were performed using NiftyNet (our implementation of the ScaleNets and the other CNNs used for comparison can be found at http://www.niftynet.io) and one Nvidia GTX Titan GPU.

Evaluation of Segmentation Performance. Results are evaluated using the Dice score of different tumour subparts: whole tumour, core tumour and enhanced tumour [9]. Additionally, we introduce a healthy tissue class to separate it from the background (zeroed out in the BraTS dataset).
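For concreteness, a minimal sketch of the soft Dice objective of [10] and of the optimiser configuration quoted above is given below. It assumes one-hot ground truth and softmax probabilities, uses the squared-denominator formulation of [10], and is written with TensorFlow for illustration only; it is not taken from the authors' NiftyNet code.

```python
import tensorflow as tf

def soft_dice_loss(probs, one_hot_labels, eps=1e-5):
    """probs, one_hot_labels: tensors of shape (batch, x, y, z, n_classes)."""
    axes = [1, 2, 3]  # sum over the spatial dimensions
    intersect = tf.reduce_sum(probs * one_hot_labels, axis=axes)
    # Squared denominator, following the V-Net formulation of [10].
    denom = (tf.reduce_sum(tf.square(probs), axis=axes)
             + tf.reduce_sum(tf.square(one_hot_labels), axis=axes))
    dice = (2.0 * intersect + eps) / (denom + eps)      # per-class, per-sample Dice
    return 1.0 - tf.reduce_mean(dice)                   # maximising Dice = minimising loss

# Optimiser settings quoted in the text.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999)
```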

4 Experiments and Results

To demonstrate the usefulness of our framework, we compare two basic ScaleNets and a classic CNN. Table 1 highlights the benefits of ScaleNets in terms of number of parameters. We also explore combinations of the important architectural factors to address some key practical questions: How deep do the cross-modality layers have to be? When should we merge the different branches? Which merging operation should we use? Wilcoxon signed-rank p-values are reported to highlight significant improvements.

Table 1. Comparison of ScaleNets and Classic concatenation-based CNNs for model adaptation on the testing set. Dice scores are given as Mean(Std) in %.

Method        # Param.   Whole tumour   Core tumour   Active tumour
SN31Ave1      0.83M      87(8)          73(22)        72(26)
SN31Ave2      0.85M      87(7)          71(19)        70(28)
SN31Ave3      0.88M      88(6)          69(17)        71(27)
SN31Max2      0.85M      85(9)          67(17)        71(28)
SN33Ave2      0.92M      87(7)          70(18)        67(27)
HeMIS-like    0.89M      86(12)         70(20)        69(28)
Classic CNN   1.15M      81(18)         64(28)        65(28)
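The Wilcoxon signed-rank comparisons reported in this section can be computed, for instance, with SciPy; the per-case Dice values below are toy numbers used purely for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-case whole-tumour Dice scores of two models on the same test cases (toy values).
dice_scalenet = np.array([0.88, 0.90, 0.85, 0.87, 0.91, 0.86])
dice_classic = np.array([0.80, 0.84, 0.79, 0.83, 0.85, 0.82])
statistic, p_value = wilcoxon(dice_scalenet, dice_classic)
print(p_value)
```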

ScaleNet with Basic Merging and Classic CNN. We compare three merging strategies (averaging: "SN31Ave2", maxout: "SN31Max2", and concatenation: "Classic CNN"). To be as fair as possible, we carefully choose the size of the kernels so that the maximum receptive field remains the same across all architectures. The quantitative Dice score results in Table 1 show that both SN31Ave2 and

Fig. 3. Qualitative comparison of the different models' outputs on a particular testing case. Colours correspond to different tissue regions. Red: necrotic core, yellow: enhancing tumour, blue: non-enhancing tumour, green: edema, cyan: healthy tissues.


SN31Max2 outperform Classic CNN on the segmentation of all tumour regions. SN31Ave2 outperforms SN31Max2 for the core tumour and obtains similar results on the whole tumour and enhanced tumour. We compare ScaleNets with respectively 1, 2 or 3 scalable multimodal layers before averaging (named "SN31Ave1", "SN31Ave2" and "SN31Ave3", respectively). The results reported in Table 1 show similar performance for all of these models. This suggests that a short backend is enough to obtain a sufficient modality-agnostic representation for glioma segmentation using T1, T1c, FLAIR and T2. Furthermore, SN31Ave1 outperforms Classic CNN on all tumour regions (p ≤ 0.001). Qualitative results on a testing case with an artifact deformation (Fig. 3) and the decrease in Dice score standard deviation for the whole and core tumour (Table 1) demonstrate the robustness of ScaleNets compared to classic CNNs and show the regularisation effect of the proposed scalable multimodal layers (Fig. 1).

Comparison to State-of-the-Art. We validate the usefulness of the cross M-feature layers by comparing our proposed network to an implementation of ScaleNets that replicates the characteristics of the HeMIS network [3] by removing the cross M-feature layers. We refer to this network as HeMIS-like. The Dice score results in Table 1 show improved results on the core tumour (p ≤ 0.03) and similar performance on the whole and active tumour. The qualitative comparison in Fig. 3 confirms this trend. We also compare our SN31Ave1 model to the state-of-the-art. The results obtained on the Leaderboard and Challenge BraTS'13 datasets are reported in Table 2 and compared to the BraTS'13 Challenge winners listed in [9]. We achieve similar results with no need for post-processing.

Table 2. Dice score on Leaderboard and Challenge against BraTS'13 winners.

Method      Leaderboard                  Challenge
            Whole   Core   Enhanced     Whole   Core   Enhanced
Tustison    79      65     53           87      78     74
Zhao        79      59     47           84      70     65
Meier       72      60     53           82      73     69
SN31Ave1    77      64     56           88      77     72

5 Conclusions

We have proposed a scalable deep learning framework that allows building more reusable and efficient deep models when multiple correlated sources are available. In the case of volumetric multimodal MRI for brain tumour segmentation, we proposed several scalable CNNs that smoothly integrate the complementary information about tumour tissues scattered across the different image modalities. ScaleNets impose a sparse structure on the backend of the architecture
where cross-feature and cross-modality transformations are separated. It is worth noting that ScaleNets are related to the recently proposed implicit Conditional Networks [5] and Deep Rooted Networks [4], which use sparsely connected architectures but do not suggest the transposition of branches and grouped features. Both of these frameworks have been shown to improve the computational efficiency of state-of-the-art CNNs by reducing the number of parameters and the amount of computation, and by increasing the parallelisation of the convolutions. Using our proposed scalable layer architecture, we readily adapted a compact network for brain parcellation of monomodal T1 into a multimodal network for brain tumour segmentation with 4 different image modalities as input. Scalable structures, thanks to their sparsity, have a regularisation effect. A comparison of classic and scalable CNNs shows that scalable networks are more robust and use fewer parameters while maintaining similar or better accuracy for medical image segmentation. Scalable network structures have the potential to make deep networks for medical images more reusable. We believe that scalable networks will play a key enabling role for efficient transfer learning in volumetric MRI analysis.

Acknowledgements. This work was supported by the Wellcome Trust (WT101957, 203145Z/16/Z, HICF-T4-275, WT 97914), EPSRC (NS/A000027/1, EP/H046410/1, EP/J020990/1, EP/K005278, NS/A000050/1), the NIHR BRC UCLH/UCL, a UCL ORS/GRS Scholarship and a hardware donation from NVidia.

References 1. Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A.: Voxresnet: Deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage (2017). http:// www.sciencedirect.com/science/article/pii/S1053811917303348 2. Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P.M., Larochelle, H.: Brain tumor segmentation with deep neural networks. Med. Image Anal. 35, 18–31 (2017) 3. Havaei, M., Guizard, N., Chapados, N., Bengio, Y.: HeMIS: hetero-modal image segmentation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 469–477. Springer, Cham (2016). doi:10. 1007/978-3-319-46723-8 54 4. Ioannou, Y., Robertson, D., Cipolla, R., Criminisi, A.: Deep roots: Improving CNN efficiency with hierarchical filter groups. In: CVPR 2017 (2017) 5. Ioannou, Y., Robertson, D., Zikic, D., Kontschieder, P., Shotton, J., Brown, M., Criminisi, A.: Decision forests, convolutional networks and the models in-between arXiv:1603.01250 (2016) 6. Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B.: Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017) 7. Kingma, D., Ba, J.: Adam: A method for stochastic optimization arXiv:1412.6980 (2014)


8. Li, W., Wang, G., Fidon, L., Ourselin, S., Cardoso, M.J., Vercauteren, T.: On the compactness, efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext task. In: Niethammer, M., Styner, M., Aylward, S., Zhu, H., Oguz, I., Yap, P.-T., Shen, D. (eds.) IPMI 2017. LNCS, vol. 10265, pp. 348–360. Springer, Cham (2017). doi:10.1007/978-3-319-59050-9 28 9. Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (BraTS). IEEE Trans. Med. Imag. 34(10), 1993– 2024 (2015) 10. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: Proceeding of 3DV 2016, pp. 565– 571 (2016) 11. Nyul, L.G., Udupa, J.K., Zhang, X.: New variants of a method of MRI scale standardization. IEEE Trans. Med. Imag. 19(2), 143–150 (2000) 12. Pereira, S., Pinto, A., Alves, V., Silva, C.A.: Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans. Med. Imag. 35(5), 1240–1251 (2016)

Pathological OCT Retinal Layer Segmentation Using Branch Residual U-Shape Networks

Stefanos Apostolopoulos1(B), Sandro De Zanet2, Carlos Ciller2, Sebastian Wolf3, and Raphael Sznitman1

1 University of Bern, Bern, Switzerland
[email protected]
2 RetinAI Medical GmbH, Bern, Switzerland
3 University Hospital of Bern, Bern, Switzerland

Abstract. The automatic segmentation of retinal layer structures enables clinically-relevant quantification and monitoring of eye disorders over time in OCT imaging. Eyes with late-stage diseases are particularly challenging to segment, as their shape is highly warped due to pathological biomarkers. In this context, we propose a novel fully-Convolutional Neural Network (CNN) architecture which combines dilated residual blocks in an asymmetric U-shape configuration, and can segment multiple layers of highly pathological eyes in one shot. We validate our approach on a dataset of late-stage AMD patients and demonstrate lower computational costs and higher performance compared to other state-of-the-art methods.

1 Introduction

Optical Coherence Tomography (OCT) is a non-invasive medical imaging modality that provides micrometer-resolution volumetric scans of biological tissue [1]. Since its introduction in 1991, OCT has seen widespread use in the field of ophthalmology, as it enables direct, non-invasive imaging of the retinal layers. As shown in Fig. 1, OCT allows for the visualization of both healthy tissue and pathological biomarkers such as drusen, cysts and fluid pockets within and underneath the retinal layers. Critically, these have been linked to diseases such as Age-related Macular Degeneration (AMD), Diabetic Retinopathy (DR) and Central Serous Chorioretinopathy (CSC) [2,3]. Given the widespread occurrence of these diseases, which is estimated at over 300 million people worldwide, medical image analysis methods for OCT imaging have gained popularity in recent years. The automatic segmentation of retinal layer structures is of particular interest as it allows for the quantification, characterization and monitoring of retinal disorders over time. This remains a challenging task, as retinal layers can be heavily distorted in the presence of pathological biomarkers. In this context, the present paper focuses on providing more accurate retinal layer segmentations in pathological eyes, at clinically-relevant speeds.

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-66179-7_34) contains supplementary material, which is available to authorized users.


Fig. 1. Example of OCT cross-sections with retinal layer boundaries highlighted for (left) healthy subject and (right) late-stage AMD patient. The images were manually annotated by an expert ophthalmologist.

A number of relevant methods on this topic can be found in the literature. Mayer et al. [4] propose the use of a series of edge filters and denoising steps to extract layers in OCT cross-sections. In [5], a Markov Random Field (MRF)-based optimization with soft constraints is proposed to segment 7 retinal layers using volumetric information. Chen et al. [6] use a constrained graph-cut approach to segment layers and quantify fluid pockets in pathological OCTs. Overall, most of these methods face difficulties in segmenting all retinal layers accurately for subjects with pathological eyes. To this end, we present a novel strategy to overcome the above limitations and provide accurate results in a wider range of cases. Inspired by recent CNN approaches for semantic segmentation [7] and image classification [8], we introduce a novel CNN architecture that learns to segment retinal layers as a supervised regression problem. Our proposed network combines residual building blocks with dilated convolutions into an asymmetric U-shape configuration, and can segment multiple layers of highly pathological eyes in one shot. Using lower computational resources, our strategy achieves superior segmentation performance compared to both state-of-the-art deep learning architectures and other OCT segmentation methods.

2 Methods

Our goal is to segment retinal cell layers in OCT images. The main challenge in this task stems primarily from the highly variable and irregular shape of pathological eyes, and secondarily from the variable image quality (i.e., signal strength and speckle noise) of clinical OCT scans. Due to the image acquisition process, wherein each cross-section, or Bscan, is acquired separately without a guaranteed global alignment, we opt to segment retinal layers at the Bscan level. This avoids the need for computationally intensive 3-dimensional convolutions [9] and volumetric pre-processing (i.e., registration and alignment).


In our approach, we treat the task of segmenting retinal layers as a regression problem. Given a Bscan image, I, we wish to find a function T : I → L that maps each pixel in I to a label L ∈ {0, 1, 2, 3, 4, 5, 6} corresponding to an anatomical retinal cell layer region. As in [5], we consider the following six retinal layers: (1) Internal Limiting Membrane (ILM) to Nerve Fibre Layer (NFL), (2) NFL to Ganglion Cell Layer (GCL), (3) GCL and Inner Plexiform Layer (IPL), (4) Inner Nuclear Layer (INL) and Outer Plexiform Layer (OPL), (5) OPL to Inner Segment/Outer Segment (IS/OS) Junction and (6) IS/OS Junction to Bruch's Membrane (BM).

2.1 Branch Residual U-Network

Fully convolutional U-net style networks have established themselves as the state-of-the-art for binary segmentation and have been successfully used in a variety of biomedical applications [7]. In such architectures, input images are convolved and downsampled level by level with exponentially increasing numbers of filters up to a predefined depth (descending branch), from which they are subsequently upsampled and convolved back to the original size (ascending branch). Skip connections from corresponding levels transfer information from the descending to the ascending branch.

A number of important limits arise from this architecture. First, the largest possible object that can be segmented is defined by the cumulative receptive field of the network. According to our experiments, a regular U-net with 3 px × 3 px convolutions and a depth of 5 layers [7] will start exhibiting holes when segmenting objects with discontinuities wider than 3 · 2^5 = 96 px. Second, due to the exponential growth of trainable parameters, the maximum depth of such a network is limited to 5–7 layers before the computational demands become intractable. Third, the convergence rate of a U-net tends to decrease as the network grows in depth. We attribute this to the vanishing gradient problem that affects deeper networks. We have designed our network to address each of these problems:

1. We use a building block based on dilated convolutions with dilation rates of {1, 3, 5} to increase the effective receptive field of each network level without increasing the number of trainable parameters. We enhance this block with residual connections [10,11] and batch normalization [12], which are summed together with the dilated convolutions. Depending on the branch direction, each block ends with a max-pooling or upsampling operation. We denote these blocks as BlockD and BlockU, respectively.

2. We insert bottleneck connections between blocks to control the number of trainable parameters [13–15]. Furthermore, we increase the number of filters based on a capped Fibonacci sequence. We chose this sequence after experimenting with zero, constant and quadratic growth, as a good trade-off between network capacity and segmentation performance.

3. Finally, we add connections from the input image, downscaled to the appropriate size, to all levels in the ascending and descending branches.
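A minimal sketch of the BlockD building block described in point 1 is given below, assuming a tf.keras implementation. The exact ordering of the residual sum, batch normalisation and pooling, the 1x1 projection of the shortcut, and the 2D setting are assumptions made for illustration and do not reproduce the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def block_d(x, n_filters):
    # Bottleneck-style 1x1 convolution so the residual path matches the block width.
    shortcut = layers.Conv2D(n_filters, 1, padding='same')(x)
    # Three parallel dilated 3x3 convolutions enlarge the effective receptive field
    # without adding trainable parameters compared to undilated 3x3 kernels.
    dilated = [layers.Conv2D(n_filters, 3, padding='same', dilation_rate=r,
                             activation='relu')(x) for r in (1, 3, 5)]
    summed = layers.add(dilated + [shortcut])                # residual sum
    out = layers.BatchNormalization()(summed)
    return layers.MaxPooling2D(pool_size=2, strides=2)(out)  # descending branch only
```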


Fig. 2. (left) Branch Residual U-Network (BRU-net). The descending branch takes a single Bscan as input and performs consecutive BlockD operations. The ascending branch receives the output of the descending branch and performs consecutive BlockU operations. The numbers indicate the number of filters output by each block. Skip connections connect each descending to each ascending level, while the original Bscan is provided to each level for context. The final output is a regressed layer class for each pixel of the input image. (right) BlockD , BlockU , input and output blocks. The rectangles illustrate computations.

Combined, these result in a significant increase in learning speed and segmentation accuracy and, due to the reduced number of parameters, in processing speed. We name the resulting architecture Branch Residual U-shape Network (BRU-net). The precise architecture and building blocks are illustrated in Fig. 2. Throughout our network, we employ 3 × 3 convolutional kernels with n filters, where n increases according to the Fibonacci sequence {32, 64, 96, 160, 256, 416}, capped to a maximum of 416 filters per level. This avoids the larger growth of parameters encountered in traditional U-networks and allows for deeper networks. More specifically, our network requires 21 million parameters for a depth of 5 levels and grows to 55 million parameters for a depth of 6 levels. The corresponding U-net requires 44 million and 176 million parameters for the same depths, an increase of 2× and 3×, respectively.

2.2 Training

The block layout has been optimized using an evolutionary grid search strategy, by training two variants in parallel and selecting the best performer. To keep training time reasonable, the grid search is performed on a 4× subsampled dataset. This process is repeated 50 times, each one taking up to 30 min. To increase convergence rate and reduce training time, we pre-initialize our network by training it as an autoencoder for 10 epochs, using a small set of 50 OCT volumes of healthy eyes, acquired from the same OCT device. This set is
distinct from the volumes we use for segmentation. To avoid learning the identity function, we disable the skip connections of the network during this process. The output of the network is an image with the same size as the input Bscan. Each pixel of the output image is assigned a value between 0 and 6 which corresponds to the identity of its retinal layer. We train the network to minimize the pixel-wise Mean Square Error (MSE) loss between the predicted segmentation and the ground truth. This loss penalizes anatomically implausible segmentations (e.g. class 6 next to 0) more than plausible ones (e.g. class 1 next to 0). We rely on this asymmetry to ensure segmentation continuity. The network parameters are updated via back-propagation and the Adam optimization process with the infinity norm [16]. Each fold is trained for a maximum of 150 epochs. We start training with an initial learning rate of 10^-3 and reduce it by a factor of 2 if the MSE loss does not improve for 5 consecutive epochs, down to a minimum of 10^-7. We interrupt the training early if the MSE loss stops improving for 25 consecutive epochs. Using a dedicated validation set, comprising 10% of the training set, we evaluate the MSE loss to adaptively set the learning rate and perform early stopping. At the end of the training procedure, we use the network weights of the epoch with the lowest validation loss to evaluate images in the test set.
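The learning-rate schedule, early stopping and best-weight selection described above can be expressed with standard Keras callbacks; the following is a sketch of one possible setup, not the authors' code. The optimiser named in the commented line, Adamax, corresponds to "Adam with the infinity norm" [16].

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adamax

callbacks = [
    # Halve the learning rate when the validation MSE stalls for 5 epochs, down to 1e-7.
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-7),
    # Stop early after 25 epochs without improvement.
    EarlyStopping(monitor='val_loss', patience=25),
    # Keep the weights of the epoch with the lowest validation loss.
    ModelCheckpoint('best_weights.h5', monitor='val_loss', save_best_only=True),
]
# model.compile(optimizer=Adamax(learning_rate=1e-3), loss='mse')
# model.fit(x_train, y_train, validation_split=0.1, epochs=150, callbacks=callbacks)
```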

3 Experimental Results

A trained ophthalmologist collected 20 macular OCT volumes from pathological subjects using a Heidelberg Spectralis OCT device (Heidelberg Engineering AG, Heidelberg, Germany). Each volume comprises 49 Bscans of 512 × 496 pixels, with a lateral (x-y) resolution of 15 µm and an axial (z-) resolution of 3.9 µm. No volumes or Bscans were removed from our initial acquisition, in order to maintain the complete range of image quality observed in the clinic. For each Bscan in each volume, manually segmented ground truth layers were provided by the ophthalmologist. We split our dataset into 5 equally sized subsets, each using 16 patients for training and 4 for testing. We repeat this process for each of those subsets for a 5-fold cross-validation. In each fold, the training set contains 784 training samples (Bscans), which we double to 1568 by flipping horizontally, taking advantage of the bilateral symmetry of the eye. The Bscans are first padded with a black border to a size of 512 × 512 pixels and then augmented with affine transformations, additive noise, Gaussian blur and gamma adjustments. Training is performed on batches of 8 Bscans at a time. Finally, the output image is quantized to integer values (0 to 6) without further post-processing.

To evaluate BRU-net, we compare it with the 3D methods of Dufour et al. [5] and Chen et al. [6], and the 2D method of Mayer et al. [4] on the same dataset. Additionally, we train a traditional U-net configuration [7] using the procedure described above. Figure 3 provides a qualitative comparison of the results. To quantify those results, we make use of two metrics: (1) the Chamfer distance [17] between each ground truth layer boundary and the boundary produced by a given method, and (2) the Dice score of each predicted layer surface.
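The boundary error can be sketched with a simple symmetric nearest-neighbour formulation of the Chamfer distance, shown below. The paper relies on the distance-transform formulation of [17]; this version, including the function name and averaging, is an illustrative approximation only.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(boundary_a, boundary_b):
    """boundary_a, boundary_b: (N, 2) and (M, 2) arrays of boundary pixel coordinates."""
    d_ab, _ = cKDTree(boundary_b).query(boundary_a)   # nearest distances from A to B
    d_ba, _ = cKDTree(boundary_a).query(boundary_b)   # nearest distances from B to A
    return 0.5 * (d_ab.mean() + d_ba.mean())
```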


Fig. 3. Qualitative comparison of each segmentation approach. Top row, left to right: ground truth, BRU-net, U-net; bottom row, left-to-right: Dufour et al., Chen et al., Mayer et al. Only BRU-net is able to segment the BM layer under the pathological region. The smaller receptive field of U-net results in discontinuities. Further qualitative results are provided in the supplementary material.

Fig. 4. Quantitative comparison of segmentation accuracy per layer, using (left) the Chamfer distance error and (right) the Dice score.

Note that BRU-net is not constrained to convex shapes. Since pathological retinal layers may be non-convex, other metrics that rely on pixel distances are ill-suited for this problem. Figure 4 demonstrates the performance of each of the evaluated methods.


Fig. 5. Training loss (left) and validation loss (right) comparison between BRU-net and U-net. BRU-net exhibits faster convergence speed and lower loss compared to U-net.

Figure 5 displays the mean training and validation loss of the 5-folds over time for both BRU-net and U-net. In both the training and validation sets, BRU-net achieves faster convergence and slightly better MSE loss. We evaluated the statistical significance of those results using paired t-tests between BRU-net and each baseline. The resulting p-values indicate statistically significant results between BRU-net and every other baseline except U-net:

p-value       U-net      Dufour et al.   Chen et al.   Mayer et al.
p (Dice)      1.05e–01   2.03e–04        1.08e–05      3.19e–05
p (Chamfer)   1.09e–01   2.70e–04        9.49e–03      3.18e–03

Finally, we estimated the total runtime for each method. To process a single volume, BRU-net requires 5 s (Python), compared to 7 s for U-net (Python), 85 s for Mayer et al. (Matlab), 150 s for Dufour et al. (C++), and 216 s for Chen et al. (C++). The results were calculated on the same system using a 3.9 GHz Intel 6600K processor and an Nvidia GTX 1080 GPU.

4 Conclusion

We have presented a method for performing layer segmentation on OCT scans of highly pathological retinas. Inspired by recent advances in computer vision, we have designed a novel fully-convolutional CNN architecture that can segment multiple layers in one shot. We have compared our method to several baselines and demonstrated qualitative and quantitative improvements in both segmentation accuracy and computational time on a dataset of late-stage AMD patients. Given the robustness of this approach on pathological cases, we plan to investigate how retinal layers change over time in the presence of specific diseases.


References 1. Huang, D., Swanson, E.A., Lin, C.P., Schuman, J.S., Stinson, W.G., Chang, W., Hee, M.R., Flotte, T., Gregory, K., Puliafito, C.A., Fujimoto, J.G.: Optical coherence tomography HHS public access. Science 22(2545035), 1178–1181 (1991) 2. Abramoff, M., Garvin, M., Sonka, M.: Retinal imaging and image analysis. IEEE Rev. Biomed. Eng. 3, 169–208 (2010) 3. Morgan, J.I.W.: The fundus photo has met its match: optical coherence tomography and adaptive optics ophthalmoscopy are here to stay. Ophthalmic Physiol. Opt. 36(3), 218–239 (2016) 4. Mayer, M.A., Hornegger, J., Mardin, C.Y., Tornow, R.P.: Retinal nerve fiber layer segmentation on FD-OCT scans of normal subjects and glaucoma patients. Biomed. Opt. Express 1(5), 1358–1383 (2010) 5. Dufour, P.A., Ceklic, L., Abdillahi, H., Schroder, S., De Zanet, S., Wolf-Schnurrbusch, U., Kowal, J.: Graph-based multi-surface segmentation of OCT data using trained hard and soft constraints. IEEE Trans. Med. Imaging 32(3), 531–543 (2013) 6. Chen, X., Niemeijer, M., Zhang, L., Lee, K., Abramoff, M.D., Sonka, M.: Threedimensional segmentation of fluid-associated abnormalities in retinal OCT: probability constrained graph-search-graph-cut. IEEE Trans. Med. Imaging 31(8), 1521–1531 (2012) 7. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10. 1007/978-3-319-24574-4 28 8. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. 7(3), 171–180 (2015). Arxiv.org ¨ Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-net: 9. C ¸ i¸cek, O., learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 49 10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016 11. Wu, Z., Shen, C., van den Hengel, A.: Wider or deeper: revisiting the ResNet model for visual recognition (2016) 12. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift, pp. 1–11 (2015) 13. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, June 2015 14. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, Inception-ResNet and the impact of residual connections on learning, p. 12 (2016) 15. Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks, pp. 1–12 (2016) 16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014) 17. Butt, M., Maragos, P.: Optimum design of Chamfer distance transforms. IEEE Trans. Image Process. 7(10), 1477–1484 (1998)

Quality Assessment of Echocardiographic Cine Using Recurrent Neural Networks: Feasibility on Five Standard View Planes

Amir H. Abdi1(B), Christina Luong2, Teresa Tsang2, John Jue2, Ken Gin2, Darwin Yeung2, Dale Hawley2, Robert Rohling1, and Purang Abolmaesumi1

1 Electrical and Computer Engineering Department, University of British Columbia, Vancouver, Canada
[email protected]
2 Cardiology Lab, Vancouver General Hospital, Vancouver, Canada

Abstract. Echocardiography (echo) is a clinical imaging technique which is highly dependent on operator experience. We aim to reduce operator variability in data acquisition by automatically computing an echo quality score for real-time feedback. We achieve this with a deep neural network model, with convolutional layers to extract hierarchical features from the input echo cine and recurrent layers to leverage the sequential information in the echo cine loop. Using data from 509 separate patient studies, containing 2,450 echo cines across five standard echo imaging planes, we achieved a mean quality score accuracy of 85% compared to the gold-standard score assigned by experienced echosonographers. The proposed approach calculates the quality of a given 20 frame echo sequence within 10 ms, sufficient for real-time deployment. Keywords: Convolutional · Recurrent Neural Network · LSTM · Deep learning · Quality assessment · Echocardiography · Echo cine loop

1 Introduction

Despite advances in medicine and technology, cardiovascular disease remains the leading cause of mortality worldwide. Cardiac ultrasound, better known as echocardiography (echo), is the standard method for screening, detection, and monitoring of cardiovascular disease. This noninvasive imaging modality is widely available, cost-effective, and is used for evaluation of cardiac structure and function. Standard echo studies include assessment of chamber size and function as well as valvular stenosis and competence.

C. Luong—Co-first author. T. Tsang is the Director of the Vancouver General Hospital and University of British Columbia Echocardiography Laboratories, supervisor of the Cardiology Team, and Co-Principal Investigator of the CIHR-NSERC grant supporting this work.

Fig. 1. The five standard echo view planes targeted in this study: AP2, AP3, AP4, PSAX(A), and PSAX(PM).

However, the clinician's ability to interpret an echo study highly depends on image quality, which is closely tied to the sonographer's skill and patient characteristics. Suboptimal images compromise interpretation and can adversely alter patient care. Comprehensive evaluation with echo requires the acquisition of standardized views for 2D measurements and precise ultrasound transducer alignment. As ultrasound becomes increasingly available, less experienced clinicians are using this tool, with potential hazards due to inconsistent image quality and limited expertise.

Unlike other imaging modalities, ultrasound systems do not have automated image acquisition. The images obtained rely on the operator's knowledge of the cardiac structures. To improve the consistency of ultrasound image acquisition, efforts have been invested in detecting shadows and aperture obstructions [8,10] and optimizing the image acquisition parameters [5]. However, those methods are generic and not specific to echo acquisition. The quality of echo data is also dependent on optimizing the imaging plane to obtain sharp edges of the desired anatomic structures for each standard view. View-specific quality assessment has been investigated through searching for a binary atlas using the generalized Hough transform [11] or defining a goodness-of-fit to a parametric template model [13]. However, those techniques mainly rely on the presence of sharp edges in the image, and hence are likely to fail in low-contrast settings, which are very common in clinical echo data. In our previous work [1], we proposed an echo quality assessment approach using convolutional neural networks which focused only on the apical four-chamber view. However, the proposed method did not take advantage of the information available in sequential echo images (cine echo), and the assessment was limited to end-systolic frames.

In this work, we propose a deep learning model for quality assessment of echo cine loops across five standard imaging planes, based on analyzing the entire cine echo. The five standard view planes we analyze in this work are apical 2-chamber (AP2), apical 3-chamber (AP3), apical 4-chamber (AP4), parasternal short axis at the aortic valve level (PSAX(A)), and parasternal short axis at the papillary muscle level (PSAX(PM)) (Fig. 1). We designed a deep neural network with convolutional and recurrent layers, and a shared architecture to leverage transfer learning. This model automatically extracts hierarchical features from different echo view planes and relates them to a quality score determined by expert echocardiographers. In this research, we use data from 509 separate patient
studies, with a total of 2,450 echo cines. Using GPU-computing, the network is able to assess an echo cine loop and assign a quality score in real time.

2 Materials and Method

2.1 Dataset and Gold Standard Quality Assessment

To train the deep learning model, we accessed an echo database on the Picture Archiving and Communication System at Vancouver General Hospital. Different ultrasound machines from Philips and GE, and different image acquisition parameters, contributed to the dataset. The majority of studies were performed by certified sonographers, with a small proportion scanned by cardiology and sonography trainees. The dataset was randomly selected from the database and is therefore expected to contain a uniform distribution of easy and difficult patient cases. For each patient, 2D cine loops were available from standard views. In this paper, we focused on five standard 2D views, i.e. AP2, AP3, AP4, PSAX(A), and PSAX(PM) (Fig. 1). These views provide a comprehensive evaluation of chamber size, systolic function, and gross valvular function. We used 2,450 cine loops from 509 echo studies with ethics approval of the Clinical Medical Research Ethics Board and consultation with the Information Privacy Office.

The dataset was evaluated for image quality by one of two physicians trained in echocardiography. A semi-quantitative scoring system was defined for each view, modeled after a system proposed by Gaudet et al. [6], and is summarized in Table 1. The scores were obtained by semi-quantitative evaluation of component structures. Each component was assigned a quality score of up to 2 points, and these were summed to produce an overall view-specific image score, based on the following observations: 0 points) the structure was not imaged or was inadequate for assessment; 1 point) the structure was adequately viewed; 2 points) the view was optimized for the structure. Other components of the score included appropriate centering (1 point), correct depth setting (0.5 points), proper gain (0.5 points), and correct axis (1 point). Since the maximum possible score value was different for each view, the quality scores for all views were normalized to one. We refer to the normalized ground-truth values assigned by the trained echocardiographer as the Clinical Echo Score (CES).

Table 1. Summary of dataset and criteria for quality assessment. Note that each echo cine can contain multiple sequences of 20 consecutive frames.

View plane   #Cines   #Seqs   Criteria for clinical quality assessment           Score range
AP2          478      1131    Centering, depth, gain, LV, LA, MV                 0–8
AP3          455      1081    Centering, depth, gain, AV, MV, LA, LV, septum     0–7
AP4          575      1270    Centering, depth, gain, LV, RV, LA, RA, MV, TV     0–10
PSAX(A)      480      1148    Centering, depth, gain, AV and leaflets            0–4
PSAX(PM)     462      1189    Centering, depth, gain, papillary muscles, axis    0–5
Total        2450     5819
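As a toy illustration of this scoring scheme, a view-specific Clinical Echo Score could be computed and normalised as follows; the helper function, its arguments and the example values are hypothetical and not taken from the study protocol.

```python
def clinical_echo_score(structure_points, centering, depth_ok, gain_ok, axis_ok, max_score):
    """structure_points: per-structure scores in {0, 1, 2};
    centering/depth_ok/gain_ok/axis_ok: booleans for the remaining criteria."""
    raw = (sum(structure_points) + 1.0 * centering + 0.5 * depth_ok
           + 0.5 * gain_ok + 1.0 * axis_ok)
    return raw / max_score   # normalised CES in [0, 1]

# Example: a PSAX(PM) cine with well-imaged papillary muscles and correct axis,
# adequate centering, but slightly off depth and gain settings.
print(clinical_echo_score([2], centering=True, depth_ok=False, gain_ok=False,
                          axis_ok=True, max_score=5))  # -> 0.8
```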


Fig. 2. The proposed multi-stream network architecture. Number of kernels in each layer and their corresponding sizes are presented above each layer.

2.2 Network Architecture

The proposed deep neural network is a regression model, consisting of convolutional (conv), pooling (pool), and Long Short Term Memory (LSTM) layers [4], and is simultaneously trained on the five echo view planes. The quality score estimated by the neural network is referred to as the Network Echo Score (NES). The architecture, depicted in Fig. 2, represents a multi-stream network, i.e., five regression models that share weights across the first few layers. Each stream of the network has its own view-specific layers and is trained based on the mean absolute error loss function (the ℓ1 norm), via a stochastic gradient-based optimization algorithm.

All conv layers have kernels of size 3 × 3, following the VGG architecture [12], with the number of kernels doubling for deeper conv layers, i.e., from 8 to 32 kernels. The conv layers extract hierarchical features in the image, with the first three shared layers modeling high-level spatial correlations, and the next two conv layers focusing on view-specific quality features. The activation functions of the conv layers are Rectified Linear Units (ReLUs). In this design, all the pool layers are 2×2 max-pooling with a stride of 2, to select only superior invariant features and halve the input feature-map size in both dimensions, in order to reduce feature variance and train more generalized models. The conv and pool layers are applied to each frame of the echo cine independently. To prevent co-adaptation of features and over-fitting on the training data, a dropout layer with a dropout probability of 0.5 was used after the third pooling layer.

The feature map of the final pool layer is flattened and sent to an LSTM unit, a special flavor of Recurrent Neural Network (RNN) that uses a gated technique to selectively add or remove information from the cell state [7]. A single LSTM cell analyzes 20 feature-sets corresponding to the 20 consecutive input frames, and only the last output of the sequence is used. The LSTM layer uses hard sigmoid functions for inner and output activations.
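A minimal sketch of a single view-specific stream of this architecture is given below, assuming a tf.keras implementation. The number of conv blocks, the LSTM width and the sigmoid output activation are illustrative assumptions, and the weight sharing across the five streams of the full multi-stream model is omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

frames = layers.Input(shape=(20, 200, 200, 1))              # 20 grey-scale frames
x = frames
for n_kernels in (8, 16, 32):                                # 3x3 convs, 8 -> 32 kernels
    x = layers.TimeDistributed(layers.Conv2D(n_kernels, 3, padding='same',
                                             activation='relu'))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2, strides=2))(x)
x = layers.TimeDistributed(layers.Dropout(0.5))(x)
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.LSTM(128, recurrent_activation='hard_sigmoid')(x)   # only the last output is used
score = layers.Dense(1, activation='sigmoid')(x)               # NES in [0, 1]
model = models.Model(frames, score)
model.compile(optimizer='adam', loss='mean_absolute_error')    # L1 regression loss
```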

2.3 Training

We partitioned the data into two mutually exclusive sets of training-validation (80%) and test (20%). Network hyper-parameters were optimized by cross-validation on the training-validation set to ensure that the network can sufficiently learn the distribution of all echo views without over-fitting to the training data. After finalizing the network architecture, the network was trained on the entire training-validation set and its performance was reported on the test set.

Sequence Generation: Due to the variability in heart rate and frame acquisition rates, the number of frames per cardiac cycle varied from study to study. We used a static sequence size of 20 frames, which encompasses nearly half the average cardiac cycle in our dataset. This duration was sufficient to capture the quality distribution of the echo imaging planes without adversely affecting the run-time of the model. As a result, frames in each echo sequence sample are not synced with the cardiac cycle, neither in the training-validation nor in the test data set. This design decision ensured that the estimated quality score for a given input sequence was independent of the starting phase of the cardiac data. After partitioning studies into training-validation and test sets, each echo cine loop was split into as many consecutive sequences of 20 frames as possible, all of which were assigned the same quality score as the original echo cine. As a result, the average number of training-validation sequences per echo view was 935 (4,675 in total), and the average number of test sequences per echo view was 228 (1,144 in total), each with an equal length of 20 frames (Table 1).

Batch Selection: The five regression models were trained simultaneously and each batch consisted of eight sequences from each view. Each sequence was a set of 20 consecutive gray-scale frames, which were downsized to 200 × 200 pixels; no preprocessing was applied to the frames. Since the distribution of samples for each view was not uniform and the dataset held more mid to high quality images, a stratified batch selection strategy was implemented to prevent biases towards the quality-levels with the majority of samples [14]. For each view plane of each mini-batch, eight quality-levels were randomly selected and a sample corresponding to each quality-level was randomly fetched. The above strategy benefited the training in two ways: (1) training samples did not follow a predefined order; (2) it guaranteed that, from the network's perspective, the training samples have a uniform distribution among quality-levels for all five echo views.

Data Augmentation: Data augmentation was applied to achieve a more generalized model and to reduce the probability of over-fitting. To promote rotational invariance, each sequence of each batch was rotated, on-the-fly, by a random value uniformly drawn from the range [−7°, +7°]. A cardiologist confirmed that
this amount of rotation does not degrade the clinical quality of the cine. Translational invariance was achieved by shifting each cine of each batch in the horizontal and vertical directions by a random value uniformly drawn from the range [−D/15, +D/15], where D is the width or height of the frame.

Training: The deep learning model was trained using the Adam optimizer with the same hyper-parameters as suggested in the original research [9]. The weights of the conv layers were initialized randomly from a zero-mean Gaussian distribution. To prevent the deep network from over-fitting on the training data, ℓ2-norm regularization was added to the weights of the conv kernels. The Keras deep learning library with a TensorFlow backend was used to train and test the models [3].
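A sketch of the on-the-fly augmentation described above is given below, assuming a SciPy-based implementation. The angle and shift ranges follow the text, while the function name, interpolation order and array layout are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment_sequence(seq):
    """seq: (20, H, W) echo sequence; one random transform is drawn per sequence."""
    angle = np.random.uniform(-7, 7)                        # degrees
    h, w = seq.shape[1:]
    dy = np.random.uniform(-h / 15, h / 15)                 # vertical shift
    dx = np.random.uniform(-w / 15, w / 15)                 # horizontal shift
    out = np.empty_like(seq)
    for i, frame in enumerate(seq):
        frame = rotate(frame, angle, reshape=False, order=1)
        out[i] = shift(frame, (dy, dx), order=1)
    return out
```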

3 Experiments and Results

The error distribution for each echo view, calculated as NES − CES, is depicted in Fig. 3a. Figure 3b shows the average accuracy percentage calculated as

Acc_view = (1 − (1/T_view) Σ_{i=1}^{T_view} |NES_i − CES_i|) × 100,    (1)

where Tview is the total number of test sequences for the echo view. Performance of the model on the test data shows an average accuracy of 85% ± 12 against the expert scores. The accuracy of the trained models for the views are in the same order ranging from 83%–89%. Example test results are shown in Fig. 4. By leveraging the GPU-based implementation of neural networks by TensorFlow, the trained model was able to estimate quality of an input echo cine with 20 frames of 200 × 200 pixels in 10 ms, suitable for real-time deployment.

Fig. 3. (a) Distribution of error in each echo view. (b) Performance of the trained models for each view calculated via Eq. (1):

View     AP2      AP3      AP4       PSAX(A)   PSAX(PM)   Total
Acc (%)  86 ± 9   89 ± 9   83 ± 14   84 ± 12   83 ± 13    85 ± 12


Fig. 4. Sample test results for the five standard echo imaging planes. The left bar in each sub-figure shows the gold-standard score by an expert echocardiographer (CES), and the right bar shows the estimated score by our approach (NES).

4 Discussion and Conclusion

Studies suggest that real-time feedback to sonographers can help optimize the image quality [13]. Here, we propose a deep learning framework to estimate the quality of a given echo cine and to provide feedback to the user in real time. The results show that the trained model achieves an acceptable 85% average accuracy across all five targeted echo view planes (Fig. 3), which is superior to the 82% performance achieved in our previous study on single end-systolic frames of the AP4 view [1]. More importantly, as demonstrated in Fig. 3, the performance of the model is similar across all the views, with the error distributed evenly around zero. As a result of the stratified batch-selection technique (Sect. 2.3), the model observes a uniformly distributed training set, eliminating potential biases towards a quality-level with the majority of samples.


The five echo imaging planes were chosen based on their importance in echo studies. We did not provide any a priori information to the model regarding the visual perception of these views, and the proposed method does not use view-specific templates [11,13]; hence, we expect that this approach can be easily extended to other echo imaging planes. More importantly, this is the first study to leverage the sequential information in echo cines to estimate the quality of the cine loop. Moreover, by designing a cross-domain architecture (Fig. 2), we leverage transfer learning to share the training sequences of each view with other views [2]. As a result, the proposed approach requires fewer training samples per echo view to achieve the same accuracy. As the method does not rely on any pre-processing steps and takes advantage of GPU computing, the model can compute the quality of a 20-frame cine in real time. This is comparable to the speed achieved in our previous study [1] and faster than the Hough transform method suggested by Pavani et al. [11].

References 1. Abdi, A.H., et al.: Automatic quality assessment of apical four-chamber echocardiograms using deep convolutional neural networks. In: Proceedings of SPIE, vol. 10133, pp. 101330S–101330S-7 (2017) 2. Chen, H., Zheng, Y., Park, J.-H., Heng, P.-A., Zhou, S.K.: Iterative Multi-domain Regularized Deep Learning for Anatomical Structure Detection and Segmentation from Ultrasound Images. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 487–495. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 56 3. Chollet, F.: Keras (2015). https://github.com/fchollet/keras 4. Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 677–691 (2017) 5. El-Zehiry, N., Yan, M., Good, S., Fang, T., Zhou, S.K., Grady, L.: Learning the manifold of quality ultrasound acquisition. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8149, pp. 122–130. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40811-3 16 6. Gaudet, J., et al.: Focused critical care echocardiography: development and evaluation of an image acquisition assessment tool. Crit. Care Med. 44(6), e329–e335 (2016) 7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 8. Huang, S.W., et al.: Detection and display of acoustic window for guiding and training cardiac ultrasound users. In: Progress in Biomedical Optics and Imaging - Proceedings of SPIE, vol. 9040, p. 904014 (2014) 9. Kingma, D.P., Ba, J.L.: Adam: a Method for Stochastic Optimization. In: International Conference on Learning Representations 2015, pp. 1–15 (2015) 10. Løvstakken, L., et al.: Real-time indication of acoustic window for phased-array transducers in ultrasound imaging. In: Proceedings of IEEE Ultrasonics Symposium, pp. 1549–1552 (2007) 11. Pavani, S.K., et al.: Quality metric for parasternal long axis B-mode echocardiograms. MICCAI 2015, 478–485 (2012)

310

A.H. Abdi et al.

12. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: ICRL 2015, pp. 1–14 (2015) 13. Snare, S.R., et al.: Real-time scan assistant for echocardiography. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 59(3), 583–589 (2012) 14. Zhao, P., Zhang, T.: Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprint arXiv:1405.3080, pp. 1–13 (2014)

Semi-supervised Deep Learning for Fully Convolutional Networks

Christoph Baur1(B), Shadi Albarqouni1, and Nassir Navab1,2

1 Computer Aided Medical Procedures (CAMP), Technische Universität München, Munich, Germany
[email protected]
2 Whiting School of Engineering, Johns Hopkins University, Baltimore, USA

Abstract. Deep learning usually requires large amounts of labeled training data, but annotating data is costly and tedious. The framework of semi-supervised learning provides the means to use both labeled data and arbitrary amounts of unlabeled data for training. Recently, semi-supervised deep learning has been intensively studied for standard CNN architectures. However, Fully Convolutional Networks (FCNs) set the state-of-the-art for many image segmentation tasks. To the best of our knowledge, there is no existing semi-supervised learning method for such FCNs yet. We lift the concept of auxiliary manifold embedding for semi-supervised learning to FCNs with the help of Random Feature Embedding. In our experiments on the challenging task of MS Lesion Segmentation, we leverage the proposed framework for the purpose of domain adaptation and report substantial improvements over the baseline model.

1 Introduction

In order to train deep neural networks, usually huge amounts of labeled data are necessary. In the medical field, however, labeled data is scarce, as manual annotation is time-consuming and tedious. At the same time, when training models using a limited amount of labeled data, there is no guarantee that these models will generalize well on unseen data that is distributed slightly differently. A prominent example in this context is Multiple Sclerosis (MS) lesion segmentation in MR images, which suffers from both a lack of ground-truth and distribution-shift across images from different devices [2]. However, vast amounts of unlabeled data can often be provided comparably easily. Semi-supervised learning provides the means to leverage both a limited amount of labeled and arbitrary amounts of unlabeled data for training deep networks [11]. In recent years, various frameworks for semi-supervised deep learning have been proposed: In 2012, Weston et al. [11] presented a framework for shallow networks based on auxiliary manifold embedding.

C. Baur and S. Albarqouni contributed equally towards this work.


Fig. 1. Illustration of the semi-supervised deep learning framework. (1) Labeled data is used to optimize the network for a primary objective LP . (2) The training batches are further augmented with unlabeled samples and their prior maps, which altogether contribute to the embedding loss LE . Note: Unlabeled data does not influence LP .

generalization. In 2013, Lee et al. [4] also reported improved generalization when fine-tuning a model from predictions on unlabeled data using an entropy regularization loss. More recently, in 2015, Rasmus et al. [8] introduced the ladder network architecture for semi-supervised deep learning. And lately, Yang et al. [12] presented a framework also based on graph embeddings, with both transductive and inductive variants for shallow neural networks. All of these methods show promising results, but they are tailored to classic CNN architectures and often only examined on small computer vision datasets. In challenging problems such as biomedical image segmentation, Fully Convolutional Networks (FCNs) [5] are preferable as they are efficient and show the ability to learn context [3,6]. As far as we know, there is no existing semi-supervised learning method for such FCNs yet. In this paper, we lift the concept of auxiliary manifold embedding to FCNs with a, to the best of our knowledge, novel strategy called “Random Feature Embedding”. Subsequently, we successfully perform semi-supervised fine-tuning of FCNs for domain adaptation with our proposed embedding technique on the challenging task of MS lesion segmentation.

2 Methodology

Semi-supervised learning is based on the concept of training a model, f(·), using both labeled and unlabeled data X_L and X_U, respectively. In our framework for FCNs, training data has the form D = {X, Y}, where X = X_L ∪ X_U = {x_1, ..., x_{N_L}, x_{N_L+1}, ..., x_{N_{L+U}}} ∈ R^{H×W×D×N_{L+U}} are D-channel images, and Y = {y_1, ..., y_{N_L}} ∈ R^{H×W×1×N_L} are the corresponding label maps, which are only available for the labeled data. Since we deal with FCNs, both images and label maps have the same dimensions, H × W, allowing a distinct label per pixel.

2.1 Auxiliary Manifold Embedding

In our framework, to model f(·), we employ a modified version of the U-Net architecture [9] that processes images of arbitrary sizes and outputs a label map of the same size as the input (Fig. 1). We train the network to minimize the primary objective L_P, i.e. the Dice loss [6], for our segmentation task (see Fig. 1) from labeled data only (Fig. 1, step 1). Simultaneously, to leverage the unlabeled data, we employ an auxiliary manifold embedding loss L_E on the latent feature representations h(·) of both X_L and X_U to minimize the discrepancy between similar inputs in the latent space. Thereby, similarity among h(·) of unlabeled data is given by prior knowledge (Fig. 1, step 2). The overall objective function can be written using Lagrangian multipliers as:

L = L_P + Σ_l λ_l · L_{E_l}    (1)

where λ_l is the regularization parameter associated with the embedding loss E_l at hidden layer l. Typically, this objective [11] aims at minimizing the distance among similar latent representations h^l(x_i) and h^l(x_j) of neighboring data samples x_i and x_j, and otherwise tries to push them apart if their distance is within a margin m:

L_{E_l}(X, A) = Σ_{i}^{n_E} Σ_{j}^{n_E} { d(h^l(x_i), h^l(x_j))               if a_ij = 1
                                          max(0, m − d(h^l(x_i), h^l(x_j)))   if a_ij = 0    (2)

Thereby, A ∈ R^{n_E × n_E} is an adjacency matrix between all n_E embedding samples within a training batch, and d(·,·) ∈ R^1 is an arbitrary distance metric measuring the distance between two latent representations. Unlike the typical ℓ2-norm distance employed in [11], we opt for the angular cosine distance (ACD)

ACD(h^l(x_i), h^l(x_j)) = 1 − (h^l(x_i)^T h^l(x_j)) / (‖h^l(x_i)‖_2 ‖h^l(x_j)‖_2)

for two reasons: first, it is naturally bounded between [0, 1], which limits the search range for the margin parameter m, and second, it shows superior performance on high-dimensional feature representations in deep architectures [7]. The definition of A is left to the user and can be arbitrary. We define a_ij to be 1 if the respective embeddings share the same label (labeled data) or prior (unlabeled data), and set it to 0 otherwise. In our real experiments, the prior is obtained via template matching with NCC, similar to [1], acting as a noisy surrogate for the label maps.

2.2 Random Feature Embedding

In standard CNNs, the embedding loss can be directly attached to the fully connected layers, where a latent feature vector h(xi ) represents a single input image xi . In the case of FCNs, however, an input of arbitrary size can produce a multi-channel feature map of arbitrary size. Since FCNs make predictions at the pixel level, meaningful embeddings can be obtained per pixel along the channel


Fig. 2. For FCNs, we sample embeddings h(xi ) of single pixels from feature maps along the channel dimension. Randomly sampled embeddings should be representative for the entire population.

dimension (Fig. 2). However, sampling and comparing all h(·) of all pixels in large images is computationally infeasible: a W′ × H′ × D × N_b feature map tensor for a batch of size N_b will yield n_E = W′ · H′ · N_b embeddings and therefore n_E^2 comparisons. This quickly becomes intractable to compute. Instead, we suggest to do Random Feature Embedding (RFE): we randomly sample a limited number n_E of pixels from the feature maps of the current batch according to some sampling strategy discussed in the next section to limit the number of comparisons. The loss remains valid since we propagate back only the gradients of selected pixels.

Sampling Strategy. Ideally, the distribution of randomly sampled embeddings should mimic the one of all embeddings (Fig. 2), while at the same time paying attention to the class distribution such that unwanted bias is not introduced to the model. Therefore, we investigate the following sampling strategies (a small sampling sketch is given after this list):

– 50/50 RFE: For each class in the prior the same amount of embeddings is randomly extracted to represent h(·) from different classes equally. For unbalanced classes, this might lead to oversampling of one class.
– Distribution-Aware RFE: Embeddings are sampled from the given training batch according to the ratio of negative and positive classes in the prior to preserve the actual class distribution. When classes are unbalanced and n_E is too small, this might lead to undersampling of one class.
– 80/20 RFE: As a trade-off, embeddings can be randomly sampled from a predefined ratio of 80% background and 20% foreground pixels.
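A rough Python sketch of such pixel-wise random sampling is given below; shapes, names, and the handling of degenerate cases are our own assumptions and only illustrate the three strategies above.

```python
import numpy as np

def sample_embeddings(feat, prior, n_e=100, fg_ratio=0.2, rng=np.random):
    """Randomly sample n_e per-pixel embeddings from a (C, H, W) feature map.
    fg_ratio=0.2 corresponds to 80/20 RFE, fg_ratio=0.5 to 50/50 RFE, and using
    the true foreground fraction of the prior gives Distribution-Aware RFE."""
    flat_prior = prior.ravel()
    fg = np.flatnonzero(flat_prior > 0)
    bg = np.flatnonzero(flat_prior == 0)
    n_fg = min(int(round(fg_ratio * n_e)), fg.size)
    n_bg = min(n_e - n_fg, bg.size)
    idx = np.concatenate([rng.choice(fg, n_fg, replace=False),
                          rng.choice(bg, n_bg, replace=False)])
    embeddings = feat.reshape(feat.shape[0], -1)[:, idx].T   # (n_e, C)
    labels = (flat_prior[idx] > 0).astype(int)
    return embeddings, labels
```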

3 Experiments and Results

Our experiments on MS lesion segmentation are motivated by the fact that existing automatic segmentation methods often fail to generalize well to MRI data acquired with different devices [2]. In this context, we leverage our semi-supervised learning framework for domain adaptation, i.e. we try to improve generalization of a baseline model by fine-tuning it with unlabeled data from


Table 1. Overview of our MS lesion data and our training/testing split

Dataset  Domain  Patients (train/test)  Resolution        Scanner
MSSEG    A       3/2                    144 × 512 × 512   3T Philips Ingenia
MSSEG    B       3/2                    144 × 512 × 512   1.5T Siemens Aera
MSSEG    C       3/2                    144 × 512 × 512   3T Siemens Verio
MSKRI    D       3/10                   300 × 256 × 256   3T Philips Achieva

other domains. Therefore, using an optimal prior for the adjacency matrix A, we first assess different sampling strategies, the impact of different numbers of embeddings as well as different distance measures for RFE. In succession, we utilize the most promising sampling strategy and distance measure together with a real prior for domain adaptation.

Dataset. Our MRI MS lesion data is a combination of the publicly available MSSEG training dataset (https://portal.fli-iam.irisa.fr/msseg-challenge/overview) and the non-publicly available in-house MSKRI dataset (cf. Table 1). For all patients there are co-registered T1, T2 and FLAIR volumes as well as a corresponding ground-truth segmentation volume. Patients are grouped into domains A, B, C and D based on the scanner they have been acquired with. For training & fine-tuning, we randomly crop 128 × 128 × 3 sized patches and label maps around lesions from corresponding T1, T2 and FLAIR axial slices, and randomly split them into training & validation sets with a ratio of 7:3. Actual testing is performed on full slices of the testing volumes.

Implementation. Our framework is built on top of MatConvNet [10]. All models were trained in batches of 12 and the learning rate of the primary objective fixed to 10e–6. For embedding with the ℓ2-norm, we set λ = 0.01 and the margin parameter m = 1000 (empirically chosen); for embedding with ACD we use λ = 1 and m = 1, such that h(·) from different classes become orthogonal.

3.1 Baseline Models

In order to measure the impact of our auxiliary manifold embedding, we first train so-called lower bound and upper bound baseline models in a totally supervised fashion from labeled data only. Per domain, we thereby crop approx. 6000 patches from the respective three training patients. We train the Lower Bound Model A_L from domain A training data for 50 epochs to obtain a model which produces decent segmentations on data from domain A, but does not generalize well to other domains. Further, we train so-called Upper Bound Models by taking A_L after 35 epochs and fine-tuning it until epoch 50 using mixed, labeled training data from domain A and d ∈ [B, C, D]. We obtain three different models which should ideally be able to segment MS lesion volumes from domain A & B, A & C or A & D, respectively.

3.2 Semi-supervised Embedding

For the purpose of semi-supervised deep learning, we now assume there is labeled data from domain A, and unlabeled/barely labeled MRI data from the other domains d ∈ [B, C, D]. All of the models trained in the following experiments originally build upon epoch 35 of the lower bound model A_L. One embedding loss is attached to the second-last convolutional layer of the network (Fig. 1). This choice is due to the fact that this particular layer produces feature maps at the same resolution as the input images, thus there is no need to downsample the prior required for embedding and thus no risk involved in losing heavily underrepresented, small lesion pixels (comprising less than 1% of all pixels).

Proof of Concept. We first assume a perfect prior, i.e. we use the label maps of all data we consider as unlabeled for embedding, and concentrate on investigating the best choice of sampling strategy, number of embeddings n_E and distance metric. We fine-tune models for target domain B using (i) 50/50, Distribution-Aware and 80/20 RFE with (ii) the ℓ2-norm and the ACD as a distance metric and (iii) different numbers of embeddings n_E ∈ {20, 100, 200, 500, 1000, 2000}, yielding a total of 36 different models. For fine-tuning in these proof-of-concept experiments, we use only a subset of 200 images from domain A and target domain B each, rather than the full training set. Our results show that, as we ramp up n_E, the distribution of randomly sampled embeddings more and more resembles the full distribution (Fig. 3(b)), which renders the random sampling generally valid. Moreover, with increasing n_E, we notice consistent segmentation improvements on target domain B when using ACD as a distance metric (Fig. 3(a)). In repeated experiments, we always obtain similar results. The improvements are most pronounced with the 80/20 sampling strategy. However, the ℓ2-norm performs poorly and seems unstable as the number of embeddings


Fig. 3. (a) Average F-Scores reported on domain B testing data for models trained with different settings and (b) the impact of increasing nE on the Jensen-Shannon divergence between randomly sampled embeddings and the space of all embeddings.


n_E increases. We believe this is because the ℓ2-norm penalizes the magnitudes of the vectors being compared, whereas ACD is scale-invariant and only constrains their direction.

Real Prior. Motivated by previous results, we now train models with a real, noisy prior using ACD and 80/20 RFE. For a target domain d ∈ [B, C, D], we assume that one out of the three training patients is labeled and use it to compute a prior for the other two. Before fine-tuning, the first labeled FLAIR training volume V1 of d is selected and 30 different 5 × 5 × 5 voxel sized 3D templates are randomly extracted around MS lesions. Then, we perform NCC on the remaining volumes V2 and V3 of the current domain d. The choice of 3D template matching is to ensure consistency among neighboring MR slices. For thresholding the template matching output, the same matching is applied to V1 itself and the threshold which maximizes the Dice coefficient between the ground-truth labels and the geometric mean of the responses on V1 is computed. Using this noisy prior, we fine-tune models for domains B, C and D using approx. 4000 training patches from volumes V2 and V3 and all labeled training patches from domain A. We set n_E = 100 to obtain an overall number of embeddings similar to the proof-of-concept experiment (n_E = 2000), but with lower computational cost. The models show consistent improvements over the lower bound model A_L for all target domains (see Fig. 4(a)). Visual inspection (Fig. 4(b)) reveals that the semi-supervised embedding seems to dramatically reduce the number of false positives (FP). Interestingly, it detects some lesions (encircled) where the upper bound model fails, but it fails to spot very small lesions. This is probably because embeddings of smaller lesions are less likely to be sampled.
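For illustration, a possible way to build such a noisy prior from lesion templates is sketched below; the use of the maximum response (rather than the geometric mean described above) and the function names are our own simplifications.

```python
import numpy as np
from skimage.feature import match_template  # normalised cross-correlation (NCC)

def noisy_prior(volume, templates, threshold):
    # templates: list of small 3D patches (e.g. 5x5x5) extracted around lesions in V1
    responses = [match_template(volume, t, pad_input=True) for t in templates]
    # maximum NCC response over all templates, thresholded to a binary surrogate label map
    return np.max(responses, axis=0) > threshold
```

The threshold itself would be chosen on the labeled volume V1, e.g. by scanning candidate values and keeping the one with the highest Dice overlap against the ground truth.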


Fig. 4. (a) Comparing the lower bound model, the semi-supervised models fine-tuned with NCC & ACD, and the upper bound models on the respective target domain; (b) The semi-supervised model for domain C (middle) produces much fewer FP than the lower bound model (left image) and is only slightly inferior to the respective upper bound model (right image).

4 Discussion and Conclusion

In summary, we presented the concept of auxiliary manifold embedding and successfully integrated it into a semi-supervised deep learning framework for FCNs. At the heart of this framework is Random Feature Embedding, a simple method for sampling feature representations which serve as a statistic for nonlinear embedding. Our experiments on MS lesion segmentation revealed that the method can improve generalization capabilities of existing models when using ACD as a distance metric. Yet, there is a lot of room for follow-up investigations. For instance, in future work, the effect of attaching multiple embedding objectives at different layers of an FCN could be investigated. In general, the proposed method should be applied to other problems as well to reveal its full potential. Acknowledgements. We thank our clinical partners, in particular Dr. med. Paul Eichinger and Dr. med. Benedikt Wiestler, from the Neuroradiology Department of Klinikum Rechts der Isar for providing us with their MRI MS Lesion dataset. Further, we want to thank Rohde & Schwarz GmbH & Co KG for funding the project.

References

1. Bermúdez-Chacón, R., Becker, C., Salzmann, M., Fua, P.: Scalable unsupervised domain adaptation for electron microscopy. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 326–334. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_38
2. García-Lorenzo, D., Francis, S., Narayanan, S., Arnold, D.L., Collins, D.L.: Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging. Med. Image Anal. 17(1), 1–18 (2013)
3. Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B.: Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017)
4. Lee, D.H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML, vol. 3 (2013)
5. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR, pp. 3431–3440 (2015)
6. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)
7. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML, pp. 807–814 (2010)
8. Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.: Semi-supervised learning with ladder networks. In: NIPS, pp. 3546–3554 (2015)
9. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10.1007/978-3-319-24574-4_28


10. Vedaldi, A., Lenc, K.: MatConvNet: convolutional neural networks for MATLAB. In: Proceedings of the ACM International Conference on Multimedia (2015)
11. Weston, J., Ratle, F., Mobahi, H., Collobert, R.: Deep learning via semi-supervised embedding. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 639–655. Springer, Heidelberg (2012). doi:10.1007/978-3-642-35289-8_34
12. Yang, Z., Cohen, W., Salakhudinov, R.: Revisiting semi-supervised learning with graph embeddings. In: ICML, pp. 40–48 (2016)

TandemNet: Distilling Knowledge from Medical Images Using Diagnostic Reports as Optional Semantic References

Zizhao Zhang, Pingjun Chen, Manish Sapkota, and Lin Yang

University of Florida, Gainesville, USA ([email protected])

Abstract. In this paper, we introduce the semantic knowledge of medical images from their diagnostic reports to provide an inspirational network training and an interpretable prediction mechanism with our proposed novel multimodal neural network, namely TandemNet. Inside TandemNet, a language model is used to represent report text, which cooperates with the image model in a tandem scheme. We propose a novel dual-attention model that facilitates high-level interactions between visual and semantic information and effectively distills useful features for prediction. In the testing stage, TandemNet can make accurate image predictions with an optional report text input. It also interprets its prediction by producing attention on informative image and text feature pieces, and further by generating diagnostic report paragraphs. Based on a dataset of pathological bladder cancer images and their diagnostic reports (BCIDR), sufficient experiments demonstrate that our method effectively learns and integrates knowledge from multiple modalities and obtains significantly improved performance compared with baselines.

1 Introduction

In medical image understanding, convolutional neural networks (CNNs) have gradually become the paradigm for various problems [1]. Training CNNs to diagnose medical images primarily follows pure engineering trends in an end-to-end fashion. However, the principles of CNNs during training and testing are difficult to interpret and justify. In clinical practice, domain experts teach learners by explaining findings and observations to make a disease decision rather than leaving learners to find clues from images themselves. Inspired by this fact, in this paper, we explore the usage of semantic knowledge of medical images from their diagnostic reports to provide explanatory supports for CNN-based image understanding. The proposed network learns to provide interpretable diagnostic predictions in the form of attention and natural language descriptions. The diagnostic report is a common type of medical record in clinics, which is comprised of semantic descriptions about the observations of biological features. Recently, we have witnessed rapid development in


Fig. 1. The illustration of the TandemNet.

multimodal deep learning research [2,3]. We believe the joint study of multimodal data is essential towards intelligent computer-aided diagnosis. However, only a dearth of related work exists [4,5]. To take advantage of the language modality, we propose a multimodal network that jointly learns from medical images and their diagnostic reports. Semantic information interacts with visual information to improve the image understanding ability by teaching the network to distill informative features. We propose a novel dual-attention model to facilitate such high-level interaction. The training stage uses both images and texts. In the testing stage, our network can take an image and provide accurate prediction with an optional (i.e. with or without) text input. Therefore, the language and image models inside our network cooperate with one another in a tandem scheme to either single- (image) or double- (image-text) drive the prediction process. We refer to our proposed network as TandemNet. Figure 1 illustrates the overall framework. To validate our method, we cooperate with pathologists and doctors to collect the BCIDR dataset. Sufficient experimental studies on BCIDR demonstrate the advantages of TandemNet. Furthermore, by coupling visual features with the language model and fine-tuning the network using backpropagation through time (BPTT), TandemNet learns to automatically generate diagnostic reports. The rich outputs (i.e. attention and reports) of TandemNet have valuable meanings: providing explanations and justifications for its diagnostic prediction and making this process interpretable to pathologists.

2 Method

CNN for image modeling. We adopt the (new pre-activated) residual network (ResNet) [6] as our image model. The identity mapping in ResNet significantly improves the network generalization ability. There are many architecture variants of ResNet. We adopt the wide ResNet (WRN) [7] which has shown better


performance and higher efficiency with far fewer layers. It also offers scalability of the network (number of parameters) by adjusting a widen factor (i.e. the number of channels of the feature maps) and the depth. We extract the output of the layer before average pooling as our image representation, denoted as V ∈ R^{C×G}. The input image size is 224 × 224, so G = 14 × 14. C depends on the widen factor.

LSTM for language modeling. We adopt Long Short-Term Memory (LSTM) [8] to model diagnostic report sentences. LSTM improves vanilla recurrent neural networks (RNNs) for natural language processing and is also widely used for multimodal applications such as image captioning [2,9]. It has a sophisticated unit design, which enables long-term dependencies and greatly reduces the gradient vanishing problem in RNNs [10]. Given a sequence of words {x_1, ..., x_n}, LSTM reads the words one at a time and maintains a memory state m_t ∈ R^D and a hidden state h_t ∈ R^D. At each time step, LSTM updates them by

h_t, m_t = LSTM(x_t, h_{t−1}, m_{t−1}),    (1)

where x_t ∈ R^K is an input word, computed by first encoding the word as a one-hot vector and then multiplying it by a learned word embedding matrix. The hidden state is a vector encoding of the sentence. Its treatment varies across problems. For example, in image captioning, a multilayer perceptron (MLP) is used to decode it into a predicted word at each time step. In machine translation [11], all hidden states could be used. A medical report is more formal than a natural image caption. It usually describes multiple types of biological features structured by a series of sentences. It is important to represent all feature descriptions but maintain the variety and independence among them. To this end, we extract the hidden state of every feature description (in our implementation, this is achieved by adding a special token at the end of each sentence beforehand and extracting the hidden states at all the placed tokens). In this way, we obtain a text representation matrix S = [h_1, ..., h_N] ∈ R^{D×N} for N types of feature descriptions. This strategy has further advantages: it enables the network to adaptively select useful semantic features and determine the respective feature importance for the disease labels (as shown in the experiments).

Dual-attention model. The attention mechanism [11,12] is an active topic in both the computer vision and natural language communities. Briefly, it gives networks the ability to generate attention on parts of the inputs (like visual attention in the brain cortex), which is achieved by computing a context vector with attended information preserved. Different from most existing approaches that study attention on images or text, given the image representation V and the report representation S (both matrices are first embedded through a 1 × 1 convolutional layer with Tanh), our dual-attention model can generate attention on important image regions and sentence parts simultaneously. Specifically, we define the attention function f_att to compute a piece-wise weight vector α as

e = f_att(V, S),    α_i = exp(e_i) / Σ_{i'} exp(e_{i'}),    (2)


where α ∈ R^{G+N} has individual weights for the visual and semantic features (i.e. V and S). f_att is specifically defined as follows:

z_{s→v} = tanh(W_v V + (W′_s Δ(S)) 1_v^T),
z_{v→s} = tanh(W_s S + (W′_v Δ(V)) 1_s^T),    (3)
e = w^T [z_{s→v}; z_{v→s}] + b,

where W_v, W′_v ∈ R^{M×C} and W_s, W′_s ∈ R^{M×D} are parameters to be learned to compute z_{s→v} ∈ R^{M×G} and z_{v→s} ∈ R^{M×N}, and w, b ∈ R^M. 1_v ∈ R^G and 1_s ∈ R^N are vectors with all elements equal to one. Δ denotes the global average-pooling operator on the last dimension of V and S. [;] denotes the concatenation operator. Finally, we obtain a context vector c ∈ R^M by

c = Oα = Σ_{i=1}^{G} α_i V_i + Σ_{j=G+1}^{G+N} α_j S_j,  where O = [V; S].    (4)

In our formulation, the computation of image and text attention is mutually dependent and conducts high-level interactions. The image attention is conditioned on the global text vector Δ(S) and the text attention is conditioned on the global image vector Δ(V). When computing the weight vector α, both modalities contribute through z_{s→v} and z_{v→s}. We also considered extra configurations: computing two e with two w and then concatenating them to compute α with one softmax, or computing two α with two softmax functions. Both configurations underperform ours. We conclude that our configuration is optimal for the visual and semantic information to interact with each other. Intuitively, our dual-attention mechanism encourages better piecewise alignment of visual information with semantic information, which thereby improves the ability of TandemNet to discriminate useful features for attention computation. We will validate this experimentally.

Prediction module. To improve the model generalization, we propose two effective techniques for the prediction module of the dual-attention model. (1) Visual skip-connection. The probability of a disease label p is computed as

p = MLP(c + Δ(V)).    (5)

The image feature Δ(V) skips the dual-attention model and is directly added onto c (see Fig. 1). During backpropagation, this skip-connection directly passes gradients from the loss layer to the CNN, which prevents possible gradient vanishing in the dual-attention model from obstructing CNN training. (2) Stochastic modality adaptation. We propose to stochastically "abandon" text information during training. This strategy generalizes TandemNet to make accurate predictions when text is absent. Our proposed strategy is inspired by Dropout and the stochastic depth network [13], which are effective for model generalization. Specifically, we define a drop rate r as the probability to remove (zero-out)


Table 1. The quantitative evaluation (averaged over 3 trials). The first block shows standard CNNs, so text is irrelevant.

Method          Accuracy (%) w/o text   Accuracy (%) w/ text
WRN16-4         75.4                    -
ResNet18-TL     79.4                    -
TandemNet-WVS   79.4                    85.6
TandemNet       82.4                    89.9
TandemNet-TL    84.9                    88.6

Fig. 2. The confusion matrices of two compared methods ResNet18-TL and TandemNet-TL (w/o text) in Table 1.

the text part S during the entire network training stage. Thus, based on the principle of Dropout, S will be scaled by 1 − r if text is given in testing. The effects of these two techniques are discussed in the experiments.
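The following PyTorch sketch illustrates how the dual-attention module, visual skip-connection and stochastic modality adaptation could fit together; it assumes that V and S have already been embedded to a common dimension C (cf. the 1 × 1 convolution mentioned above), and all class and variable names are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    def __init__(self, C, M, n_classes, drop_text=0.5):
        super().__init__()
        self.Wv, self.Wvg = nn.Linear(C, M, bias=False), nn.Linear(C, M, bias=False)
        self.Ws, self.Wsg = nn.Linear(C, M, bias=False), nn.Linear(C, M, bias=False)
        self.w = nn.Linear(M, 1)            # shared scoring vector w with bias b
        self.mlp = nn.Linear(C, n_classes)  # prediction head
        self.drop_text = drop_text          # drop rate r

    def forward(self, V, S):
        # V: (B, G, C) image pieces, S: (B, N, C) sentence representations
        if self.training and torch.rand(()) < self.drop_text:
            S = torch.zeros_like(S)         # stochastic modality adaptation
        z_sv = torch.tanh(self.Wv(V) + self.Wsg(S.mean(1, keepdim=True)))
        z_vs = torch.tanh(self.Ws(S) + self.Wvg(V.mean(1, keepdim=True)))
        e = torch.cat([self.w(z_sv), self.w(z_vs)], dim=1).squeeze(-1)  # (B, G+N)
        alpha = F.softmax(e, dim=1)
        c = (alpha.unsqueeze(-1) * torch.cat([V, S], dim=1)).sum(dim=1)
        return self.mlp(c + V.mean(1))      # visual skip-connection, Eq. (5)
```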

3 Experiments

Dataset. To collect the BCIDR dataset, whole-slide images were taken using a 20X objective from hematoxylin and eosin (H&E) stained sections of bladder tissue extracted from a cohort of 32 patients at risk of a papillary urothelial neoplasm. From these slides, 1,000 RGB images of 500 × 500 pixels were extracted randomly close to urothelial regions (each patient's slide yields a slightly different number of images). For each of these images, the pathologist then provided a paragraph describing the disease state. Each paragraph addresses five types of cell appearance features, namely the state of nuclear pleomorphism, cell crowding, cell polarity, mitosis, and prominence of nucleoli (thus N = 5). A conclusion is then decided for each image-text pair, drawn from four classes, i.e. normal tissue, low-grade (papillary urothelial neoplasm of low malignant potential) carcinoma, high-grade carcinoma, and insufficient information. Following the same procedure, four doctors (not experts in bladder cancer) wrote four additional descriptions for each image. They also referred to the pathologist's description to ensure annotation accuracy. Thus there are five ground-truth reports per image and 5,000 image-text pairs in total. Each report varies in length between 30 and 59 words. We randomly split 20% (6/32) of patients, including 1,000 samples, as the testing set and use the remaining 80% of patients, including 4,000 samples (20% as the validation set for model selection), for training. We subtract the data RGB mean and augment through clipping, mirroring and rotation.

Implementation details. Our implementation is based on Torch7. We use a small WRN with depth = 16 and widen-factor = 4 (denoted as WRN16-4), resulting in 2.7M parameters and C = 256. We use dropout with 0.3 after each convolution. We use D = 256 for the LSTM, M = 256, and K = 128. We use


SGD with a learning rate of 1e−2 for the CNN (used likewise for standard CNN training for comparison) and Adam with 1e−4 for the dual-attention model, both multiplied by 0.9 per epoch. We also limit the gradient magnitude of the dual-attention model to 0.1 by normalization [10].

Diagnostic prediction evaluation. Table 1 and Fig. 2 show the quantitative evaluation of TandemNet. For comparison with standard CNNs, we train a WRN16-4 and also a ResNet18 (11M parameters) pre-trained on ImageNet (model provided by https://github.com/facebook/fb.resnet.torch). We found transfer learning to be beneficial. To test this effect in TandemNet, we replace WRN16-4 with a pre-trained ResNet18 (TandemNet-TL). As can be observed, TandemNet and TandemNet-TL significantly improve over WRN16-4 and ResNet18-TL when only images are provided. We observe that TandemNet-TL slightly underperforms TandemNet when text is provided, over multiple trials. We hypothesize that this is because a model pre-trained on a completely different natural image domain is relatively hard to align with medical reports in the dual-attention model during fine-tuning. From Fig. 2, high grade (label id 3) is more likely to be misclassified as low grade (2), and some insufficient information (4) is confused with normal (1). We analyze the text drop rate in Fig. 3 (left). When the drop rate is low, the model obsessively uses text information, so it achieves low accuracy without text. When the drop rate is high, the text cannot be well adapted, resulting in decreased accuracy with or without text. The drop rate of 0.5 performs best and is therefore used in this paper. As illustrated in Fig. 3, we found that the classification of text is easier than that of images, so its accuracy is much higher. However, please note that the primary aim of this paper is to use text information only at the training stage, while at the testing stage the goal is to accurately classify images without text. In Eq. (5), one question that may arise is whether, when testing without text, it is merely Δ(V) from the CNN that produces useful features rather than c from the dual-attention model (since the removal (zero-out) of S could possibly destroy the attention ability). To validate the actual role of c, we remove the visual skip-connection and train the model (denoted as TandemNet-WVS

Fig. 3. Left: The accuracy with varying drop rates. Right: The averaged text attention per feature type (and overall) for each disease label. The feature types are specified (in order) in the dataset description in the text.



in Table 1), and it improves WRN16-4 by 4% without text. The qualitative evaluation below also validates the effectiveness of the dual-attention model. Additionally, we use the t-SNE (t-distributed Stochastic Neighbor Embedding) dimensionality reduction technique to examine the input of the MLP in Fig. 4.
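Such an inspection can be reproduced with a few lines of Python; the function name and parameter choices below are illustrative only.

```python
from sklearn.manifold import TSNE

def embed_2d(mlp_inputs):
    # mlp_inputs: (n_samples, C) array of the vectors c + mean(V) collected on test data
    return TSNE(n_components=2, init='pca', random_state=0).fit_transform(mlp_inputs)
```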

Fig. 4. The t-SNE visualization of the MLP input. Each point is a test sample. The embedding with text (right) results in a better distribution.

Attention analysis. We visualize the attention weights to show how TandemNet captures image and text information to support its prediction (the image attention map is computed by upsampling the G = 14 × 14 weights of α to the image space). To validate the visual attention, without notifying them of our results beforehand, we ask the pathologist to highlight regions of some test images they think are important. Figure 5 illustrates the performance. Our attention maps show surprisingly high consistency with the pathologist's annotations. The attention without text is also fairly promising, although it is less accurate than the results with text. Therefore, we can conclude that TandemNet effectively uses semantic information to improve visual attention and substantially maintains this attention capability even when the semantic information is not provided. The text attention is shown in the last column of Fig. 5. We can see that our text attention is quite selective in only picking up useful semantic features. Furthermore, the text attention statistics over the dataset provide particular insights into the pathologists' diagnosis. We can investigate which feature contributes the most to which disease label (see Fig. 3 (right)). For example, nuclear pleomorphism (feature type 1) shows small effects on the low-grade disease label, whereas cell crowding (2) has large effects on high-grade. We can justify the text attention by closely looking at the images in Fig. 5: high-grade images show an obviously high degree of cell crowding. Moreover, this result strongly demonstrates the successful image-text alignment of our dual-attention model.
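The upsampling of the 14 × 14 piece weights into an image-space heat map mentioned above can be done, for instance, as follows (a small sketch with assumed tensor shapes):

```python
import torch.nn.functional as F

def attention_heatmap(alpha_img, image_hw):
    # alpha_img: (B, G) image-piece weights with G = 14 * 14
    a = alpha_img.reshape(-1, 1, 14, 14)
    return F.interpolate(a, size=image_hw, mode='bilinear', align_corners=False)
```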


Fig. 5. From left to right: Test images (the bottom shows disease labels), pathologist's annotations, visual attention w/o text, visual attention and corresponding text attention (the bottom shows text inputs). Best viewed in color.

Image report generation. We fine-tune TandemNet using BPTT as an extra supervision and use the visual feature Δ(V) as the input of the LSTM at the first time step (we freeze the CNN for the whole training and the dual-attention model for the first 5 epochs, and then fine-tune with a smaller learning rate of 5e−5). We direct readers to [9] for details of LSTM training for image captioning. Figure 6 shows our promising results compared with the pathologist's descriptions. We leave the full report generation task as a future study.


Fig. 6. The pathologist’s annotations are in black and the automatic results of TandemNet are in green, which accurately describe the semantic concepts.

4 Conclusion

This paper proposed TandemNet, a novel multimodal network that jointly learns from medical images and diagnostic reports and predicts in an interpretable scheme through a novel dual-attention mechanism. Sufficient and comprehensive experiments on BCIDR demonstrate that TandemNet is favorable for more intelligent computer-aided medical image diagnosis.

References

1. Greenspan, H., van Ginneken, B., Summers, R.M.: Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. TMI 35(5), 1153–1159 (2016)
2. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR, pp. 3156–3164 (2015)
3. Xu, T., Zhang, H., Huang, X., Zhang, S., Metaxas, D.N.: Multimodal deep learning for cervical dysplasia diagnosis. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 115–123. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_14
4. Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J., Summers, R.M.: Learning to read chest x-rays: recurrent neural cascade model for automated image annotation. In: CVPR, pp. 2497–2506 (2016)
5. Zhang, Z., Xie, Y., Xing, F., Mcgough, M., Yang, L.: MDNet: a semantically and visually interpretable medical image diagnosis network. In: CVPR (2017)
6. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). doi:10.1007/978-3-319-46493-0_38
7. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
9. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR, pp. 3128–3137 (2015)
10. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: ICML, pp. 1310–1318 (2013)
11. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: EMNLP, pp. 1412–1421 (2015)
12. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R.S., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: ICML, pp. 2048–2057 (2015)
13. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). doi:10.1007/978-3-319-46493-0_39

BRIEFnet: Deep Pancreas Segmentation Using Binary Sparse Convolutions

Mattias P. Heinrich1 and Ozan Oktay2

1 Institute of Medical Informatics, University of Lübeck, Lübeck, Germany ([email protected], https://www.mpheinrich.de)
2 Biomedical Image Analysis Group, Imperial College London, London, UK

Abstract. Dense prediction using deep convolutional neural networks (CNNs) has recently advanced the field of segmentation in computer vision and medical imaging. In contrast to patch-based classification, it requires only a single path through a deep network to segment every voxel in an image. However, it is difficult to incorporate contextual information without using contracting (pooling) layers, which would reduce the spatial accuracy for thinner structures. Consequently, huge receptive fields are required which might lead to disproportionate computational demand. Here, we propose to use binary sparse convolutions in the first layer as a particularly effective approach to reduce complexity while achieving high accuracy. The concept is inspired by the successful BRIEF descriptors and complemented with 1 × 1 convolutions (cf. network in network) to further reduce the number of trainable parameters. Sparsity is in particular important for small datasets often found in medical imaging. Our experimental validation demonstrates accuracies for pancreas segmentation in CT that are comparable with state-of-the-art deep learning approaches and registration-based multi-atlas segmentation with label fusion. The whole network, which also includes a classic CNN path to improve local details, can be trained in 10 min. Segmenting a new scan takes 3 s even without using a GPU.

Keywords: Context features · Dilated convolutions · Dense prediction

1 Introduction

The automatic segmentation of medical volumes relies on methods that are able to delineate object boundaries at a local detail level, but also to avoid over-segmentation of similar neighbouring structures within the field-of-view. A robust method should therefore capture a large regional context. The segmentation of the pancreas in computed tomography (CT) is very important for computer-assisted diagnosis of inflammation (pancreatitis) or cancer. However, this task is challenging due to the highly variable shape, a relatively poor contrast and similar neighbouring abdominal structures.


In recent years, convolutional neural networks (CNNs) have shown immense progress in image recognition [1], by aggregating activations of object occurrences across the whole image using deep architectures and contracting pooling layers. Employing classic deep networks for segmentation using sliding patches results in many redundant computations and long inference times. Dense prediction (i.e. a parallel voxelwise segmentation) of whole images using the classical contracting architecture was extended by so-called upconvolutional layers in [2], which result in the loss of some detail information during spatial downsampling. Additional links transferring higher-resolution information have therefore been proposed [3]. Fully convolutional networks (FCNs) have also been adapted for medical image segmentation [4]. The ease of adapting existing deep networks for segmentation is inviting; however, it may result in overly complex architectures with many trainable weights that are potentially reliant on pre-training [5] and increase the computational time for training and inference. While fully convolutional networks can reach impressive segmentation accuracy when designed properly and trained with enough data, it is interesting to explore whether a completely new approach for including context into dense prediction could be considered. A multi-resolution deep network has been proposed in [6] for brain lesion segmentation, which uses two parallel paths for both high-resolution and low-resolution inputs. However, this approach still has to perform a series of convolutions to increase the capture range of the receptive field and might yield a higher correlation of weights across paths. In order to segment smaller abdominal organs, in particular the pancreas, which has poor gray value contrast, we believe that it is of great importance to efficiently encode information about its surroundings within the CT scan. The use of sparse long-range features [7], such as local binary patterns and BRIEF [8], has shown great success within classic machine learning approaches. However, robustly and discriminately representing vastly different anatomical shapes across subjects is challenging, and pancreas segmentation has not been successfully addressed using random forests or ferns. We therefore conclude that while long-range comparisons are powerful in practice, learning their optimal combination within the context of deep learning can further enhance their usefulness. In [9], a new concept to aggregate context was introduced by using dilated convolutions, with the advantage that very wide kernels with fewer trainable weights can be realised; this has also been used for MR segmentation in [10]. We propose to extend this idea to binary sparse convolutions and thereby model sparse long-range contextual relations with DCNNs to obtain simple and efficient, yet very powerful models for medical image segmentation. Our approach, which we call BRIEFnet, starts with a sparse convolution filter with a huge receptive field, so that each neuron in the following layer has only two non-zero input weights (which are restricted to ±1). In order to learn relations across these binary comparisons, we follow this layer with a 1 × 1 convolution, originally presented as network in network [11]. We will show that this sparse sampling enables the rapid training of expressive networks with very few parameters that outperform many recent alternative ideas for dense prediction. Reducing model complexity using binary weights is a new


promising concept [12] that can substantially reduce training and inference times. Similar to [6,9], we design a network for dense prediction without a need for contracting pooling layers. BRIEFnet can be seen as a complementary solution to fully-convolutional architectures with multi-resolution paths [4,6] or holistically nested networks [5,13]. The additional sparsity constraint in our method is particularly useful when few training scans are available. We will show in our experiments that our network, which also includes a classic CNN path for improved delineation of local details, reaches comparable performance to the much more complex DCNN technique of [5] and the best multi-atlas registration with label fusion on a public abdominal CT dataset [14].
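As a rough illustration of the binary sparse first layer described above, the PyTorch sketch below draws a fixed set of ±1 voxel-pair offsets once at construction and evaluates them densely; it uses wrap-around shifts instead of padding and ignores the stride-3 grid, so it is a simplified illustration under our own assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BinarySparseLayer(nn.Module):
    def __init__(self, n_pairs=1536, radius=49, centre_every=3, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # two random offsets per output channel inside a (2*radius+1)^3 receptive field
        off = torch.randint(-radius, radius + 1, (n_pairs, 2, 3), generator=g)
        off[::centre_every, 0] = 0          # every third pair includes the centre voxel
        self.register_buffer("offsets", off)

    def forward(self, x):
        # x: (D, H, W) padded CT sub-volume; output: (n_pairs, D, H, W) binary comparisons
        out = []
        for o1, o2 in self.offsets:
            a = torch.roll(x, shifts=tuple((-o1).tolist()), dims=(0, 1, 2))
            b = torch.roll(x, shifts=tuple((-o2).tolist()), dims=(0, 1, 2))
            out.append(a - b)               # fixed +1 / -1 weight pair
        return torch.tanh(torch.stack(out) / 100.0)  # scaled tanh, as described in Sect. 2
```

A trainable 1 × 1 convolution (e.g. nn.Conv3d(n_pairs, 256, kernel_size=1)) would then combine these channels, which is where most of the learnable parameters reside.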

2 Method

The input to the proposed network will be a region of interest around the pancreas. While our approach is fully convolutional and the output therefore invariant to translational offsets (except for boundary effects), a bounding box initialisation helps to obtain roughly comparable organ sizes across scans and reduces the computational complexity. Since this detection is not the focus of this work, we use a manual box that is enlarged by about 300% (in volume) around the pancreas. Several accurate algorithms exist for automatic bounding box and organ localisation, e.g. [15,16], which could be adapted for this task. BRIEFnet is designed to use a stack of 3D slices as input and to output a dense label prediction for every pixel within a stack of 2D slices. Thus a 3D volume is generated by applying the network to all slices within the region of interest. Finally, an edge-preserving smoothing of the predicted probability maps is performed, which are then thresholded to yield binary segmentations. The key idea behind BRIEFnet is to use binary sparse convolutions in order to realise larger receptive fields while keeping the model complexity low. The overview of our complete network is given in Fig. 1. We use 3D stacks of slices of the CT scan within the bounding box as input to our framework. The images are padded to keep the same dimensions for all layers using the lowest intensity value (−1000 HU). In total, 2 × 1536 non-zero weights are determined, one ±1 pair per kernel, by sampling the receptive field using a uniformly random distribution (with a stride of three voxels). This random sparsity of connections in the first layer is inspired by the irregular k-space sampling found in compressed sensing. Note that it is important to increase the throughput of the central pixel as discussed in [9], which is also similar in spirit to residual learning [1]. We increase the probability of drawing the centre voxel within the receptive field so that every third weight pair contains it. Subsequently, the 1536 channels (for each voxel in the grid) are passed through a tanh activation. To avoid a complete saturation of the nonlinearity, we divide the inputs by a constant (here 100). Optimising binary weights can be challenging [12] and would in this scenario lead to very many degrees of freedom, motivating the random selection at model construction. Why is such a sparse sampling sufficient to gather all necessary image data? The key lies in the following 1 × 1 convolution that combines the output


Fig. 1. Architecture of BRIEFnet (the novel part is highlighted with a red box). A huge receptive field of 99 × 99 × 99 is realised by our sparse sampling (here of 1536 pixel pairs). The weights of this first layer are not learned, are restricted to +1 or −1, and are followed by a tanh activation. Importantly, a following 1 × 1 convolution combines information across channels. To reduce computation time a stride of 3 is used for the first contextual layer, requiring an up-convolution at the end. The additional local CNN path uses six traditional small convolutions with 64 channels, which are merged with the context information using a pointwise multiplication layer. Given an input 3D stack, a dense probabilistic segmentation of the centre slice is obtained. Note that each ReLU is preceded by batch normalisation. In total there are only ≈1 million trainable weights.

of multiple weight pairs. This locally fully-connected layer (with shared weights across all spatial positions, as in [11]) is able to find the optimal combination of sparse pixel-pair activations and enables a meaningful dimensionality reduction. When multiplying the trained weights of both layers, one notices patterns that are similar to classic large convolution filters, but our framework removes redundancy and achieves a much lower complexity than most other approaches. The FCN network for semantic segmentation in [2] has over 100 million parameters and the holistically nested network of [13] over ten million. We only require one million, most of them for the 1536 × 256 matrix multiplication, which is particularly efficient to compute. In order to compensate for local errors (e.g. due to the stride of 3 voxels), we additionally include a classic local CNN path for dense prediction, which is combined with the contextual layers using a pointwise multiplication (as done in [17]). While the input to our network is 3D, all following spatial convolutions are 2D (due to memory constraints); we therefore predict the segmentation of each 2D slice individually and stack them together, as also done for pancreas segmentation in [5]. The inference for all slices of one 3D image takes only about 3 s on a CPU.

The median error of the RNN-based method is similar to the results reported in [8]. While they achieved a mean error of less than one frame, their detection step requires the knowledge of complete sequences, hence it will not work in a prospective scenario. The learning-based method in [9] can be used for prospective detection, but some of the proposed features were heuristically designed for

Fig. 4. An example to illustrate the RNN-based method. From left to right are the original XA frame (left), the vessel layer after layer separation (middle left), the vesselness image (middle right), the contrast signal for the whole sequence (right). The color markers in the signal show the prediction of BCF with LSTM (red) and the ground truth (green). Note that the artefact of diaphragm does not appear in the vesselness image thanks to layer separation.


Table 1. The statistics of the absolute error for the 3 methods. The two columns in the middle show the mean, standard deviation, median of the absolute errors and the median of non-absolute errors (*) in frames. The last two columns show the number of sequences on which the method made an absolute error no larger than 3 frames or larger than 10 frames.

Methods                  mean (std)   median (*)   #(error ≤3)   #(error >10)
Condurache et al. [3]    6.2 (7.1)    5 (4)        29/73         10/73
CNN-based                3.9 (4.9)    2.5 (1)      55/80         5/80
RNN-based                3.6 (4.6)    2 (−0.5)     55/80         7/80

X-ray images of the LA for EP procedures, which have different image features from the XA of coronary interventions. Compared to these methods, our approaches were designed for prospective settings, and the CNN-based method is a general framework that could potentially be applied in different clinical procedures. The RNN-based learning with a handcrafted feature has a slightly lower mean and median error than the CNN-based method, although the latter has a more complex and deeper architecture. This might contradict what is commonly known about the performance of deep learning. The possible reasons may be twofold. First, the size of the training data was small; even with data augmentation and a reduced CNN model, some over-fitting was observed. Second, the CNN treats frames independently rather than modeling their temporal relations. Although CNNs perform excellently in many classification tasks, detecting BCF requires a classifier that has good accuracy for data on the border between two classes. In terms of computation efficiency (test time), the method of Condurache et al. needed 111 ms to 443 ms to process a frame, while the CNN-based method ran very fast and used on average only 14 ms to process one frame. The RNN-based method ran on average 64 ms/frame on images of the original size 512 × 512 or 1024 × 1024, and 140 ms/frame on images of the original size 776 × 776. As the test time of the RNN-based method was based on a MATLAB implementation with a single CPU core, it has large potential to run in real-time.

Lesions were labeled as relevant (Gleason score >6) and non-relevant (Gleason score ≤6). We use a cascaded 3D elastic registration as a first step of preprocessing to compensate for any motion that may have occurred during acquisitions [13]. In order to increase robustness, a pairwise registration between the T2-weighted image and the corresponding low-b diffusion image, as the representative of the DWI set, is performed. We then apply the computed deformation to compensate for motion in both the ADC map and the high-b diffusion image. Similarly, we perform a pairwise registration between the T2-weighted image and the late contrast-enhanced image as the representative of the DCE set. Additionally, an 80 mm × 80 mm region of interest (ROI) mask was applied to each slice to ensure only the prostate and surrounding areas were considered. After intra-patient registration, all images are then reformatted into a T2-weighted 100 mm × 100 mm × 60 mm image grid, which corresponds to roughly 200 × 200 pixel 2D slices. Two ground-truth maps corresponding to benign and malignant tumor labels were created for each dataset by placing a Gaussian distribution (with 3σ of 10 mm) at each lesion point in 2D, as shown in Fig. 1. The Gaussian distribution was also propagated through-plane with a standard deviation adjusted to the acquisition slice thickness. Only slices containing any tumor labels were selected for processing. This final set totals 824 slices from the 202 patient cases.
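A possible way to construct such Gaussian ground-truth maps in 2D is sketched below; the function and parameter names are ours, and the stated 3σ = 10 mm implies σ ≈ 3.3 mm in-plane (the default pixel spacing assumes the roughly 0.5 mm grid mentioned above).

```python
import numpy as np

def gaussian_label_map(shape, lesion_xy, sigma_mm=10.0 / 3.0, spacing_mm=0.5):
    # shape: (H, W) of the slice; lesion_xy: list of (x, y) lesion points in pixels
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    sigma_px = sigma_mm / spacing_mm
    out = np.zeros(shape, dtype=np.float32)
    for x, y in lesion_xy:
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2.0 * sigma_px ** 2))
        out = np.maximum(out, g)   # keep the strongest response where lesions overlap
    return out
```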

Fig. 1. Sample slices from T2 images coupled with marked tumor point locations as Gaussian responses. The green and red regions correspond to benign and malignant tumors, respectively.

2.2 Network Design and Training

We designed three convolutional-deconvolutional image-to-image networks (Models 0, 1, and 2) with increasing complexity in terms of number of features and layers as shown in Fig. 2. Compared to the 13 convolutional and 13 deconvolutional layers of SegNet [2], these models contain fewer layers and features to avoid over-fitting. Each model’s output consists of two channels signifying the malignant and benign tumor categories. Batch normalization was used after each convolutional layer during training. A 256 × 256 input image was used. In addition to the three networks, the following modifications were also evaluated:



Fig. 2. The three networks evaluated in the proposed method.

– Input images available (T2, ADC, High B-value, K-trans)
– Activation function
  • Rectified Linear Unit (ReLU)
  • Leaky ReLU (α = 0.01)
  • Very Leaky ReLU (α = 0.3) – improved classification performance in [1]
– Adding skip-connections [4]
– Training data augmentation (Gaussian noise addition, rotation, shifting)

All networks were trained using Theano [11] with batch gradient descent (batch size = 10). A mean-squared error loss function computed within a mask of the original image slice size was used. Training was performed for a maximum of 100 epochs, and a minimal loss on a small set of validation data was used to select the model. A constant learning rate of 0.005 was used throughout. In order to assess the sampling variability, we performed 5-fold cross-validation bootstrapped five times with different sets of data chosen randomly for training and testing; hence 20% of the data was used for testing in each fold. Using this approach we obtain a range of results and can compute a rather sampling-independent average performance. As a performance indicator, we use the area under the curve (AUC) for each classification run. We also make sure that no slices from a single patient fall into both the training/validation and test datasets. Classification was determined by the intensity ratios from each channel at the given location.
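For illustration, the sketch below builds a small two-channel convolutional encoder-decoder in PyTorch (the original work used Theano); the layer counts and channel widths are assumptions and do not reproduce Models 0–2 exactly:

```python
import torch
import torch.nn as nn

class SmallEncoderDecoder(nn.Module):
    """Minimal convolutional encoder-decoder with a two-channel output
    (benign / malignant response maps). Channel widths are illustrative."""
    def __init__(self, in_channels=4, negative_slope=0.0):
        super().__init__()
        act = nn.LeakyReLU(negative_slope) if negative_slope > 0 else nn.ReLU()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.BatchNorm2d(16), act,
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), act,
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(32, 16, 3, padding=1), nn.BatchNorm2d(16), act,
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(16, 2, 3, padding=1),   # two channels: benign, malignant
        )

    def forward(self, x):                     # x: (batch, in_channels, 256, 256)
        return self.decoder(self.encoder(x))

model = SmallEncoderDecoder(in_channels=4)    # T2, ADC, high-b, K-trans
out = model(torch.randn(1, 4, 256, 256))      # -> (1, 2, 256, 256)
# Simplified loss: the paper computes MSE within a slice-sized mask
loss = nn.MSELoss()(out, torch.zeros_like(out))
```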

3 Results

Our first aim is to assess performance in lesion characterization. The second aim is to better understand the contribution of the different mpMRI contrasts to the overall characterization performance. It is desirable to have a compromise between the acquisition length (a smaller number of channels) and the performance. The performance results using varying numbers of multi-parametric channels are shown in Table 1 and plotted in Fig. 3.



Table 1. Average AUC results of the three networks with different combinations of input channels, without data augmentation.

Network   T2+ADC+HighB+KTRANS   T2+ADC+HighB   T2+ADC   ADC
M0        83.2%                 77.2%          79.0%    78.3%
M1        83.4%                 80.9%          77.7%    77.9%
M2        81.2%                 79.3%          78.5%    80.4%

Fig. 3. AUC results of the three networks with differing input modalities.

It is clear that the aggregate of all modalities produced the best result across all models. However, it is clinically desirable to eliminate the dynamic sequence scan, both to save time and to avoid contrast agent injection. The performance in this case may still provide a clinically acceptable negative predictive value (NPV) to rule out malignant lesions and avoid invasive biopsies (by selecting an appropriate operating point on the ROC curve). This hypothesis must be further investigated and validated. Model 1 produces the best average AUC with the least variability, while Model 0 has the best single AUC score among all the folds tested. Sample ROC curves are shown in Figs. 4 and 5. Results based on all four input channels with variations that add skip connections or change the activation function are shown in Table 2 and Fig. 6. Using the leaky and very leaky ReLUs resulted in inferior performance compared to ReLUs. However, skip connections resulted in improved performance for the most complex model, with an average AUC of 83.3% and reduced variability across folds. Training data augmentation by translation and rotation coupled with Gaussian noise resulted in a consistent improvement: an average AUC of 95% was reached when we applied the data augmentation along with skip connections to model M1. We also coupled the image-to-image localization and classification network with a discriminator that aims to identify real and generated probability maps. The resulting network drives the evolution of the image-to-image localization and classification network by the weighted sum of the regression cost and the binary classification cost stemming from the use of the discriminator.


Fig. 4. ROC Curve of Model 1 using all four MRI modalities in our dataset.


Fig. 5. ROC Curve of Model 1 using skip connections and training data augmentation.

The training is conducted using approaches recently proposed in the generative adversarial networks (GAN) literature [5,8]. In our limited experiments, we found that this adversarial setup yielded performance similar to that of the image-to-image network alone. The use of different adversarial approaches is part of our future directions (Fig. 6).

Table 2. Average AUC results of the three networks with architecture changes.

Network   Skip Connect   Leaky Rectify   Very Leaky Rectify
M0        81.6%          80.5%           79.4%
M1        81.8%          81.2%           80.9%
M2        83.3%          81.0%           79.4%

Fig. 6. Average AUC results of the three networks with skip connections and modified transfer functions. In this case, the leaky and very leaky ReLUs had alpha parameters of 0.01 and 0.3, respectively.


4 Conclusions and Future Work

We have presented a convolutional image-to-image deep learning pipeline for performing classification without the fully connected layers of conventional classification pipelines. The same network could also be used for localization of suspicious regions by examining the responses across the different channels. We have experimented and shown results obtained by varying input channels and network parameters to arrive at a recommended architecture for optimal performance. An average AUC of 83.4% for classification without data augmentation is promising, and improvements are possible, for instance, by inclusion of a prostate segmentation region. This will allow the network to focus solely on regions within the prostate and not get penalized for responses outside this region. We also plan to develop and evaluate localization of tumors from the individual channel responses. Although optimal classification was achieved using four input images, in practice it is undesirable to inject patients with contrast to obtain K-trans and DCE images. Therefore we hypothesize that methods developed without use of K-trans or DCE images could find more utility in an early diagnosis scenario as a gatekeeper before a more invasive biopsy.

Acknowledgments. Data used in this research were obtained from The Cancer Imaging Archive (TCIA) sponsored by the SPIE, NCI/NIH, AAPM, and Radboud University [7].

References 1. Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A., Mougiakakou, S.: Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE TMI 35(5), 1207–1216 (2016) 2. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. In: CVPR 2015, p. 5 (2015) 3. Chung, A.G., Shafiee, M.J., Kumar, D.: Discovery radiomics for multi-parametric MRI prostate cancer detection. In: Computer Vision and Pattern Recognition, pp. 1–8 (2015) 4. Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C.: The importance of skip connections in biomedical image segmentation. In: Carneiro, G., Mateus, D., Peter, L., Bradley, A., Tavares, J.M.R.S., Belagiannis, V., Papa, J.P., Nascimento, J.C., Loog, M., Lu, Z., Cardoso, J.S., Cornebise, J. (eds.) LABELS/DLMIA -2016. LNCS, vol. 10008, pp. 179–187. Springer, Cham (2016). doi:10.1007/978-3-319-46976-8 19 5. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates Inc. (2014) 6. Kainz, P., Urschler, M., Schulter, S., Wohlhart, P., Lepetit, V.: You should use regression to detect cells. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 276–283. Springer, Cham (2015). doi:10. 1007/978-3-319-24574-4 33



7. Litjens, G., Debats, O., Barentsz, J., Karssemeijer, N., Huisman, H.: Computeraided detection of prostate cancer in MRI. IEEE TMI 33(5), 1083–1092 (2014) 8. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434 (2015) 9. Reda, I., Shalaby, A., Khalifa, F., Elmogy, M., Aboulfotouh, A., El-Ghar, M.A., Hosseini-Asl, E., Werghi, N., Keynton, R., El-Baz, A.: Computer-aided diagnostic tool for early detection of prostate cancer. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 2668–2672, September 2016 10. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics, 2016. CA Cancer J. Clin. 66(1), 7–30 (2016) 11. Theano Development Team. Theano: a Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016 12. Tofts, P.S., Kermode, A.G.: Measurement of the blood-brain barrier permeability and leakage space using dynamic mr imaging. 1. fundamental concepts. Magn. Reson. Med. 17(2), 357–367 (1991) 13. Wells, W.M., Viola, P., Atsumi, H., Nakajima, S., Kikinis, R.: Multi-modal volume registration by maximization of mutual information. Med. Image Anal. 1(1), 35–51 (1996)

Deep Image-to-Image Recurrent Network with Shape Basis Learning for Automatic Vertebra Labeling in Large-Scale 3D CT Volumes Dong Yang1 , Tao Xiong2 , Daguang Xu3(B) , S. Kevin Zhou3 , Zhoubing Xu3 , Mingqing Chen3 , JinHyeong Park3 , Sasa Grbic3 , Trac D. Tran2 , Sang Peter Chin2 , Dimitris Metaxas1 , and Dorin Comaniciu3 1

Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA 2 Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218, USA 3 Medical Imaging Technologies, Siemens Healthcare Technology Center, Princeton, NJ 08540, USA {daguang.xu,shaohua.zhou,sasa.grbic, dorin.comaniciu}@siemens-healthineers.com

Abstract. Automatic vertebra localization and identification in 3D medical images plays an important role in many clinical tasks, including pathological diagnosis, surgical planning and postoperative assessment. In this paper, we propose an automatic and efficient algorithm to localize and label the vertebra centroids in 3D CT volumes. First, a deep image-to-image network (DI2IN) is deployed to initialize vertebra locations, employing the convolutional encoder-decoder architecture. Next, the centroid probability maps from DI2IN are modeled as a sequence according to the spatial relationship of vertebrae, and evolved with the convolutional long short-term memory (ConvLSTM) model. Finally, the landmark positions are further refined and regularized by another neural network with a learned shape basis. The whole pipeline can be conducted in the end-to-end manner. The proposed method outperforms other state-of-the-art methods on a public database of 302 spine CT volumes with various pathologies. To further boost the performance and validate that large labeled training data can benefit the deep learning algorithms, we leverage the knowledge of additional 1000 3D CT volumes from different patients. Our experimental results show that training with a large database improves the performance of proposed framework by a large margin and achieves an identification rate of 89%.

1 Introduction

Accurate and automatic localization and identification of human vertebrae have become of great importance in 3D spinal imaging for clinical tasks such as pathological diagnosis, surgical planning and post-operative assessment of pathologies.
D. Yang and T. Xiong—Authors contributed equally.



Fig. 1. Proposed method consisting of three major components: DI2IN, ConvLSTM and shape-based Network.

Specific applications such as vertebrae segmentation, fracture detection, tumor detection and localization, registration, and statistical shape analysis can benefit from efficient and precise vertebrae detection and labeling algorithms. However, designing such an algorithm requires addressing various challenges such as pathological cases, image artifacts and limited field-of-view (FOV). In the past, many approaches have been developed to address these limitations in spine detection problems. In [1], Glocker et al. presented a two-stage approach for localization and identification of vertebrae in CT, which achieved an identification rate of 81%. This approach uses regression forests and a generative model for prediction, and it requires handcrafted feature vectors in pre-processing. Glocker et al. [2] then further extended vertebrae localization to handle pathological spine CT. This supervised classification-forest-based approach achieves an identification rate of 70% and outperforms the state-of-the-art on a pathological database. Recently, deep convolutional neural networks have also been highlighted in the research on human vertebrae detection. A joint learning model with deep neural networks (J-CNN) [3] was designed to effectively identify the type of vertebra and improved the identification rate (85%) by a large margin. They trained a random forest classifier to coarsely detect the vertebral centroids instead of directly applying the neural network to the whole CT volumes. Suzani et al. [4] also presented a deep neural network for fast vertebrae detection. This approach first extracts intensity-based features and then uses a deep neural network to localize the vertebrae. Although this approach achieved a high detection rate, it suffers from a large mean error compared to other approaches. To meet the requirements of both accuracy and efficiency and take advantage of deep neural networks, we present an approach, shown in Fig. 1, with the following contributions: (a) Deep Image-to-Image Network (DI2IN) for Voxel-Wise Regression: Instead of extracting handcrafted features or adopting coarse classifiers, the proposed deep image-to-image network operates directly on the 3D CT volumes and outputs multichannel probability maps associated with the different vertebra centers. The high responses in the probability maps intuitively indicate the location and label of the vertebrae. The training is formulated as a multichannel voxel-wise regression. Since the DI2IN is implemented in a fully convolutional way, it is significantly more time-efficient than sliding-window approaches. (b) Response Enhancement with ConvLSTM: Inspired by [5], we introduce a recurrent neural network (RNN) to model the spatial



relationship of the vertebra responses from DI2IN. The vertebrae can be interpreted as a chain structure from head to hip according to their relative positions. The sequential order of the chain-structured model enables the vertebra responses to communicate with each other using a recurrent model such as an RNN. The popular ConvLSTM architecture is adopted as our RNN to capture the spatial correlation between vertebra predictions. The ConvLSTM studies the pair-wise relations of vertebra responses and regularizes the output of the DI2IN. (c) Refinement using a Shape Basis Network: To further refine the coordinates of the vertebrae, we incorporate a shape basis network that takes advantage of the holistic structure of the spine. Instead of learning a quadratic regression model to fit the spinal shape, we use the coordinates of the spines in the training samples to construct a shape-based dictionary and formulate the training process as a regression problem. The shape basis network takes the coordinates from the previous stage as input and generates the coefficients associated with the dictionary, which define a linear combination of atoms from the shape-based dictionary. By embedding the shape regularity in the training of the neural network, ambiguous coordinates are removed and the representation is optimized, which further improves the localization and identification performance. Compared to the previous method [3], which applies a classic refinement method as a post-processing step, our algorithm introduces an end-to-end trainable network in the refinement step for the first time, which allows us to train each component separately and then fine-tune them together in an end-to-end manner.

2 Method

2.1 Deep Image-to-Image Network (DI2IN)

In this section, we present the architecture and details of the proposed deep image-to-image network, as shown in Fig. 2. The basic architecture is designed as a convolutional encoder-decoder network [6]. Compared to a sliding-window approach, the DI2IN is implemented as voxel-wise, fully convolutional, end-to-end learning and is applied to the 3D CT volumes directly. Basically, the DI2IN takes the 3D CT volume as input and generates the multichannel probability maps simultaneously. The ground truth probability maps are generated by a Gaussian distribution, Igt = 1/(σ√(2π)) · exp(−‖x − μ‖²/(2σ²)), where x ∈ R3 and μ denote the voxel coordinates and the ground truth location, respectively, and σ is pre-defined to control the scale of the Gaussian distribution. Each channel's prediction Iprediction is associated with the centroid location and type of one vertebra. The loss function is defined as |Iprediction − Igt|² for each voxel. Therefore, the whole learning problem is formulated as a multichannel voxel-wise regression. Instead of using a classification formulation for detection, regression is tremendously helpful for determining the predicted coordinates, and it relieves the issue of imbalanced training samples, which is very common in semantic segmentation. The encoder is composed of convolution, max-pooling and rectified linear unit (ReLU) layers, while the decoder is composed of convolution, ReLU and upsampling layers. Max-pooling layers are of great importance to increase the receptive field and extract large contextual information.



Fig. 2. Proposed deep image-to-image network (DI2IN). The front part is a convolutional encoder-decoder network with feature concatenation, and the back end is a multi-level deep supervision network. Numbers next to convolutional layers are channel numbers. Extra 26-channel convolution layers are implicitly used in deep supervision.

Upsampling layers are implemented with bilinear interpolation to enlarge and densify the activations, which further enables end-to-end voxel-wise training without losing resolution details. The convolutional filter size is 1 × 1 × 1 in the output layer and 3 × 3 × 3 in the other layers. The max-pooling filter size is 2 × 2 × 2, down-sampling by half in each dimension. In the upsampling layers, the input features are upsampled by a factor of 2 in each dimension. The stride is set to 1 in order to maintain the same size in each channel. Additionally, we incorporate feature concatenation and deep supervision in DI2IN. In feature concatenation, a bridge is built directly from an encoder layer to the corresponding decoder layer, which passes the feature information forward from the encoder and then concatenates it with the decoder layer [7]. As a result, the DI2IN benefits from both local and global contextual information. Deep supervision has been adopted in [8–10] to achieve good boundary detection and organ segmentation. In the DI2IN, we incorporate a more complex deep supervision approach to further improve the performance. Several branches diverge from the middle layers of the decoder network. With the appropriate upsampling and convolutional operations, the output size of all branches matches the size of the 26-channel ground truth. In order to take advantage of deep supervision, the total loss function losstotal of DI2IN is defined as the combination of the losses lossi of all output branches and the final output loss, as follows:

losstotal = Σi lossi + lossfinal      (1)
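A minimal sketch of the total objective in Eq. (1), assuming every branch output has already been upsampled to the 26-channel ground-truth size (PyTorch; names are illustrative):

```python
import torch.nn as nn

mse = nn.MSELoss()

def di2in_total_loss(branch_outputs, final_output, target):
    # Eq. (1): sum of the voxel-wise regression losses of all auxiliary
    # decoder branches plus the loss of the final output
    return sum(mse(o, target) for o in branch_outputs) + mse(final_output, target)
```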


2.2 Response Enhancement Using Multi-layer ConvLSTM

Given the image I, the DI2IN generates a probability map P(vi|I) for the centroid of each vertebra i with high confidence. The vertebrae are localized at the peak positions vi of the probability maps. However, we find that these probability maps are not perfect yet: some probability maps have no response, or a very low response, at the ground truth location because of the similar image appearance of several vertebrae (e.g. T1–T12). In order to handle the problem of missing responses, we propose an RNN to effectively enhance the probability maps by incorporating prior knowledge of the spinal structure. RNNs have been widely developed and used in many applications, such as natural language processing and video analysis. They are capable of handling arbitrary input sequences and perform the same processing on every element of the sequence, with memory of the previous computation. In our case, the spatial relation of the vertebrae naturally forms a chain structure from top to bottom. Each element of the chain is the response map of one vertebra centroid. The proposed RNN model treats the chain as a sequence and enables the vertebra responses of DI2IN to communicate with each other. In order to adjust the 3D response maps of the vertebrae, we apply the convolutional LSTM (ConvLSTM) as our RNN model, shown in Fig. 3. Because the z direction is the most informative dimension, the x and y dimensions are set to 1 for all the convolution kernels. During inference, we pass information forward and backward to regularize the output of DI2IN. The passing process can be conducted for k iterations (k = 2 in our experiments). All input-to-hidden and hidden-to-hidden operations are convolutions. Therefore, the response distributions can be adjusted with the necessary displacement or enhanced by the neighbors' responses. Equation (2) describes how the LSTM unit is updated at each time step. X1, X2, ... and Xt are the input states for the vertebrae, C1, C2, ... and Ct the cell states, and H1, H2, ... and Ht the hidden states. it, ft and ot are the gates of the ConvLSTM.

Fig. 3. The multi-layer ConvLSTM architecture for updating the vertebra response.



We use several sub-networks G to update Xt and Ht, which differs from the original ConvLSTM setting (the original work uses only a single kernel). Each G consists of three convolutional layers with 1 × 1 × 9 kernels and filter numbers 9, 1 and 1. The sub-networks are more flexible and have a larger receptive field than a single kernel, which helps capture the spatial relationship of all vertebrae.

it = σ(Gxi(Xt) + Ghi(Ht−1) + Wci ⊙ Ct−1 + bi)
ft = σ(Gxf(Xt) + Ghf(Ht−1) + Wcf ⊙ Ct−1 + bf)
Ct = ft ⊙ Ct−1 + it ⊙ tanh(Gxc(Xt) + Ghc(Ht−1) + bc)      (2)
ot = σ(Gxo(Xt) + Gho(Ht−1) + Wco ⊙ Ct + bo)
Ht = ot ⊙ tanh(Ct)
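As a concrete but simplified reference, the PyTorch cell below follows the standard ConvLSTM update; it omits the peephole terms Wc* ⊙ C of Eq. (2) and replaces the three-layer sub-networks G with a single joint convolution, so it is an illustrative approximation rather than the authors' exact model:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Simplified 3D ConvLSTM cell: gates are computed from a single
    convolution over the concatenated input and hidden state."""
    def __init__(self, in_channels, hidden_channels, kernel=(1, 1, 9)):
        super().__init__()
        pad = tuple(k // 2 for k in kernel)
        self.gates = nn.Conv3d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel, padding=pad)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_next = f * c + i * torch.tanh(g)     # cell state update
        h_next = o * torch.tanh(c_next)        # hidden state update
        return h_next, c_next
```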

2.3 Shape Basis Network for Refinement

As shown in Fig. 4, the ConvLSTM generates clear probability maps, where the high response in the map indicates the potential location of the landmark (centroid of the vertebrae). However, sometimes due to image artifacts and low image resolution, it is difficult to guarantee there is no false positive. Therefore, we present a shape basis network to help refine the coordinates inspired by [11].

Fig. 4. Probability map examples from DI2IN (left in each case) and ConvLSTM (right in each case). The prediction in “Good Cases” is close to the ground truth location. In “Bad Cases”, some false positives appear at remote locations in addition to the response at the ground truth location.

Given a pre-defined shape-based dictionary D ∈ RN×M and a coordinate vector y ∈ RN generated by the ConvLSTM, the proposed shape basis network takes y as input and outputs the coefficient vector x ∈ RM associated with the dictionary D. Therefore, the refined coordinate vector ŷ is defined as Dx. In practice, the shape-based dictionary D is simply learned from the training samples. For example, the dictionary Dz associated with the vertical axis is constructed from



the z coordinates of the vertebra centroids in the training samples. N and M indicate the number of vertebrae and the number of atoms in the dictionary, respectively. The proposed shape basis network consists of several fully connected layers. Instead of regressing the refined coordinates, the network is trained to regress the coefficients x associated with the shape-based dictionary D. The learning problem is formulated as a regression model and the loss function is defined as:

lossshape = Σi ( ‖Dxi − yi‖2² + λ‖xi‖1 )      (3)

xi and yi denote the coefficient vector and the ground truth coordinate vector of the i-th training sample. λ is the ℓ1-norm coefficient that balances sparsity and residual. Intuitively, the shape basis network learns to find the best linear combination of dictionary atoms to refine the coordinates. In our case, we focus on the refinement of the vertical coordinates. The input of the shape basis network is obtained directly from the output of the ConvLSTM using a non-trainable fully connected layer. The layer has uniform weights and no bias term, and it generates the correct coordinates when the response is clear. Such a setting enables an end-to-end scheme for fast inference instead of solving the loss function directly.
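To make the role of the dictionary concrete, the sketch below refines a coordinate vector by an explicit sparse-coding solve with scikit-learn's Lasso, used here as a stand-in for the learned shape basis network; all array shapes and values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D = rng.normal(size=(26, 100))                 # columns: z-coordinates of 100 training spines
y = D[:, 0] + rng.normal(scale=2.0, size=26)   # noisy coordinates from the ConvLSTM stage

# Sparse coefficients x minimizing ||D x - y||^2 + lambda * ||x||_1, as in Eq. (3)
x = Lasso(alpha=0.1, fit_intercept=False).fit(D, y).coef_
y_refined = D @ x                              # refined vertical coordinates
```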

3 Experiments

First, we evaluate the proposed method on the database introduced in [2], which consists of 302 CT scans with various types of lesions. The dataset contains cases with unusual appearance, such as abnormal spinal structures and bright visual artifacts due to metal implants from post-operative procedures. Furthermore, the FOV of each CT image varies greatly in terms of vertical cropping, image noise and physical resolution [1]. Most cases contain only part of the entire spine; the overall spinal structure can be seen in only a few examples. Large changes in lesions and the limited FOV increase the complexity of the appearance of the vertebrae, making it difficult to accurately localize and identify the spinal column. The ground truth is marked at the center of gravity of each vertebra and annotated by clinical experts. In previous work [1,3,4], two different settings have been used on this database: the first one uses 112 images for training and another 112 images for testing; the second one takes all the data in the first setting plus an extra 18 images as training data (242 training images overall), and 60 unseen images are used as testing data. For fair comparison, we follow the same configurations, referred to as Set 1 and Set 2, respectively, in our experiments. Table 1 compares our results with the numerical results reported for previous methods [2–4] in terms of the Euclidean distance error (mm) and the identification rate (Id.Rates) defined by [1]. The average mean errors for these two settings are 10.6 mm and 8.7 mm, respectively, and the identification rates are 78% and 85%, respectively. Overall, the proposed method is superior to the state-of-the-art methods on the same database with respect to mean error and identification rate.
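For reference, both reported metrics can be computed per scan as in the short sketch below; the 20 mm distance threshold is our assumption for illustration — the paper defers to the identification-rate definition of Glocker et al. [1]:

```python
import numpy as np

def localization_metrics(pred, gt, id_threshold_mm=20.0):
    """Mean localization error (mm) and identification rate for one scan.
    pred, gt: (N, 3) arrays of predicted / ground-truth centroids in mm for
    the vertebrae present in the scan."""
    errors = np.linalg.norm(pred - gt, axis=1)
    return errors.mean(), (errors < id_threshold_mm).mean()
```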



Table 1. Comparison of localization errors in mm and identification rates among different methods. Our method is trained and tested using the default data settings “Set 1” and “Set 2”, while “+1000” indicates training with an additional 1000 labeled spine volumes and evaluation on the same testing data. All results are for the whole spine (region “All”).

Method                Set 1 Mean   Set 1 Std   Set 1 Id.Rates   Set 2 Mean   Set 2 Std   Set 2 Id.Rates
Glocker et al. [2]    12.4         11.2        70%              13.2         17.8        74%
Suzani et al. [4]     18.2         11.4        –                –            –           –
Chen et al. [3]       –            –           –                8.8          13.0        84%
Our method            10.6         8.7         78%              8.7          8.5         85%
Our method +1000      9.0          8.8         83%              6.9          7.6         89%

We collected an additional 1000 CT volumes and trained the proposed DI2IN from scratch to verify whether training a neural network with more labeled data improves its performance. This dataset covers large visual variations of the spinal column (e.g. age, abnormality, FOV, contrast, etc.). We evaluated on the same testing data and report the results in Table 1 (shown as “Our method +1000”). As can be seen, adding more training data greatly improves the performance of the proposed method, verifying that a large amount of labeled data effectively boosts the power of DI2IN.

4 Conclusion

In this paper, we presented an accurate and automatic method for human vertebrae localization and identification in 3D CT volumes. Our approach outperformed other state-of-the-art methods of spine detection and labeling in terms of localization mean error and identification rate. Acknowledgements. We thank Dr. David Liu who provided insight and expertise that greatly assisted the research.

References 1. Glocker, B., Feulner, J., Criminisi, A., Haynor, D.R., Konukoglu, E.: Automatic localization and identification of vertebrae in arbitrary field-of-view CT scans. In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012. LNCS, vol. 7512, pp. 590–598. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33454-2 73 2. Glocker, B., Zikic, D., Konukoglu, E., Haynor, D.R., Criminisi, A.: Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8150, pp. 262–270. Springer, Heidelberg (2013). doi:10.1007/ 978-3-642-40763-5 33



3. Chen, H., Shen, C., Qin, J., Ni, D., Shi, L., Cheng, J.C.Y., Heng, P.-A.: Automatic localization and identification of vertebrae in spine CT via a joint learning model with deep neural networks. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9349, pp. 515–522. Springer, Cham (2015). doi:10.1007/978-3-319-24553-9 63 4. Suzani, A., Seitel, A., Liu, Y., Fels, S., Rohling, R.N., Abolmaesumi, P.: Fast automatic vertebrae detection and localization in pathological CT scans - a deep learning approach. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 678–686. Springer, Cham (2015). doi:10.1007/ 978-3-319-24574-4 81 ˇ 5. Payer, C., Stern, D., Bischof, H., Urschler, M.: Regressing heatmaps for multiple landmark localization using CNNs. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 230–238. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 27 6. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoderdecoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015) 7. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10. 1007/978-3-319-24574-4 28 8. Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403 (2015) 9. Merkow, J., Kriegman, D., Marsden, A., Tu, Z.: Dense volume-to-volume vascular boundary detection. arXiv preprint arXiv:1605.08401 (2016) 10. Dou, Q., Chen, H., Jin, Y., Yu, L., Qin, J., Heng, P.-A.: 3D deeply supervised network for automatic liver segmentation from CT volumes. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 149–157. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 18 11. Yu, X., Zhou, F., Chandraker, M.: Deep deformation network for object landmark localization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 52–70. Springer, Cham (2016). doi:10.1007/ 978-3-319-46454-1 4

Automatic Liver Segmentation Using an Adversarial Image-to-Image Network Dong Yang1 , Daguang Xu2(B) , S. Kevin Zhou2 , Bogdan Georgescu2 , Mingqing Chen2 , Sasa Grbic2 , Dimitris Metaxas1 , and Dorin Comaniciu2 1

2

Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA Medical Imaging Technologies, Siemens Healthcare Technology Center, Princeton, NJ 08540, USA [email protected]

Abstract. Automatic liver segmentation in 3D medical images is essential in many clinical applications, such as pathological diagnosis of hepatic diseases, surgical planning, and postoperative assessment. However, it is still a very challenging task due to the complex background, fuzzy boundary, and various appearance of liver. In this paper, we propose an automatic and efficient algorithm to segment liver from 3D CT volumes. A deep image-to-image network (DI2IN) is first deployed to generate the liver segmentation, employing a convolutional encoderdecoder architecture combined with multi-level feature concatenation and deep supervision. Then an adversarial network is utilized during training process to discriminate the output of DI2IN from ground truth, which further boosts the performance of DI2IN. The proposed method is trained on an annotated dataset of 1000 CT volumes with various different scanning protocols (e.g., contrast and non-contrast, various resolution and position) and large variations in populations (e.g., ages and pathology). Our approach outperforms the state-of-the-art solutions in terms of segmentation accuracy and computing efficiency.

1 Introduction

Accurate liver segmentation from three dimensional (3D) medical images, e.g. computed tomography (CT) or magnetic resonance imaging (MRI), is essential in many clinical applications, such as pathological diagnosis of hepatic diseases, surgical planning, and postoperative assessment. However, automatic liver segmentation is still a highly challenging task due to the complex background, fuzzy boundary, and various appearance of liver in medical images. To date, several methods have been proposed for automatic liver segmentation from 3D CT scans. Generally, they can be categorized into non-learning-based and learning-based approaches. Non-learning-based approaches usually
Electronic supplementary material The online version of this chapter (doi:10.1007/978-3-319-66179-7 58) contains supplementary material, which is available to authorized users.



Fig. 1. Proposed deep image-to-image network (DI2IN). The front part is a convolutional encoder-decoder network with feature concatenation, and the back end is a multi-level deep supervision network. Blocks inside DI2IN consist of convolutional and upscaling layers.

rely on the statistical distribution of the intensity, including atlas-based [1], active shape model (ASM)-based [2], level-set-based [3], and graph-cut-based [4] methods, etc. On the other hand, learning-based approaches take advantage of hand-crafted features to train classifiers that achieve good segmentation. For example, in [5], the proposed hierarchical framework applies marginal space learning with steerable features to handle the complicated texture pattern near the liver boundary. Until recently, deep learning has been shown to achieve superior performance in various challenging tasks, such as classification, segmentation, and detection. Several automatic liver segmentation approaches based on convolutional neural networks (CNN) have been proposed. Dou et al. [6] demonstrated a fully convolutional network (FCN) with deep supervision, which can perform end-to-end learning and inference. The output of the FCN is refined with a fully connected conditional random field (CRF) approach. Similarly, Christ et al. [7] proposed cascaded FCNs followed by CRF refinement. Lu et al. [8] used an FCN with graph-cut-based refinement. Although these methods demonstrated good performance, they all used pre-defined refinement approaches. For example, both CRF and graph-cut methods are limited to the use of pairwise models and are time-consuming as well. They may cause serious leakage at boundary regions with low contrast, which is common in liver segmentation. Meanwhile, the Generative Adversarial Network (GAN) [9] has emerged as a powerful framework in various tasks. It consists of two parts: a generator and a discriminator. The generator tries to produce output that is close to the real samples, while the discriminator attempts to distinguish between real and generated samples. Inspired by [10], we propose an automatic liver segmentation approach using an adversarial image-to-image network (DI2IN-AN). A deep image-to-image network (DI2IN) serves as the generator to produce the liver



segmentation. It employs a convolutional encoder-decoder architecture combined with multi-level feature concatenation and deep supervision. Our network tries to optimize a conventional multi-class cross-entropy loss together with an adversarial term that aims to distinguish between the output of DI2IN and ground truth. Ideally, the discriminator pushes the generator’s output towards the distribution of ground truth, so that it has the potential to enhance generator’s performance by refining its output. Since the discriminator is usually a CNN which takes the joint configuration of many input variables, it embeds the higher-order potentials into the network (the geometric difference between prediction and ground truth is represented by the trainable network model instead of heuristic hints). The proposed method also achieves higher computing efficiency since the discriminator does not need to be executed at inference. All previous liver segmentation approaches were trained using dozens of volumes which did not take the full advantage of CNN. In contrast, our network leverages the knowledge of an annotated dataset of 1000+ CT volumes with various different scanning protocols (e.g., contrast and non-contrast, various resolution and position) and large variations in populations (e.g., ages and pathology). To the best of our knowledge, our experiment is the first time that more than 1000 annotated 3D CT volumes are adopted in liver segmentation tasks. The experimental result shows that training with such a large dataset significantly improves the performance and enhances the robustness of the network.

2 Methodology

2.1 Deep Image-to-Image Network (DI2IN) for Liver Segmentation

In this section, we present a deep image-to-image network (DI2IN), a multi-layer convolutional neural network (CNN), for liver segmentation. The segmentation task is defined as voxel-wise binary classification. DI2IN takes the entire 3D CT volume as input and outputs probability maps that indicate how likely the voxels are to belong to the liver region. As shown in Fig. 1, the main structure of DI2IN is designed symmetrically as a convolutional encoder-decoder. All blocks in DI2IN consist of 3D convolutional and bilinear upscaling layers. The details of the network are described in Fig. 3. In the encoder part of DI2IN, only convolution layers are used in all blocks. In order to increase the receptive field of neurons and lower the GPU memory consumption, we set the stride to 2 at some layers and reduce the size of the feature maps. Moreover, a larger receptive field covers more contextual information and helps to preserve liver shape information in the prediction. The decoder of DI2IN consists of convolutional and bilinear upscaling layers. To enable end-to-end prediction and training, the upscaling layers are implemented as bilinear interpolation to enlarge the activation maps. All convolutional kernels are 3 × 3 × 3. The upscaling factor in the decoder is 2 for the x, y and z dimensions. The leaky rectified linear unit (Leaky ReLU) and batch normalization are adopted in all convolutional layers for proper gradient back-propagation.



In order to further improve the performance of DI2IN, we adopt several mainstream technologies with the necessary changes [6,11,12]. First, we use feature layer concatenation in DI2IN. Fast bridges are built directly from the encoder layers to the decoder layers. The bridges pass the information from the encoder forward and then concatenate it with the decoder feature layers. The combined feature is used as the input for the next convolution layer. By explicitly combining high-level and low-level features in this way, DI2IN benefits from local and global contextual information. Deep supervision of the neural network during end-to-end training has been shown to achieve good boundary detection and segmentation results. In our network, we introduce a more complex deep supervision scheme to improve performance. Several branches are separated from layers of the decoder section of the main DI2IN. With the appropriate upscaling and convolution operations, the output size of each channel of all branches matches the size of the input image (upscaling factors are 16, 4 and 1 in blocks 10, 11 and 12, respectively). By computing the loss term li against the same ground truth data, supervision is enforced at the end of each branch i. In order to further utilize the results of the different branches, the final output is determined by convolution operations over all branches with a leaky ReLU. During training, we apply a binary cross-entropy loss to each voxel of the output layers. The total loss ltotal is the weighted combination of the loss terms for all output layers, including the final output layer and the output layers of all branches, as follows:

ltotal = Σi wi · li + wfinal · lfinal

2.2 Network Improvement with Adversarial Training

We adopt the prevailing idea of generative adversarial networks to boost the performance of DI2IN. The proposed scheme is shown in Fig. 2. An adversarial network is adopted to capture high-order appearance information, distinguishing between the ground truth and the output from DI2IN. In order to guide the generator toward better predictions, the adversarial network provides an extra loss function for updating the parameters of the generator during training. The purpose of the extra loss is to make the prediction as close as possible to the ground truth labeling. We adopt the binary cross-entropy loss for training of the adversarial network.

Fig. 2. Proposed adversarial training scheme. The generator produces the segmentation prediction, and discriminator classifies the prediction and ground truth during training.



D and G represent the discriminator and generator (DI2IN, in this context), respectively. For the discriminator D(Y; θD), the ground truth label Ygt is assigned one, and the prediction Ypred = G(X; θG) is assigned zero, where X is the input CT volume. The structure of the discriminator network D is shown in Fig. 3. The following objective function is used in training the adversarial network:

lD = −Ey∼pgt log D(y; θD) − Ey′∼ppred log(1 − D(y′; θD))
   = −Ey∼pgt log D(y; θD) − Ex∼pdata log(1 − D(G(x; θG); θD))      (1)

During the training of network D, the gradient of the loss lD is propagated back to update the parameters of the generator network (DI2IN). At this stage, the loss for G has two components, shown in Eq. 2. The first component is the conventional segmentation loss lseg: the voxel-wise binary cross-entropy between the prediction and the ground truth. Minimizing the second loss component drives the prediction from G toward being indistinguishable from the ground truth for the discriminator D.

lG = Ey∼ppred, y′∼pgt [lseg(y, y′)] + λ Ey∼ppred log(1 − D(y; θD))
   = Ey∼ppred, y′∼pgt [lseg(y, y′)] + λ Ex∼pdata log(1 − D(G(x; θG); θD))      (2)

Following suggestions in [9], we replace −log(1 − D(G(x))) with log(D(G(x))). In other words, we maximize the probability that the prediction is classified as ground truth in Eq. 2, instead of minimizing the probability that it is classified as a generated label map. This replacement provides stronger gradients during the training of G and speeds up the training process in practice.

lG = Ey∼ppred, y′∼pgt [lseg(y, y′)] − λ Ex∼pdata log D(G(x; θG); θD)      (3)

The generator and discriminator are trained alternately for several iterations, as shown in Algorithm 1, until the discriminator is no longer able to easily distinguish between the ground truth labels and the output of DI2IN. After the training process, the adversarial network is no longer required at inference. The generator itself can provide high-quality segmentation results, and its performance is improved.
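A minimal PyTorch sketch of the two objectives, assuming sigmoid-activated probability outputs so that a binary cross-entropy loss applies directly; the function and variable names are illustrative:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def generator_loss(pred, gt, d_on_pred, lam=0.01):
    """Generator objective following Eq. (3): voxel-wise segmentation loss plus
    an adversarial term that rewards predictions the discriminator scores as
    real. `d_on_pred` is D(G(x)); `lam` corresponds to lambda in the paper."""
    seg = bce(pred, gt)
    adv = bce(d_on_pred, torch.ones_like(d_on_pred))   # -log D(G(x))
    return seg + lam * adv

def discriminator_loss(d_on_gt, d_on_pred):
    """Discriminator objective following Eq. (1): real label maps are
    labeled 1, generated maps are labeled 0."""
    return bce(d_on_gt, torch.ones_like(d_on_gt)) + \
           bce(d_on_pred, torch.zeros_like(d_on_pred))
```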

3 Experiments

Most public datasets for liver segmentation consist of only tens of cases. For example, the MICCAI-SLiver07 [13] dataset contains only 20 CT volumes for training and 10 CT volumes for testing, and all the data are contrast enhanced. Such a small dataset is not suitable to show the power of CNNs: it is well known that neural networks trained with more labelled data usually achieve much better performance. Thus, in this paper, we collected more than 1000 CT volumes. The liver of each volume was delineated by human experts. These data cover large variations in populations, contrast phases, scanning ranges, pathologies, and field of view (FOV), etc. The inter-slice distance varies from 0.5 mm to 7.0 mm.



Algorithm 1. Adversarial training of generator and discriminator.

Input: pre-trained generator (DI2IN) with weights θ0G
Output: updated generator weights θ1G
for number of training iterations do
   for kD steps do
      sample a mini-batch of training images x ∼ pdata
      generate prediction ypred for x with G(x; θ0G)
      θD ← propagate back the stochastic gradient of lD(ygt, ypred)
   end
   for kG steps do
      sample a mini-batch of training images x′ ∼ pdata
      generate y′pred for x′ with G(x′; θ0G) and compute D(G(x′))
      θ1G ← propagate back the stochastic gradient of lG(y′gt, y′pred)
   end
   θ0G ← θ1G
end

All scans cover the abdominal region but may extend to the head and feet. Tumors can be found in multiple cases, and the volumes may also show various other diseases; for example, pleural effusion, which brightens the lung region and changes the appearance of the upper boundary of the liver. We then collected an additional 50 volumes from clinical sites for independent testing. The livers of these data were also annotated by human experts for the purpose of evaluation. We down-sampled the dataset to 3.0 mm isotropic resolution to speed up the processing and lower the consumption of computing memory without loss of accuracy. Training DI2IN from scratch takes 200 iterations using stochastic gradient descent with a batch size of 4 samples. The learning rate is 0.01 in the beginning and is divided by 10 after 100 iterations. In the adversarial training (DI2IN-AN), we set λ to 0.01, and the number of training iterations is 100. For training D, kD is 10 and the mini-batch size is 8. For training G, kG is 1 and the mini-batch size is 4. Fewer training iterations are required for G than for D because G is pre-trained before the adversarial training. wi is set to 1 in the loss. Table 1 compares the performance of five different methods. The first method, the hierarchical, learning-based algorithm proposed in [5], was trained using 400 CT volumes; more training data did not improve performance for this method. For comparison purposes, the DI2IN network, which is similar to the deep learning based algorithms proposed in [6–8] without post-processing steps, and the DI2IN-AN were trained using the same 400 cases. Both the DI2IN network and the DI2IN-AN were also trained using all 1000+ CT volumes. The average symmetric surface distance (ASD) and Dice coefficients are computed for all methods on the test data. As shown in Table 1, DI2IN-AN achieves the best performance in both evaluation metrics. All deep learning based algorithms outperform the classic learning based algorithm with hand-crafted features, which shows the power of CNNs. The results show that more training data enhances the performance of both DI2IN and DI2IN-AN.



Fig. 3. Parametric setting of blocks in neural network. s stands for the stride, f is filter number. Conv. is convolution, and Up. is bilinear upscaling.

Take DI2IN for example: training with 1000+ labelled volumes improves the mean ASD by 0.23 mm and the max ASD by 3.84 mm compared to training with 400 labelled volumes. Table 1 also shows that the adversarial structure can further boost the performance of DI2IN; the maximum ASD error is also reduced. Typical test samples are provided in Fig. 4. We also tried CRF and graph cut to refine the output of DI2IN. However, the results became worse, since a large portion of the testing data had no contrast and the boundary of the liver bottom at many locations was very fuzzy. Both CRF and graph cut suffer from serious leakage in these situations. Using an NVIDIA TITAN X GPU and the Theano/Lasagne library, the run time of our algorithm is less than one second, which is significantly faster than most of the current approaches. For example, it requires 1.5 min per case in [6]. More experimental results can be found in the supplementary material. Our proposed DI2IN has clear advantages compared with other prevailing methods. First, previous studies show that DI2IN, which incorporates the encoder-decoder structure, skip connections, and a deep supervision scheme within one framework, has a better structural design than U-Net or the deeply supervised network (DSN) for 3D volumetric datasets [6,12]. DI2IN is a different design from other prevailing networks, but it gathers their merits. Second, CNN-based methods (without upsampling or deconvolution) are often time-consuming at inference, and their performance is sensitive to the selection of training samples. We examined the aforementioned networks with internal implementations, and DI2IN achieved better performance (20% improvement in terms of average symmetric surface distance).


Table 1. Comparison of five methods on 50 unseen CT data.

Method                  ASD (mm)                         Dice
                        Mean   Std    Max     Median     Mean   Std    Min    Median
Ling et al. (400) [5]   2.89   5.10   37.63   2.01       0.92   0.11   0.20   0.95
DI2IN (400)             2.25   1.28   10.06   2.0        0.94   0.03   0.79   0.94
DI2IN-AN (400)          2.00   0.95   7.82    1.80       0.94   0.02   0.85   0.95
DI2IN (1000)            2.15   0.81   6.51    1.95       0.94   0.02   0.87   0.95
DI2IN-AN (1000)         1.90   0.74   6.32    1.74       0.95   0.02   0.88   0.95

Fig. 4. Visual Results from different views. Yellow meshes are ground truth. Red ones are the prediction from DI2IN-AN.

4 Conclusion

In this paper, we proposed an automatic liver segmentation algorithm based on an adversarial image-to-image network. Our method achieves good segmentation quality as well as faster processing speed. The network is trained on an annotated dataset of 1000+ 3D CT volumes. We demonstrate that training with such a large dataset can improve the performance of CNN by a large margin.



References 1. Linguraru, M.G., Sandberg, J.K., Li, Z., Pura, J.A., Summers, R.M.: Atlas-based automated segmentation of spleen and liver using adaptive enhancement estimation. In: Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009. LNCS, vol. 5762, pp. 1001–1008. Springer, Heidelberg (2009). doi:10. 1007/978-3-642-04271-3 121 2. Kainmuller, D., Lange, T., Lamecker, H.: Shape constrained automatic segmentation of the liver based on a heuristic intensity model. In: Proceedings of MICCAI Workshop 3D Segmentation in the Clinic: A Grand Challenge, pp. 109–116 (2007) 3. Lee, J., Kim, N., Lee, H., Seo, J.B., Won, H.J., Shin, Y.M., Shin, Y.G., Kim, S.H.: Efficient liver segmentation using a level-set method with optimal detection of the initial liver boundary from level-set speed images. Comput. Methods Programs Biomed. 88(1), 26–28 (2007) 4. Massoptier, L., Casciaro, S.: Fully automatic liver segmentation through graphcut technique. In: Proceedings of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2007), pp. 5243–5246 (2007) 5. Ling, H., Zhou, S.K., Zheng, Y., Georgescu, B., Suehling, M., Comaniciu, D.: Hierarchical, learning-based automatic liver segmentation. In: CVPR, pp. 1–8 (2008) 6. Dou, Q., Chen, H., Jin, Y., Yu, L., Qin, J., Heng, P.-A.: 3D deeply supervised network for automatic liver segmentation from CT volumes. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 149–157. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 18 7. Christ, P.F., et al.: Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 415–423. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8 48 8. Lu, F., Wu, F., Hu, P., Peng, Z., Kong, D.: Automatic 3D liver location and segmentation via convolutional neural network and graph cut. Int. J. Comput. Assist. Radiol. Surg. 12(2), 171–182 (2017) 9. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014) 10. Luc, P., Couprie, C., Chintala, S.: Semantic Segmentation using Adversarial Networks. arXiv preprint. arXiv:1611.08408 (2016) 11. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10. 1007/978-3-319-24574-4 28 12. Merkow, J., Marsden, A., Kriegman, D., Tu, Z.: Dense volume-to-volume vascular boundary detection. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9902, pp. 371–379. Springer, Cham (2016). doi:10.1007/978-3-319-46726-9 43 13. Heimann, T., Van Ginneken, B., Styner, M.A., Arzhaeva, Y., Aurich, V., Bauer, C., Beck, A.: Comparison and evaluation of methods for liver segmentation from CT datasets. IEEE Trans. Med. Imaging 28(8), 1251–1265 (2009)

Transfer Learning for Domain Adaptation in MRI: Application in Brain Lesion Segmentation Mohsen Ghafoorian1,2,3 , Alireza Mehrtash2,4(B) , Tina Kapur2 , Nico Karssemeijer1 , Elena Marchiori3 , Mehran Pesteie4 , Charles R.G. Guttmann2 , Frank-Erik de Leeuw5 , Clare M. Tempany2 , Bram van Ginneken1 , Andriy Fedorov2 , Purang Abolmaesumi4 , Bram Platel1 , and William M. Wells III2

5

1 Diagnostic Image Analysis Group, Radboud University Medical Center, Nijmegen, The Netherlands [email protected] 2 Radiology Department, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA [email protected] 3 Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands 4 University of British Columbia, Vancouver, BC, Canada Department of Neurology, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands

Abstract. Magnetic Resonance Imaging (MRI) is widely used in routine clinical diagnosis and treatment. However, variations in MRI acquisition protocols result in different appearances of normal and diseased tissue in the images. Convolutional neural networks (CNNs), which have been shown to be successful in many medical image analysis tasks, are typically sensitive to variations in imaging protocols. Therefore, in many cases, networks trained on data acquired with one MRI protocol do not perform satisfactorily on data acquired with different protocols. This limits the use of models trained with large annotated legacy datasets on a new dataset with a different domain, which is a recurring situation in clinical settings. In this study, we aim to answer the following central questions regarding domain adaptation in medical image analysis: Given a fitted legacy model, (1) how much data from the new domain is required for a decent adaptation of the original network? and (2) what portion of the pre-trained model parameters should be retrained, given a certain number of new domain training samples? To address these questions, we conducted extensive experiments on the white matter hyperintensity segmentation task. We trained a CNN on legacy MR images of the brain and evaluated the performance of the domain-adapted network on the same task with images from a different domain. We then compared the performance of the model to the surrogate scenarios where either the same trained network is used or a new network is trained from scratch on the new dataset. The domain-adapted network tuned with only two training examples achieved a Dice score of 0.63, substantially outperforming a similar network trained on the same set of examples from scratch.

M. Ghafoorian and A. Mehrtash—Contributed equally to this work.

1 Introduction

Deep neural networks have been extensively used in medical image analysis and have outperformed conventional methods for specific tasks such as segmentation, classification and detection [1]. For instance, on brain MR analysis, convolutional neural networks (CNNs) have been shown to achieve outstanding performance for various tasks including white matter hyperintensity (WMH) segmentation [2], tumor segmentation [3], microbleed detection [4], and lacune detection [5]. Although many studies report excellent results on specific domains and image acquisition protocols, the generalizability of these models to test data with different distributions is often not investigated or evaluated. Therefore, to ensure the usability of trained models in real-world practice, which involves imaging data from various scanners and protocols, domain adaptation remains a valuable field of study. This becomes even more important when dealing with Magnetic Resonance Imaging (MRI), which demonstrates high variation in soft tissue appearance and contrast among different protocols and settings. Mathematically, a domain D can be expressed by a feature space χ and a marginal probability distribution P(X), where X = {x1, ..., xn} ∈ χ [6]. A supervised learning task on a specific domain D = {χ, P(X)} consists of a pair of a label space Y and an objective predictive function f(.) (denoted by T = {Y, f(.)}). The objective function f(.) can be learned from the training data, which consists of pairs {xi, yi}, where xi ∈ X and yi ∈ Y. After the training process, the learned model, denoted by f˜(.), is used to predict the label for a new instance x. Given a source domain DS with a learning task TS and a target domain DT with learning task TT, transfer learning is defined as the process of improving the learning of the target predictive function fT(.) in DT using the information in DS and TS, where DS ≠ DT or TS ≠ TT [6]. We denote by f˜ST(.) the predictive model initially trained on the source domain DS and domain-adapted to the target domain DT. In the medical image analysis literature, transfer classifiers such as adaptive SVM and transfer AdaBoost have been shown to outperform common supervised learning approaches in segmenting brain MRI when trained only on a small set of target domain images [7]. In another study, a machine learning based sample weighting strategy was shown to be capable of handling multi-center chronic obstructive pulmonary disease images [8]. Recently, several studies have also investigated transfer learning methodologies for deep neural networks applied to medical image analysis tasks. A number of studies used networks pre-trained on natural images to extract features, followed by another classifier such as a support vector machine (SVM) or a random forest [9]. Other studies [10,11] performed layer fine-tuning on the pre-trained networks to adapt the learned features to the target domain.


Considering the hierarchical feature learning fashion of CNNs, we expect the first few layers to learn features for general, simple visual building blocks, such as edges, corners and simple blob-like structures, while the deeper layers learn more complicated, abstract, task-dependent features. In general, the ability to learn domain-dependent high-level representations is an advantage that enables CNNs to achieve great recognition capabilities. However, it is not obvious how these qualities are preserved during the transfer learning process for domain adaptation. For example, it would be practically important to determine how much data on the target domain is required for domain adaptation with sufficient accuracy for a given task, or how many layers from a model fitted on the source domain can be effectively transferred to the target domain. Or, more interestingly, given a number of available samples on the target domain, what layer types and how many of them can we afford to fine-tune? Moreover, there is a common scenario in which a large set of annotated legacy data is available, often collected in a time-consuming and costly process. Upgrades in scanners, acquisition protocols, etc., as we will show, might make the direct application of models trained on the legacy data unsuccessful. To what extent these legacy data can contribute to a better analysis of new datasets, or vice versa, is another question worth investigating. In this study, we aim to answer the questions discussed above. We use a transfer learning methodology for domain adaptation of models trained on legacy MRI data for brain WMH segmentation.

2 Materials and Method

2.1 Dataset

The Radboud University Nijmegen Diffusion tensor and Magnetic resonance imaging Cohort (RUN DMC) [12] is a longitudinal study of patients diagnosed with small vessel disease. The baseline scans acquired in 2006 consisted of fluid-attenuated inversion recovery (FLAIR) images with a voxel size of 1.0 × 1.2 × 5.0 mm and an inter-slice gap of 1.0 mm, scanned with a 1.5 T Siemens scanner. However, the follow-up scans in 2011 were acquired differently, with a voxel size of 1.0 × 1.2 × 3.0 mm and a slice gap of 0.5 mm. The follow-up scans demonstrate a higher contrast, as the partial volume effect is less of an issue due to the thinner slices. For each subject, we also used a 3D T1 magnetization-prepared rapid gradient-echo (MPRAGE) image with a voxel size of 1.0 × 1.0 × 1.0 mm, which is the same for the two datasets. We should note that even though the two scanning protocols differ only in the FLAIR scans, it is generally accepted that FLAIR is by far the most informative modality for WMH segmentation. Reference WMH annotations on both datasets were provided semi-automatically, by manually editing segmentations produced by a WMH segmentation method [13] wherever needed. The T1 images were linearly registered to the FLAIR scans, followed by brain extraction and bias-field correction operations. We then normalized the image intensities to be within the range [0, 1].
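The paper does not include an implementation; purely as an illustration of the last preprocessing step described above, a minimal numpy sketch of mask-based min-max intensity normalization could look as follows (the function name and the use of a brain mask to define the intensity range are our own assumptions, and registration, brain extraction and bias-field correction are assumed to have been done beforehand with external tools):

import numpy as np

def normalize_to_unit_range(volume, brain_mask):
    # Min-max normalize the intensities inside the brain mask to [0, 1];
    # voxels outside the mask are set to zero.
    voxels = volume[brain_mask > 0]
    lo, hi = voxels.min(), voxels.max()
    normalized = np.zeros_like(volume, dtype=np.float32)
    normalized[brain_mask > 0] = (voxels - lo) / (hi - lo + 1e-8)
    return normalized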


Fig. 1. Architecture of the convolutional neural network used in our experiments. The shallowest i layers are frozen and the remaining d − i layers are fine-tuned; d is the depth of the network, which was 15 in our experiments.

Table 1. Number of patients for the domain adaptation experiments.

       Source domain               Target domain
Set    Train  Validation  Test     Train  Validation  Test
Size   200    30          50       100    26          33

In this study, we used 280 patient acquisitions with WMH annotations from the baseline as the source domain, and 159 scans from all the patients that were rescanned in the follow-up as the target domain. Table 1 shows the data split into the training, validation and test sets. It should be noted that the same patient-level partitioning that was used on the baseline was respected on the follow-up dataset to prevent potential label leakage.

2.2 Sampling

We sampled 32 × 32 patches to capture local neighborhoods around WMH and normal voxels from both the FLAIR and T1 images. We assigned each patch the label of its central voxel. To be more precise, we randomly selected 25% of all voxels within the WMH masks, and randomly selected the same number of negative samples from the normal-appearing voxels inside the brain mask. We augmented the dataset by flipping the patches along the y axis. This procedure resulted in training and validation datasets of size ∼1.2 m and ∼150 k on the baseline, and ∼1.75 m and ∼200 k on the follow-up.
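The sampling code is not part of the paper; a rough per-slice sketch of the procedure described above, with helper and variable names of our own choosing, might look like this:

import numpy as np

def sample_patches(flair, t1, wmh_mask, brain_mask, patch_size=32, pos_fraction=0.25, seed=0):
    # Sample FLAIR+T1 patches centred on WMH voxels (label 1) and on an equal
    # number of normal-appearing voxels (label 0); a mirrored copy of each patch
    # is added as augmentation. Volumes are assumed to be indexed (z, y, x).
    rng = np.random.default_rng(seed)
    half = patch_size // 2
    pos = np.argwhere(wmh_mask > 0)
    pos = pos[rng.random(len(pos)) < pos_fraction]                 # keep ~25% of WMH voxels
    neg = np.argwhere((brain_mask > 0) & (wmh_mask == 0))
    neg = neg[rng.choice(len(neg), size=len(pos), replace=False)]  # equal number of negatives
    patches, labels = [], []
    for label, centres in ((1, pos), (0, neg)):
        for z, y, x in centres:
            if half <= y < flair.shape[1] - half and half <= x < flair.shape[2] - half:
                patch = np.stack([flair[z, y - half:y + half, x - half:x + half],
                                  t1[z, y - half:y + half, x - half:x + half]])
                patches += [patch, patch[:, :, ::-1]]              # flip augmentation
                labels += [label, label]
    return np.asarray(patches, dtype=np.float32), np.asarray(labels, dtype=np.int64)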

2.3 Network Architecture and Training

We stacked the FLAIR and T1 patches as the input channels and used a 15-layer architecture consisting of 12 convolutional layers of 3 × 3 filters, 3 dense layers of 256, 128 and 2 neurons, and a final softmax layer. We avoided using pooling layers, as they would result in a shift-invariance property that is not desirable in segmentation tasks, where the spatial information of the features is important to preserve. The network architecture is illustrated in Fig. 1. To tune the weights in the network, we used the Adam update rule [14] with a mini-batch size of 128 and a binary cross-entropy loss function.


We used the Rectified Linear Unit (ReLU) activation function as the non-linearity and the He method [15], which randomly initializes the weights drawn from a N(0, 2/m) distribution, where m is the number of inputs to a neuron. Activations of all layers were batch-normalized to speed up convergence [16]. A decaying learning rate was used, with a starting value of 0.0001 for the optimization process. To avoid over-fitting, we regularized our networks with a drop-out rate of 0.3 as well as L2 weight decay with λ2 = 0.0001. We trained our networks for a maximum of 100 epochs with an early stopping policy. For each experiment, we picked the model with the highest area under the curve on the validation set. We trained our networks with a patch-based approach. At segmentation time, however, we converted the dense layers to their equivalent convolutional counterparts to form a fully convolutional network (FCN). FCNs are much more efficient as they avoid repetitive computations on neighboring patches by feeding the whole image into the network. We prefer the conceptual distinction between dense and convolutional layers at training time to keep the experiments general for classification problems as well (e.g., testing the benefits of fine-tuning the convolutional layers in addition to the dense layers). Patch-based training allows class-specific data augmentation to handle domains with hugely imbalanced class ratios (e.g., the WMH segmentation domain).
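No reference implementation accompanies the paper; a rough PyTorch sketch that is consistent with this description is given below. The number of feature maps per convolutional layer (64 here) and the use of unpadded convolutions are our own assumptions, as they are not specified in the text; the He-style initialization and the decaying learning-rate schedule are omitted for brevity.

import torch
import torch.nn as nn

class WMHPatchNet(nn.Module):
    # 12 unpadded 3x3 conv layers + 3 dense layers (256, 128, 2), with batch
    # normalization and ReLU after each layer and dropout on the dense layers.
    def __init__(self, in_channels=2, n_filters=64, dropout=0.3):
        super().__init__()
        convs, c = [], in_channels
        for _ in range(12):
            convs += [nn.Conv2d(c, n_filters, kernel_size=3),
                      nn.BatchNorm2d(n_filters), nn.ReLU(inplace=True)]
            c = n_filters
        self.features = nn.Sequential(*convs)          # 32x32 input -> 8x8 feature maps
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_filters * 8 * 8, 256), nn.BatchNorm1d(256), nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(128, 2))                         # softmax is folded into the loss

    def forward(self, x):
        return self.classifier(self.features(x))

model = WMHPatchNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)   # L2 weight decay
loss_fn = nn.CrossEntropyLoss()   # two-class cross-entropy on the softmax logits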

2.4 Domain Adaptation

To build the model f˜ST(.), we transferred the learned weights from f˜S, then froze the shallowest i layers and fine-tuned the remaining d − i deeper layers with the training data from DT, where d is the depth of the trained CNN. This is illustrated in Fig. 1. We used the same optimization update rule, loss function, and regularization techniques as described in Sect. 2.3.
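Concretely, with the sketch from the previous section this amounts to copying the source-domain weights and switching off gradients for the frozen part; the checkpoint file name below is hypothetical, and the split into features/classifier mirrors that sketch rather than the authors' code:

import torch

def set_trainable(module, trainable):
    # Freeze or unfreeze all parameters of a sub-network.
    for p in module.parameters():
        p.requires_grad = trainable

model.load_state_dict(torch.load("f_source_wmh.pt"))  # hypothetical source-domain checkpoint
set_trainable(model.features, False)                   # freeze the shallowest (convolutional) layers
set_trainable(model.classifier, True)                  # fine-tune only the dense layers (i = 12)
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)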

2.5 Experiments

On the WMH segmentation domain, we investigated and compared three different scenarios: (1) training a model on the source domain and directly applying it to the target domain; (2) training networks on the target domain data from scratch; and (3) transferring the model learned on the source domain to the target domain with fine-tuning. In order to identify the target domain dataset sizes for which transfer learning is most useful, the second and third scenarios were explored with different training set sizes of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25, 50 and 100 cases. We extensively expanded the third scenario, investigating the best freezing/tuning cut-off for each of the mentioned target domain training set sizes. We used the same network architecture and training procedure across the different experiments. The reported metric for segmentation quality assessment is the Dice score.
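For reference, the Dice score between a predicted segmentation P and the reference R is D = 2|P ∩ R| / (|P| + |R|); a minimal implementation is:

import numpy as np

def dice_score(prediction, reference):
    # Dice overlap between two binary masks of the same shape.
    prediction = prediction.astype(bool)
    reference = reference.astype(bool)
    intersection = np.logical_and(prediction, reference).sum()
    return 2.0 * intersection / (prediction.sum() + reference.sum() + 1e-8)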


Fig. 2. (a) Comparison of Dice scores on the target domain with and without transfer learning. A logarithmic scale is used on the x axis. (b) Given a deep CNN with d = 15 layers, transfer learning was performed by freezing the i initial layers and fine-tuning the last d − i layers. The Dice scores on the test set are illustrated with the color-coded heatmap. On the map, the number of fine-tuned layers is shown horizontally, whereas the target domain training set size is shown vertically.

Fig. 3. Examples of brain WMH MRI segmentations. (a) Axial T1-weighted image. (b) FLAIR image. (c–f) FLAIR images with segmented WMH labels: (c) reference (green) WMH. (d) WMH (red) from a domain-adapted model (f˜ST(.)) fine-tuned on five target training samples. (e) WMH (yellow) from a model trained from scratch (f˜T(.)) on 100 target training samples. (f) WMH (orange) from a model trained from scratch (f˜T(.)) on 5 target training samples.

3 Results

The model trained on the set of images from the source domain (f˜S) achieved a Dice score of 0.76. The same model, without fine-tuning, failed on the target domain with a Dice score of 0.005. Figure 2(a) compares the Dice scores obtained with three domain-adapted models to those of a network trained from scratch for different target training set sizes. Figure 2(b) illustrates the target domain test set Dice scores as a function of the target domain training set size and the number of abstract layers that were fine-tuned. Figure 3 presents and compares qualitative WMH segmentation results of several different models on a single sample slice.

4 Discussion and Conclusions

We observed that while f˜S demonstrated a decent performance on DS, it totally failed on DT. Although the same set of learned representations is expected to be useful for both, as the two tasks are similar, the failure comes as no surprise because the distribution of the responses to these features is different. Observing the comparisons presented in Fig. 2(a), it turns out that given only a small set of training examples on DT, the domain-adapted model substantially outperforms the model trained from scratch with the same size of training data. For instance, given only two training images, f˜ST achieved a Dice score of 0.63 on a test set of 33 target domain test images, while f˜T resulted in a Dice score of 0.15. As Fig. 2(b) suggests, with only a few DT training cases available, the best results can be achieved by fine-tuning only the last dense layers; otherwise, the enormous number of parameters compared to the training sample size would result in over-fitting. As soon as more training data becomes available, it makes more sense to fine-tune the shallower representations (e.g., the last convolutional layers). It is also interesting to note that tuning the first few convolutional layers is rarely useful, considering their domain-independent characteristics. Even though we did not experiment with training-time fully convolutional networks such as U-Net [17], arguments can be made that the same conclusions would be applicable to such architectures.

Acknowledgements. Research reported in this publication was supported by NIH Grant No. P41EB015898, the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canadian Institutes of Health Research (CIHR), and a VIDI innovational grant from the Netherlands Organisation for Scientific Research (NWO, grant 016.126.351).

References 1. Litjens, G., Kooi, T., Ehteshami Bejnordi, B., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A.W.M., van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. arXiv preprint arXiv:1702.05747 (2017)


2. Ghafoorian, M., Karssemeijer, N., Heskes, T., van Uden, I., Sanchez, C., Litjens, G., de Leeuw, F., van Ginneken, B., Marchiori, E., Platel, B.: Location sensitive deep convolutional neural networks for segmentation of white matter hyperintensities. arXiv preprint arXiv:1610.04834 (2016) 3. Kamnitsas, K., Ledig, C., Newcombe, V., Simpson, J.P., Kane, A.D., Menon, D.K., Rueckert, D., Glocker, B.: Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017) 4. Dou, Q., Chen, H., Yu, L., Zhao, L., Qin, J., Wang, D., Mok, V.C.T., Shi, L., Heng, P.A.: Automatic detection of cerebral microbleeds from mr images via 3d convolutional neural networks. IEEE Trans. Med. Imaging 35(5), 1182–1195 (2016) 5. Ghafoorian, M., Karssemeijer, N., Heskes, T., Bergkamp, M., Wissink, J., Obels, J., Keizer, K., de Leeuw, F.E., van Ginneken, B., Marchiori, E., Platel, B.: Deep multiscale location-aware 3d convolutional neural networks for automated detection of lacunes of presumed vascular origin. NeuroImage Clin. 14, 391–399 (2017) 6. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010) 7. Van Opbroek, A., Ikram, M.A., Vernooij, M.W., De Bruijne, M.: Transfer learning improves supervised image segmentation across imaging protocols. IEEE Trans. Med. Imaging 34(5), 1018–1030 (2015) 8. Cheplygina, V., Pena, I.P., Pedersen, J.H., Lynch, D.A., Sørensen, L., de Bruijne, M.: Transfer learning for multi-center classification of chronic obstructive pulmonary disease. arXiv preprint arXiv:1701.05013 (2017) 9. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115–118 (2017) 10. Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Todd Hurst, R., Kendall, C.B., Gotway, M.B., Liang, J.: Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Trans. Med. Imaging 35(5), 1299–1312 (2016) 11. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016) 12. van Norden, A.G., de Laat, K.F., Gons, R.A., van Uden, I.W., van Dijk, E.J., van Oudheusden, L.J., Esselink, R.A., Bloem, B.R., van Engelen, B.G., Zwarts, M.J., Tendolkar, I., Olde-Rikkert, M.G., van der Vlugt, M.J., Zwiers, M.P., Norris, D.G., de Leeuw, F.E.: Causes and consequences of cerebral small vessel disease. The RUN DMC study: a prospective cohort study. Study rationale and protocol. BMC Neurol. 11, 29 (2011) 13. Ghafoorian, M., Karssemeijer, N., van Uden, I., de Leeuw, F.E., Heskes, T., Marchiori, E., Platel, B.: Automated detection of white matter hyperintensities of all sizes in cerebral small vessel disease. Med. Phys. 43(12), 6246–6258 (2016) 14. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 15. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing humanlevel performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)


16. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015) 17. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10. 1007/978-3-319-24574-4 28

Retinal Microaneurysm Detection Using Clinical Report Guided Multi-sieving CNN Ling Dai1, Bin Sheng1(B), Qiang Wu2, Huating Li2(B), Xuhong Hou2, Weiping Jia2, and Ruogu Fang3

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China [email protected]
2 Shanghai Jiao Tong University, Sixth People’s Hospital, Shanghai, China [email protected]
3 Department of Biomedical Engineering, University of Florida, Gainesville, USA

Abstract. Timely detection and treatment of microaneurysms (MA) is a critical step to prevent the development of vision-threatening eye diseases such as diabetic retinopathy. However, detecting MAs in fundus images is a highly challenging task due to the large variation of imaging conditions. In this paper, we focus on developing an interleaved deep mining technique to cope intelligently with the unbalanced MA detection problem. Specifically, we present a clinical report guided multi-sieving convolutional neural network (MS-CNN) which leverages a small amount of supervised information in clinical reports to identify the potential MA regions via a text-to-image mapping in the feature space. These potential MA regions are then interleaved with the fundus image information for multi-sieving deep mining in a highly unbalanced classification problem. Critically, the clinical reports are employed to bridge the semantic gap between low-level image features and high-level diagnostic information. Extensive evaluations show that our framework achieves 99.7% precision and 87.8% recall, comparing favorably with the state-of-the-art algorithms. Integration of expert domain knowledge and image information demonstrates the feasibility of reducing the training difficulty of the classifiers under extremely unbalanced data distributions.

(L. Dai and R. Fang contributed equally to this work.)

1 Introduction

Diabetic retinopathy (DR) is the leading cause of blindness globally. Among an estimated 285 million people with diabetes mellitus worldwide, nearly one-third have signs of DR [1]. Fortunately, the risk of vision loss caused by DR can be notably reduced by early detection and timely treatment [2]. A microaneurysm (MA), the earliest clinical sign of DR, is a tiny aneurysm arising from the capillary wall that appears as a small red dot in the superficial retinal layers.


Fig. 1. Difficult cases in MA detection. (a) A normal and obvious MA. (b) Blood vessel joints similar to an MA (blue arrow). (c) Lighting and texture vary. (d) A hemorrhage (green arrow) may cause false positive detections. (e) A blurred fundus image, which makes detection of the MA (white arrow) more difficult. (f) Reflection noise (yellow arrow).

MA counts are an important measure of the progression of retinopathy in the early stage and may serve as a surrogate end point for severe change in some clinical trials [3]. However, manual segmentation and counting of MAs is time-consuming, subjective, error-prone and infeasible for large-scale fundus image analysis and diagnosis. Therefore, automatic detection and counting of MAs is a core component of any computer-aided retinopathy diagnosis system. However, several factors, including variation in image lighting, variability of image clarity, the occurrence of other red lesions, extremely low contrast and highly variable image background texture, make the segmentation of MAs difficult for an automated system. Figure 1 shows some examples of fundus images containing MAs that are challenging to detect. To address the above challenges, we propose a multi-modal framework utilizing both expert knowledge from text reports and color images. Different from previous methods, our proposed framework is able to (1) integrate non-image information from experts through clinical reports; and (2) accommodate the highly unbalanced classification problem in medical data.

2 Methodology

Our framework consists of two phases. In the first phase, a statistical image-text mapping model is generated using the algorithm described in Sect. 2.1. The model maps visual features to different kinds of lesions in the retina. We apply this method to each pixel of the fundus image to estimate the lesion type of the target pixel. The lesion estimation map is the output of Phase 1, where the intensity of each pixel can be easily decoded into the type and confidence of a lesion. In the second phase, we propose the multi-sieving CNN (MS-CNN), described in Sect. 2.2, to perform pixel-level binary classification of the fundus images for the highly unbalanced binary dataset. The color information from the raw fundus images is coupled with the lesion estimation map from Phase 1 and fed into the MS-CNN for MA detection.

2.1 Learning Image-Text Mapping Model

Fig. 2. Image-text mapping model learning illustration. The top box illustrates the training procedure of the image-text mapping model. A text-image mapping model is learned through the partitions’ lesion label distribution (middle row). At test time, the images are segmented into superpixels which are also mapped to the feature space for lesion label prediction (bottom row).

In this section, we introduce the image-text mapping model to extract expert knowledge from clinical reports. The input to the model is the fundus images with their corresponding clinical text reports, and the output of this model is a probabilistic map of the lesion types in the fundus image. Our proposed image-text mapping model consists of five stages, as shown in Fig. 2: (1) image preprocessing and feature extraction; (2) text information extraction from clinical reports; (3) random feature space partition; (4) lesion distribution model; (5) mapping features to lesion types. First, we resize all images to the same size and apply histogram balancing to the images to eliminate the large variation of imaging conditions. To improve the computational efficiency, the fundus images are over-segmented into superpixels using Simple Linear Iterative Clustering (SLIC) [4]. Features are further extracted from the superpixels using the pre-trained AlexNet model provided in the Caffe Model Zoo, where we use the fully connected layer of AlexNet [5], which has a high representational power for image features. Because the shape of a superpixel is not regular, we use a rectangle centered at the superpixel's center as the input to the CNN. Then, we extract the lesion information from the clinical text reports written in natural language. Based on our observation that the clinical text reports always contain the lesion names appearing in the corresponding fundus images, and that lesions are only mentioned positively, we match the keywords of lesions into binary arrays. In Table 1, we show examples of keyword matching from clinical text reports. Next, given the image visual features and the corresponding lesion types in the fundus images from the previous steps, we partition the visual feature space by assigning a probability weight of each lesion type to each feature space partition.


Table 1. An example of clinical text reports and the lesion-related information extracted. The highlighted words in the report content are key words for different kinds of lesions. The text report is then transformed into a binary array indicating whether a certain kind of lesion appears in the image, shown in the right columns. MA, HE, SE, HH represent microaneurysm, hard exudate, soft exudate and hemorrhage respectively.

Clinical reports                                                                    MA  HE  SE  HH
"There exist microaneurysm and dot hemorrhages in posterior pole,                  1   1   0   0
 probable hard exudate at upper temporal"
"microaneurysm line shaped hemorrhages are seen near vascular arcade, soft         1   0   1   1
 exudate are seen in upper temporal and temporal side of optic nerve head"

Because the location information of the lesions is missing from the text reports, a semantic mapping from visual features [6] is utilized to fill this gap. To this end, the feature space is first randomly and uniformly partitioned using a random fern [7], which generates an index value for each subspace. A fern is a sequence of binary decision functions F. These functions map the feature vectors of superpixels to an integer index of the partitioned space. Suppose there are L binary functions in F; as each binary function Li (i = 1, ..., L) produces one bit, the resultant binary code represents values in the range 1 to 2^L. The mapping between the superpixels and lesion types is obtained by assigning the lesion types in the clinical reports to all superpixels in the corresponding fundus image. Inspired by the "term frequency-inverse document frequency" (tf-idf) model in natural language processing [6], we developed a model called "partition frequency-inverse lesion frequency" (pf-ilf) to identify the best feature partitions to represent each lesion type. Here lesion types are treated as documents, and feature space partitions as terms. We use Laplacian smoothing to avoid zero partition frequency for some lesion types. The inverse lesion frequency ilf is defined as the total number of lesion types divided by the number of lesions that fall in the partition p. Then we can define the pf-ilf score of a partition p for a lesion type l as pf-ilf(p, l) = ((1 + f_{p,l}) / (L + max_k f_{k,l})) · log2(L / n_p), where f_{p,l} is the number of superpixels with the lesion label l that fall in the feature space partition p, L is the total number of lesion types, and n_p is the number of lesions that fall into partition p. With the proposed lesion distribution model using the pf-ilf score, we can identify the most representative feature space partitions p for a specific lesion type l by ranking the pf-ilf scores of all partitions for lesion type l. The middle of Fig. 2 visualizes the random subspace partitioning of the visual feature space and the mapping between feature space partitions and lesion types.
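The paper does not spell out the binary decision functions; the following sketch uses simple threshold tests on randomly chosen feature dimensions, which is one common choice for random ferns and is an assumption on our part:

import numpy as np

class RandomFern:
    # Map feature vectors to one of 2**n_bits partitions via n_bits binary tests.
    def __init__(self, feature_dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.dims = rng.integers(0, feature_dim, size=n_bits)   # which feature each test looks at
        self.thresholds = rng.normal(size=n_bits)                # threshold for each test

    def partition_index(self, features):
        # features: (n_samples, feature_dim); returns one integer partition code per sample.
        bits = (features[:, self.dims] > self.thresholds).astype(int)
        return bits.dot(1 << np.arange(bits.shape[1]))

Several independent ferns can be used, so that each superpixel maps to a small set of partition indices, which plays the role of Qs below.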


Finally, we predict the lesion types in each superpixel using the image-text mapping model, as illustrated in the middle and bottom rows of Fig. 2. For each lesion type l, we pick the top k partitions with the highest pf-ilf scores, denoted Pl(k). Suppose a superpixel s is mapped to a set of partitions Qs in the feature space. We define the final score of a superpixel s and a lesion type l as S(s, l) = |Pl(k) ∩ Qs|, i.e., the size of the intersection between Pl(k) and Qs. Finally, the superpixel s is labeled with the lesion type l with the highest score S(s, l).
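A small numpy sketch of the pf-ilf scoring and superpixel labeling, following our reading of the formula above; the variable names and smoothing details are ours:

import numpy as np

def pf_ilf(partition_counts, n_p):
    # partition_counts[p, l] = f_{p,l}: superpixels with lesion label l in partition p.
    # n_p[p]: number of lesions falling into partition p. Returns the pf-ilf matrix.
    n_partitions, n_lesions = partition_counts.shape
    pf = (1.0 + partition_counts) / (n_lesions + partition_counts.max(axis=0, keepdims=True))
    ilf = np.log2(n_lesions / np.maximum(n_p, 1.0))
    return pf * ilf[:, np.newaxis]

def label_superpixel(superpixel_partitions, scores, k=20):
    # P_l(k): the k partitions with the highest pf-ilf score for each lesion type l;
    # the superpixel gets the label l maximizing S(s, l) = |P_l(k) intersect Q_s|.
    top_k = np.argsort(-scores, axis=0)[:k]
    overlaps = [len(set(top_k[:, l]) & set(superpixel_partitions))
                for l in range(scores.shape[1])]
    return int(np.argmax(overlaps))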

2.2 Multi-sieving Convolutional Neural Network for MA Detection

In spite of its efficacy in large-scale image segmentation and object detection [5], a CNN still faces limitations when dealing with this use case, which requires detecting the MAs in the fundus images. First, it favors balanced datasets, while the ratio of negative examples (non-MA) to positive examples (MA) can be as high as 1000:1. Second, multiple misleading visual traits, such as blood vessels, can lead to erroneous classification using only the visual features of the fundus image. In other words, non-image information can provide critical meta-data to guide the classification model by integrating additional cues such as expert knowledge from the clinical reports. The right image at the bottom of Fig. 2 visualizes the additional input information from the image-text mapping model. To address the imbalance challenge, we propose the multi-sieving CNN (MS-CNN) to tackle the misclassification problem in the unbalanced fundus datasets. MS-CNN is a cascaded CNN model in which the false positives from the previous network are fed into the next network as negative examples. Suppose all positive samples are in set P, and negative samples are in set N. For the first phase, we select all samples in P and randomly select an equal number of samples in N as initial training samples (P(0), N(0)). Then, for the n-th phase, we first perform classification using the network trained in the previous phase on all samples in P(n−1) and N(n−1). This generates many false positive predictions, which are collected in the set FP(n). We again select all positive samples in P, but now randomly select an equal number of negative samples from FP(n), which are hard samples for the previous classifier.
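The cascade can be summarized by the following hedged sketch; train_block and predict stand in for training one CNN block and running its inference, and are assumptions rather than functions from the paper:

import numpy as np

def multi_sieving_training(positives, negatives, train_block, predict, n_blocks=2, seed=0):
    # Train a cascade of blocks: each block sees all positives plus an equal
    # number of negatives, and only the false positives of the current block
    # survive as the negative pool for the next block (hard-negative mining).
    rng = np.random.default_rng(seed)
    blocks, neg_pool = [], negatives
    for _ in range(n_blocks):
        idx = rng.choice(len(neg_pool), size=min(len(positives), len(neg_pool)), replace=False)
        block = train_block(positives, neg_pool[idx])
        blocks.append(block)
        neg_pool = neg_pool[predict(block, neg_pool) == 1]   # keep only misclassified negatives
        if len(neg_pool) == 0:
            break
    return blocks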

3 Experimental Results

In this section, extensive experiments are carried out to evaluate the clinical report guided multi-sieving CNN model for MA detection. We collected a dataset from a local hospital containing fundus images and clinical reports to train and test the image-text mapping model. We also use the standard diabetic retinopathy database (DIARETDB1 [8]) to test our method. The dataset collected from the local hospital contains 646 images, all with a resolution of 3504 × 2336; 433 of them show no obvious DR and the remaining images contain different lesions. The DIARETDB1 dataset contains 89 non-dilated fundus images with different kinds of lesions. AlexNet [5] is used in our experiments as the basic CNN architecture. We evaluate the efficacy of our proposed framework in terms of classification accuracy. All of our experiments were performed using a CNN with the same configuration as AlexNet [5]. We train and test the image-text mapping model using the dataset collected from the local hospital.


Table 2. Experimental results. The best results are highlighted using bold font.

Method                            Recall   Precision   Accuracy
Latim [9]                         0.805    0.744       0.805
Fujita Lab [10]                   0.715    0.703       0.713
C. Aravind [11]                   0.800    0.920       0.900
Fulong Ren [12]                   0.821    0.961       0.962
Eftal Sehirli [13]                0.691    0.993       -
MS-CNN without expert guidance    0.842    0.988       0.951
MS-CNN (block 1)                  0.179    1.000       0.178
MS-CNN (block 2)                  0.878    0.997       0.961

80% of the images are randomly selected as the training set and the rest as the testing set. Two annotators annotated disjoint subsets of the local dataset based on fluorescein fundus angiography. Because text reports are not available in DIARETDB1, we use the same image-text mapping model trained on the local data to perform the test on the DIARETDB1 dataset. Following the experimental setup in [8], 28 images in the dataset are preselected as training data and the remaining images are used as testing data. To show that the extra channel helps in the detection of MAs, we trained and tested two CNNs with the same architecture, one with the extra expert knowledge-guided channel and one with that channel filled with zeros to avoid a structural change of the CNN. To compare our method with existing methods, we implemented, tuned and tested two state-of-the-art methods described in [9,10], and we also compared with three recently published methods [11–13]. Our method achieved the highest scores in recall and precision compared to all other methods, with accuracy comparable to [12].


Fig. 3. (a) Illustration of the expert knowledge learning result. The top row shows original fundus images with different kinds of lesions; the bottom row shows the corresponding output of the expert knowledge model. The original output images are gray-level images where different gray levels represent different lesion types; we transformed them into pseudo-color images for visualization purposes. (b) Illustration of the segmentation results of the first and second blocks in MS-CNN. White dots, red dots, and green dots represent false positive, false negative, and true positive predictions respectively.


We also observe from Table 2 that, without the clinical report guided information, the proposed MS-CNN method already outperforms all state-of-the-art methods in terms of precision and recall. Furthermore, with the clinical report guided information, our proposed method significantly increases recall by 5.7% compared with the best of the five methods [12]. We also observe a significant increase of recall from 84.2% to 87.8% when the image-text mapping channel is added. This is critical for medical image analysis, as false negatives can be detrimental for disease diagnosis. It indicates that most MAs are too vague to be distinguished from the background; our image-text mapping model is able to find the right properties of MAs and thus provide the necessary information to the CNN. We also expected a significant increase in precision, but only a slight increase from 98.8% to 99.7% is observed. We believe that the multi-sieving scheme already eliminated most of the false positive predictions and masked the effect of the extra channel. We also experimented with more blocks in MS-CNN, but the improvement in performance is negligible while the computational cost increases linearly with the number of blocks, so we keep 2 blocks for the MS-CNN framework (Fig. 3). To demonstrate that the multi-sieving scheme is effective, we extracted the results from the first and the second blocks of MS-CNN. As expected, the recall increases sharply from 17.9% to 87.7%, while precision decreases slightly from 100% to 99.7%. We have to trade off between precision and recall; since the main purpose of MS-CNN is increasing recall, a slight decrease of precision is acceptable and the overall performance is improved.

4 Conclusions

This paper presents a novel clinical report guided framework for automatic MA detection from fundus images. We first extract expert knowledge from clinical text reports and map visual features to semantic profiles. Integrating keyword information from text reports with fundus images helps to boost the detection accuracy and results in promising performance in terms of precision and recall. The proposed framework performs favorably by overcoming MA detection challenges faced by existing approaches, including unbalanced datasets and varying imaging conditions, via the multi-sieving training strategy and the multi-modality information from clinical reports and visual features. The framework proposed in this paper is a generic approach which can easily be extended to the detection of multiple kinds of lesions in fundus images and other medical imaging modalities.

Acknowledgments. This work is supported by the National High-tech R&D Program of China (863 Program) (2015AA015904), NSF IIS-1564892, NIH-CTSC UL1TR000457, NSFC (61572316, 61671290, 61525106), the National Key R&D Program of China (2016YFC1300302), the Key Program for International S&T Cooperation Project (2016YFE0129500) of China, the Science and Technology Commission of Shanghai Municipality (16DZ0501100), and the Interdisciplinary Program of Shanghai Jiao Tong University (14JCY10).


References

1. Lee, R., Wong, T.Y., Sabanayagam, C.: Epidemiology of diabetic retinopathy, diabetic macular edema and related vision loss. Eye Vis. 2(1), 1 (2015)
2. UK Prospective Diabetes Study Group: Tight blood pressure control and risk of macrovascular and microvascular complications in type 2 diabetes: UKPDS 38. BMJ: British Medical Journal, pp. 703–713 (1998)
3. Klein, R., Meuer, S.M., Moss, S.E., Klein, B.E.K.: Retinal microaneurysm counts and 10-year progression of diabetic retinopathy. Arch. Ophthalmol. 113(11), 1386–1391 (1995)
4. Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Susstrunk, S.: SLIC Superpixels. EPFL Technical report 149300, p. 15, June 2010
5. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1–9 (2012)
6. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: Proceedings Ninth IEEE International Conference on Computer Vision (ICCV), pp. 2–9 (2003)
7. Ozuysal, M., Calonder, M., Lepetit, V., Fua, P.: Fast keypoint recognition using random ferns. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1–14 (2010)
8. Kauppi, T., Kalesnykiene, V., Kamarainen, J.-K., Lensu, L., Sorri, I., Raninen, A., Voutilainen, R., Uusitalo, H., Kälviäinen, H., Pietilä, J.: The DIARETDB1 diabetic retinopathy database and evaluation protocol. In: BMVC, pp. 1–10 (2007)
9. Quellec, G., Lamard, M., Josselin, P.M., Cazuguel, G., Cochener, B., Roux, C.: Optimal wavelet transform for the detection of microaneurysms in retina photographs. IEEE Trans. Med. Imaging 27(9), 1230–1241 (2008)
10. Mizutani, A., Muramatsu, C., Hatanaka, Y., Suemori, S., Hara, T., Fujita, H.: Automated microaneurysm detection method based on double ring filter in retinal fundus images, vol. 7260, pp. 72601N-1–72601N-8 (2009)
11. Aravind, C., Ponnibala, M., Vijayachitra, S.: Automatic detection of microaneurysms and classification of diabetic retinopathy images using SVM technique. In: IJCA Proceedings on International Conference on Innovations in Intelligent Instrumentation, Optimization and Electrical Sciences ICIIIOES 11, 18–22 (2013)
12. Ren, F., Cao, P., Li, W., Zhao, D., Zaiane, O.: Ensemble based adaptive oversampling method for imbalanced data learning in computer aided detection of microaneurysm. Comput. Med. Imaging Graph. 55, 54–67 (2017)
13. Sehirli, E., Turan, M.K., Dietzel, A.: Automatic detection of microaneurysms in RGB retinal fundus images. Studies 1(8) (2015)

Lesion Detection and Grading of Diabetic Retinopathy via Two-Stages Deep Convolutional Neural Networks Yehui Yang1,2(B), Tao Li2, Wensi Li3, Haishan Wu4, Wei Fan1, and Wensheng Zhang2

1 Big Data Lab, Baidu Research, Beijing, China [email protected]
2 Institute of Automation, Chinese Academy of Sciences, Beijing, China
3 The Beijing Moslem’s Hospital, Beijing, China
4 Heyi Ventures, Beijing, China

Abstract. We propose an automatic diabetic retinopathy (DR) analysis algorithm based on two-stages deep convolutional neural networks (DCNN). Compared to existing DCNN-based DR detection methods, the proposed algorithm has the following advantages: (1) Our algorithm can not only point out the lesions in fundus color images, but also give the severity grades of DR. (2) By introducing an imbalanced weighting scheme, more attention is paid to lesion patches for DR grading, which significantly improves the performance of DR grading under the same implementation setup. In this study, we label 12,206 lesion patches and re-annotate the DR grades of 23,595 fundus images from the Kaggle competition dataset. Under the guidance of clinical ophthalmologists, the experimental results show that our lesion detection net achieves performance comparable to trained human observers, and the proposed imbalanced weighting scheme is also shown to significantly enhance the capability of our DCNN-based DR grading algorithm.

Keywords: Diabetic retinopathy · Deep convolutional neural networks · Fundus images · Retinopathy lesions

1 Introduction

Diabetes is a widespread chronic disease in both developed and developing countries, including China and India [1–4]. Individuals with diabetes have a high probability of developing diabetic retinopathy (DR), which is one of the major causes of irreversible blindness [5,6]. However, according to the report from Deepmind Health1, 98% of severe vision loss caused by DR can be prevented by early detection and treatment. Therefore, quick and automatic detection of DR is critical and urgent to reduce the burden on ophthalmologists, as well as to provide timely morbidity analysis for the large number of patients.

1 https://deepmind.com/applied/deepmind-health/research/.



According to the International Clinical Diabetic Retinopathy Disease Severity Scale [5,9], the severity of DR can be graded into five stages: normal, mild, moderate, severe and proliferative. The first four stages can also be classified as non-proliferative DR (NPDR) or pre-proliferative DR, and NPDR has a high risk of progressing to proliferative DR (PDR, the fifth stage) without effective treatment. The early signs of DR are lesions such as microaneurysms (MA), hemorrhages and exudates. Therefore, lesion detection is a non-trivial step in the analysis of DR. There is a large body of literature on detecting lesions in the retina. Haloi et al. [10] achieve promising performance in exudate and cotton wool spot detection. Later, Haloi [4] detects MAs in color fundus images via deep neural networks. van Grinsven et al. [7] propose a selective sampling method for fast hemorrhage detection. Additionally, Srivastava et al. [11] achieve robust results in finding MAs and hemorrhages based on multiple kernel learning. However, the aforementioned algorithms do not provide the DR severity grades of the input fundus images, which are vital for the treatment of DR patients. Recently, Seoud et al. [8] propose an automatic DR grading algorithm based on random forests [19]. Leveraging deep learning techniques [13–16], Gulshan et al. [5] classify fundus images into normal and referable DR (defined as moderate and worse DR) with the annotations of 54 United States licensed ophthalmologists on over 128 thousand fundus images. Similarly, Sankar et al. [12] use a DCNN to grade DR into normal, mild DR and severe DR. Pratt et al. [2] predict the severity of DR according to the five-stage standard of the International DR Scale [5,9]. Even though these DR grading algorithms seem to have achieved promising performance, they still have the following problems: (1) The aforementioned DCNN-based DR grading methods can only give the DR grade but cannot indicate the location and type of the existing lesions in the fundus images. However, detailed information about the lesions may be more valuable to clinicians in treatment than a black box. (2) The above end-to-end DCNN2 may not be suitable for learning features for DR grading. Compared to the size of the input image, some tiny lesions (e.g., MAs and some small hemorrhages) are so inconspicuous that they are prone to be overwhelmed by the other parts of the input image in an end-to-end DCNN. However, such lesions are critical for DR grading according to the standard [9]. To address the above issues, we propose two-stages DCNNs for both lesion detection and DR grading. Accordingly, our method is composed of two parts: a local network to extract local features for lesion detection and a global network to exploit image features at the holistic level for DR grading. Instead of end-to-end DR grading, we construct a weighted lesion map to differentiate the contributions of different parts of the image.

End-to-end DCNN grading means that directly feed the input images into DCNN, then output the DR grades of the images.

Two-Stages CNNs for DR Analysis

535

in terms of the lesion information, i.e., the patches with more severe lesions will attract more attention to train the global grading net. Such imbalanced weighted scheme significantly improve the capability of the DR grading algorithm. Compared to the existing DCNN-based DR analysis algorithms, the proposed algorithm has the following advantages and contributions: (1) We propose a two-stages DCNN-based algorithm which can not only detect the lesions in fundus images but also grade the severity of DR. The twostages DCNNs learn more complete deep features of fundus images for DR analysis in both global and local scale. (2) We introduce imbalanced attention on input images by weighted lesion map to improve the performance of DR grading network. To the best of our knowledge, this is the first DNN-based work resorting imbalanced attention to learn underlying features in fundus images for DR grading.

2

Methods

In this section, we present the details of the proposed two-stages DCNN for lesion detection and DR grading. First, the input fundus images will be preprocessed and divided into patches, and the patches are classified by the local net into different lesion types. Then, the weighted lesion map is generated based on the input image and output of local net. Third, the global network is introduced for grading the DR of input image. (1) Local Network. To detect the lesions in the fundus image, the input images are divided into h×h patches using sliding windows with stride h−ov, where ov is the overlapped size between two adjacent patches. The local network is trained to classify the patches into 0 (normal), 1 (microaneurysm), 2 (hemorrhage), 3 (exudate), which are the main indicators to NPDR. (2) Weighed Lesion Map. Two maps are generated when all the patches in fundus image I ∈ Rd×d are classified by the local network. One is label map L ∈ Rs×s , which records the predicted labels of the patches. Wherein s = (d − h)/(h − ov), and . is the floor operator. The other map is probabilistic map P ∈ Rs×s , which retains the the biggest output probability of the softmax layer (the last layer of the local net) [17] for each patch label. As illustrated in Fig. 1(a), we construct a weighting matrix for each input image as: (1) Integrating the label map and probabilistic map as LP = (L + 1) P, where  is the element-wise product and 1 ∈ Rs×s is an all one matrix3 . (2) Each entry in LP is augmented to a h × h matrix (corresponding to the patch size). (3) Jointing the augmented matrixes into the weighting matrix MI ∈ Rd×d according to the relative locations in the input image I. The weighting matrix is constructed into the same size of input image, and the values in the intersection areas are set as the average values between adjacent expanded matrixes. 3

The motivation of the addition of the all one matrix is to avoid totally removing the information in the patches with label 0.

536

Y. Yang et al.

Fig. 1. The illustration of the construction of weighted lesion map. (a) Jointing label map and probabilistic map into weighting matrix. (b) Illustration of imbalanced attention on weighted lesion map.

The weighted lesion map of input image I is defined as I∗ = MI  I. The entries in the weighting matrix MI implicit the severity and probability of lesions in local patches. Therefore, the image patches have more severe lesion patches with higher probability will get higher weights in the weighted lesion map. As seen in Fig. 1(b), imbalanced attentions are payed on the weighted lesion map by highlighting the lesion patches. (3) Global Network. The global network is designed to grade the severity of DR according to the International Clinical Diabetic Retinopathy scale [9]. In this study, we focus on the advanced detection on NPDR, and the NPDR can be classified into four grades: 0 (normal), 1 (mild), 2 (moderate), 3 (severe). The global network is trained with the weighted lesion maps, and the output is the severity grade of the testing fundus images.

3 3.1

Experimental Evaluation Data Preparation

(1) Database: The Kaggle database contains 35, 126 training fundus images and 53, 576 testing photographs. All the images are assigned into five DR stages according to the international standard [9]. The images in the dataset come from different models and types of cameras under various illumination. According to our cooperant ophthalmologists, although the amount of images in this dataset is relatively big, there exist a large portion of biased labels. Additionally, the dataset do not indicates the locations of the lesions which are meaningful to clinicians. Therefore, we select subset from Kaggle database for re-annotation. The subset consists of 23, 595 randomly selected images in terms

Two-Stages CNNs for DR Analysis

537

of the four grades of NPDR, where 22, 795 for training and 800 for testing (each NPDR grade contains 200 testing images). The training and testing patches for lesion detection are cropped from the training and testing images respectively, which totally contains 12, 206 lesion patches and over 140 thousands randomly cropped normal patches. Licensed ophthalmologists and trained graduate students are invited or payed to annotate the lesions in the images and re-annotate DR grades of the fundus images. (2) Data Preprocessing and Augmentation. In this study, contrast improvement and circular region of interesting extraction are conducted on the color fundus images as [7]. For lesion detection, all the images are resized to 800×800, and the relative ratio between the sample height and length is kept by padding before resizing the raw images. The input sample size of global network are turned into 256 × 256 to reduce the computational complexity. Data augmentation are implemented to enlarge the training samples for deep learning, as well as to balance the samples across different classes. Inspired by [14], the augmentation methods include randomly rotation, cropping and scaling. The samples after augmentation is split into training and validating set for tuning the deep models, and the testing samples are not put into augmentation. The testing patches for local lesion detection are generated from testing images. (3) Reference Standard and Annotation. For training the local network, the patches are first annotated by two over three months trained observers, and both of them have experience in medical image processing. Then all the samples are checked by a clinical ophthalmologist. For training the global network, the labels of the fundus images are selected from Kaggle annotations by the trained observers, then the label biases are corrected by the clinical ophthalmologist. For the testing sets, firstly, all the testing lesion patches and DR grades of fundus images are annotated independently among all the trained observers and ophthalmologist, then the discrepancy patches are selected for further analyzing. Finally, the references of the samples are determined only by achieving the agreement of all annotators. 3.2

The Identification of Lesions

To evaluate the performance of local network for lesion recognition, we record the recall and precision for each class of lesion in testing fundus images on Table 1. The second line of the table present the number of different types of lesions in testing set. The left and right values in the table denote recall and precision respectively. Two baseline algorithms are take into comparison: random forests [19] and support vector machine (SVM) [18]. We use the default setting with a Python toolkit named Sciket-learn (http://scikit-learn.org/stable/) except that the number of RF trees is turned from 10 to 500. As seen in the table, the proposed local network significantly outperforms the random forests and SVM

538

Y. Yang et al.

Table 1. Recall (the left values) and precision (the right values) of lesion recognition Lesion number MA 1538

Hemorrhage 3717

Exudate 1248

Local network 0.7029/0.5678 0.8426/0.7445 0.9079/0.8380 Random forest 0.0078/0.06704

0.2754/0.1011

0.6859/0.1941

SVM

0.0108/0.0548

0.0787/0.0318

0.4153/0.0251

Normal: 0

AUC = 0.9687

MA: Hemorrhage: Exudate: 1 2 3

145,491

574

825

207

MA: 1

251

1,081

205

1

Hemorrhage: 2

327

247

3,132

11

Exudate: 3

68

2

45

1,133

High-sensitivity operation point

Sensitivity

Normal: 0

High-specificity operation point

1 - Specificity

(a)

(b)

Fig. 2. (a) Lesion confusion matrix. The value of (i, j)-th entry of the matrix denotes the number of class i patches with prediction as class j. Wherein, i, j ∈ {0, 1, 2, 3} according to the first and second axes respectively. (b) ROC curve (shown in red) of the proposed algorithm over lesion detection. The black diamonds on the curve indicate the sensitivity and specificity of our lesion detection algorithm on high-sensitivity and high-specificity operating points. The green and blue dots present the performance of two trained human observers on binary lesion detection on the same testing dataset.

under same training images, which indicate the powerful ability of DCNN in learning task-driven features. In addition, we also shown the confusion matrix for lesion recognition in Fig. 2(a). To show the importance of local net in finding retina lesions, we also train a binary classifier to distinguish the lesion patches from normal ones in the testing set. Receiver operating characteristics (ROC) curve is drawn with sensitivity and specif icity in Fig. 2(b), and the value of area under curve (AUC) is 0.9687. The black diamonds on the red curve highlight the performance of the proposed algorithm at high-specificity (sensitivity : 0.863, specif icity : 0.973) and highsensitivity points (sensitivity : 0.959, specif icity : 0.898). The green and blue dots correspond to the performance of two trained observers on binary lesion detection. As shown in the figure, the proposed algorithm can achieve superior performance than the trained observers by setting proper operating points. 3.3

Grading the DR Severity of Fundus Images

In this paper, we focus on the grading on NPDR, which can be classified into 0 to 3 stages: normal, mild, moderate and severe respectively. To prove the

Two-Stages CNNs for DR Analysis

539

importance of the proposed weighting scheme, we compare the Kappa score and Accuracy of grading networks with and without weighting (non-weighted for simplification) scheme under the same implementation setup. The results are shown in Fig. 3(a), and we also list the performance of the popular AlexNet under the condition of weighted and non-weighted scheme. As seen in Fig. 3(a), the proposed global grading net with weighting scheme achieve 0.767 Kappa score, and both the grading DCNNs achieve superior results with weighted lesion map, which prove the effectiveness of the proposed weighting scheme. Weighted AUC = 0.9590 Non-weighted AUC = 0.7986


Fig. 3. (a) Illustration of the importance of the proposed imbalanced weighting scheme for NPDR grading. (b) ROC curves of our global grading network for referable DR detection under weighted (red) and non-weighted (blue) conditions.

Since the symptoms of milder DR are often too inconspicuous to be spotted, judgements on milder DR are difficult to unify even among licensed ophthalmologists. Therefore, similar to [5], we also train our global net to distinguish referable DR from normal images. The ROC curves over sensitivity and 1 - specificity are illustrated in Fig. 3(b). The performance of referable DR detection with the weighted scheme is shown in red, and the AUC of the proposed algorithm is 0.9590. On the other hand, the performance of the same network under the non-weighted scheme is shown in blue, with a corresponding AUC of 0.7986. These results further prove the superiority of the proposed imbalanced weighting scheme for our end-to-end grading net.

4 Conclusion

In this paper, we proposed a two-stage DCNN approach to detect abnormal lesions and grade the severity of DR in fundus images. The experimental results have shown the effectiveness of the proposed algorithm, and this study can provide valuable information for clinical ophthalmologists in DR examination. However, there remain limitations to be addressed in our future work, such as collecting more high-quality annotated fundus data and paying attention to more types of lesions. Moreover, diabetic macular edema is an important open issue that also needs to be addressed.


References
1. Shaw, J.E., Sicree, R.A., Zimmet, P.Z.: Global estimates of the prevalence of diabetes for 2010 and 2030. Diabetes Res. Clin. Pract. 87(1), 4–14 (2010)
2. Pratt, H., Coenen, F., Broadbent, D.M.: Convolutional neural networks for diabetic retinopathy. Procedia Comput. Sci. 90, 200–205 (2016)
3. Bhaskaranand, M., Cuadros, J., Ramachandra, C., et al.: EyeArt + EyePACS: automated retinal image analysis for diabetic retinopathy screening in a telemedicine system. In: OMIA (2015)
4. Haloi, M.: Improved microaneurysm detection using deep neural networks. arXiv preprint (2015). arXiv:1505.04424v2
5. Gulshan, V., Peng, L., Coram, M., et al.: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. J. Am. Med. Assoc. 316(22), 2402–2410 (2016)
6. Kocur, I., Resnikoff, S.: Visual impairment and blindness in Europe and their prevention. Br. J. Ophthalmol. 86(7), 716–722 (2002)
7. van Grinsven, M.J., van Ginneken, B., Hoyng, C.B., et al.: Fast convolutional neural network training using selective data sampling: application to hemorrhage detection in color fundus images. IEEE Trans. Med. Imaging 35(5), 1273–1284 (2016)
8. Seoud, L., Chelbi, J., Cheriet, F.: Automatic grading of diabetic retinopathy on a public database. In: OMIA (2015)
9. American Academy of Ophthalmology: International Clinical Diabetic Retinopathy Disease Severity Scale (2012). http://www.icoph.org/dynamic/attachments/resources/diabetic-retinopathy-detail.pdf
10. Haloi, M., Dandapat, S., Sinha, R.: A Gaussian scale space approach for exudates detection classification and severity prediction. arXiv preprint (2015). arXiv:1505.00737
11. Srivastava, R., Duan, L., Wong, D.W.K., et al.: Detecting retinal microaneurysms and hemorrhages with robustness to the presence of blood vessels. Comput. Methods Programs Biomed. 138, 83–91 (2017)
12. Sankar, M., Batri, K., Parvathi, R.: Earliest diabetic retinopathy classification using deep convolution neural networks. Int. J. Adv. Eng. Technol. 7, 466–470 (2016)
13. Gu, J., Wang, Z., Kuen, J., et al.: Recent advances in convolutional neural networks. arXiv preprint (2016). arXiv:1512.07108v2
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
15. Szegedy, C., Liu, W., Jia, Y., et al.: Going deeper with convolutions. In: CVPR (2015)
16. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: CVPR (2016)
17. Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York (2006)
18. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
19. Kam, H.T.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)

Hashing with Residual Networks for Image Retrieval

Sailesh Conjeti1(B), Abhijit Guha Roy1, Amin Katouzian2, and Nassir Navab1,3

1 Computer Aided Medical Procedures, Technische Universität München, Munich, Germany
[email protected]
2 IBM Almaden Research Center, Almaden, USA
3 Computer Aided Medical Procedures, Johns Hopkins University, Baltimore, USA

Abstract. We propose a novel deeply learnt convolutional neural network architecture for supervised hashing of medical images through residual learning, coined Deep Residual Hashing (DRH). It offers maximal separability of classes in hashing space while preserving semantic similarities in local embedding neighborhoods. We also introduce a new optimization formulation comprising complementary loss terms and regularizations that best suit hashing objectives by controlling quantization errors. We conduct extensive validations on 2,599 Chest X-ray images with co-morbidities against eight state-of-the-art hashing techniques and demonstrate improved performance and computational benefits of the proposed algorithm for fast and scalable retrieval.

1 Introduction

Content-based image retrieval (CBIR) is often used for indexing and mining large image databases where similar images are retrieved given an unseen query image. The two main challenges are: (1) scalability, where similarity scores across databases require exhaustive computations, and (2) the semantic gap between visual content and associated annotations [1]. Alternatively, hashing based CBIR methods have been proposed where each image is indexed with a compact similarity preserving binary code that could potentially be leveraged for very fast search and retrieval. Unlike classification, which substitutes expert decision, retrieval aims at fine-grained ranking of a large number of candidates within the database according to their relevance to the query. In effect, this helps create a context similar to the query, thus assisting in clinical decision-making. In general, from a machine learning perspective, hashing methods can be categorized into: (1) shallow learning based hashing methods such as Locality Sensitive Hashing (LSH) [2], data-driven methods, e.g. Iterative Quantization (ITQ) [3], Kernel Sensitive Hashing [4], and Metric Hashing Forests (MHF) [5]; and (2) hashing using deep learning techniques such as Restricted Boltzmann Machines in semantic hashing [6], the deep hashing network for efficient similarity retrieval (DHN) [10], simultaneous feature learning and hashing (SFLH) [1], etc. Furthermore, within the medical image


Fig. 1. tSNE embeddings of the hash codes generated by the proposed algorithm (DRH) against comparative methods on unseen test set. Color indicates different classes. The figure needs to be viewed in color. Abbreviations are defined later in the paper.

computing community, application-specific hashing methods have also been proposed including weighted hashing for histology CBIR [7], binary code tagging for chest X-ray images [8], and forest based hashing for neuron images [5], to name a few. Shallow-learning based hashing methods perform encoding in two stages: generating a vector of hand-crafted descriptors followed by learning the hash functions. These two independent stages may lead to sub-optimal results as image descriptors may not be tailored for hashing. Linear projection based hashing methods (like LSH, ITQ etc.) have been demonstrated to be limited in capturing nonlinear manifold structure of samples and finally kernalized methods (like KSH, MHF etc.) suffer from scalability problems. On the other hand, deploying deep-learnt hashing, we are able to perform effective end-to-end learning of binary representations directly from input images. In other words, learning similarity preserving hashing functions aims to generate binary embeddings such that the class-separability is preserved and local semantic neighborhoods are well defined, which motivated us to propose a novel architecture, coined as Deep Residual Hashing (DRH), within this work. A preview of the qualitative results can be visualized in 2D by using t-Stochastic Neighborhood Embedding (t-SNE) [9] of unseen test data post learning with Hamming distance as a similarity measure in hashing space (see Fig. 1). Starting from Fig. 1(a) which is generated by a purely unsupervised setting, we aim at further improvement towards Fig. 1(d) which is closer to an embedding where classes are not only well-separated but also semantically close in local neighborhoods. Our contributions include: (1) leveraging residual network architecture to seek multiple hierarchical features for learning binary hash codes; (2) incorporating retrieval loss inspired by neighborhood component analysis for learning discriminative hash codes [12]; (3) leveraging multiple hashing related losses and regularizations to control the quantization error and simultaneously encourage hash bits to be maximally independent to each other; and (4) clinically, to the best of our knowledge, this is the first retrieval work on medical images (specifically, chest x-ray images) to discuss co-morbidities i.e. co-occurring manifestations of multiple diseases. The proposed DRH framework differs from related prior art in computer vision by Zhu et al. [10] in terms of use of additional losses i.e. bit-balance loss and orthogonality regularization that have been introduced


Fig. 2. Architecture for DRH with a hash layer. For an 18-layer network, P = 2, Q = 2, R = 2 and S = 2. For a 34-layer network, P = 3, Q = 4, R = 6 and S = 3.

to create balanced and independent bits. Lai et al. [1] use a divide-and-encode model to generate independent hash bits and a piece-wise thresholded sigmoid function for binarization. The DRH achieves this through orthogonality regularization and imposing quantization losses to minimize loss of retrieval quality. Another important difference to both [1,10] is the use of residual connections within a deep hashing framework, which is a very important design choice and is done for the first time in hashing within this work.

2 Methodology

An ideal hashing method should preserve similarity and generate codes that are compact and easy to compute [2]. The desired similarity-preserving aspect of the hashing function implies that semantically similar images are encoded with similar hash codes. Mathematically, hashing aims at learning a mapping H : I → {−1, 1}^K from an input image I to a K-bit binary code. In hashing for image retrieval, we typically define a similarity matrix S = {s_ij}, where s_ij = 1 implies images I_i and I_j are similar and s_ij = 0 otherwise.
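To make the notation concrete, the following sketch builds a label-derived similarity matrix S and ranks synthetic K-bit codes by Hamming distance; it is only an illustration of the definitions above, not part of the proposed method.

import numpy as np

# Synthetic stand-ins: class labels and K-bit codes with entries in {-1, +1}.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, 6)
codes = np.where(rng.normal(size=(6, 64)) >= 0, 1, -1)   # 64-bit codes

# Pairwise similarity matrix: s_ij = 1 if images share a label, 0 otherwise.
S = (labels[:, None] == labels[None, :]).astype(int)

def hamming(a, b):
    # Hamming distance between {-1, +1} codes: number of disagreeing bits.
    return int(np.sum(a != b))

query = codes[0]
ranking = np.argsort([hamming(query, c) for c in codes[1:]])  # retrieval by Hamming ranking
print(S)
print(ranking)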

2.1 Architecture for Deep Residual Hashing

We start with a deep convolutional neural network architecture inspired in part by the seminal ResNet architecture proposed for image classification by He et al. [11]. As shown in Fig. 2, the proposed architecture consists of a convolutional layer (Conv 1) followed by a sequence of residual blocks (Conv 2–5) and terminates in a fully connected hashing (FCH) layer for hash code-generation. This design choice is mainly motivated by empirical evidence presented by He et al. that very deep residual networks are easier to optimize over their plain convolutional counterparts owing to the introduction of short cut connections that offer additional support for gradient flow. They also demonstrated that the representational power of residual networks consistently improves with depth, in contrast to the significant degradation that is often observed in plain networks [11]. For the embedding to be binary in nature, we squash the output of the residual layers to be within [−1, 1] by passing it through a hyperbolic tangent (tanh) activation


function to get h_i. The final binary hash codes (b_i) are generated by quantizing h_i as b_i = sgn(h_i).

2.2 Model Learning and Optimization

In order to preserve the pairwise similarity matrix S, we use a supervised retrieval loss inspired by neighborhood component analysis [12]. Given N instances, the similarity matrix is defined as S = {s_ij}_{i,j=1}^N ∈ {0, 1}^{N×N}, and the proposed supervised retrieval loss is formulated as

J_S = 1 − (1/N²) Σ_{i,j=1}^N p_ij s_ij

where p_ij is the probability that any two instances (i and j) can be potential neighbors. Inspired by kNN classification, we define p_ij as a softmax function of the Hamming distance between the hash codes. As gradient-based optimization of J_S in the Hamming space is infeasible due to its non-differentiable nature, we surrogate hash codes with the non-quantized embeddings h(·) and use the Euclidean distance instead. This is derived as

p_ij = exp(−‖h_i − h_j‖²) / Σ_{l≠i} exp(−‖h_i − h_l‖²).

Such a differentiable surrogate for hashing has been proposed earlier in [6]; however, no additional losses were imposed to control the potential quantization error and large approximation errors in distance estimation. Generation of high quality hash codes requires us to negate this quantization error and bridge the gap in distance estimation. In this paper, we jointly optimize for J_S and improve hash code generation by imposing additional loss functions as follows:

Quantization Loss: We use a differentiable smooth surrogate to measure quantization error proposed by Zhu et al. [10] as J_Q = Σ_{i=1}^N log cosh(|h_i| − 1). With the incorporation of the quantization loss, we hypothesize that the final binarization step would incur significantly less quantization error and loss of retrieval quality (empirically validated in Sect. 4).

Bit Balance Loss: In addition to J_Q, we introduce an additional bit balance loss J_B to maximise the entropy of the learnt hash codes and effectively create balanced hash codes. Here, J_B is derived as J_B = −(1/2N) tr(HHᵀ). This loss aims at encouraging maximal information storage within each hash bit.

Regularization: Inspired by ITQ [3], we also introduce a relaxed orthogonality regularization constraint R_O on the convolutional weights (say, W_h) connecting the output of the final residual block of the network to the hashing block. This enforces decorrelation among hash codes and ensures each of the hash bits is independent. Here, R_O is formulated as R_O = (1/2)‖W_h W_hᵀ − I‖²_F. In addition to R_O, we also impose a weight decay regularization R_W = (1/2)(‖W_(·)‖²_F + ‖b_(·)‖²_2) to control the scale of learnt weights and biases.

We formulate the optimization for learning the parameters of our network (say, Θ : {W_(·), b_(·)}) as:

argmin_Θ  J = J_S + λ_q J_Q + λ_b J_B + λ_o R_O + λ_w R_W        (1)
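For intuition, the individual terms of Eq. (1) can be evaluated on a toy batch as in the sketch below; this is a simplified numpy illustration of the formulas above (the actual model is trained in MatConvNet, as described in Sect. 3), and the batch, similarity matrix and hash-layer weights are synthetic.

import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 16
H = np.tanh(rng.normal(size=(N, K)))                  # non-quantized embeddings h_i in (-1, 1)
S = (rng.integers(0, 2, N)[:, None] == rng.integers(0, 2, N)[None, :]).astype(float)
W_h = 0.1 * rng.normal(size=(K, K))                   # hypothetical hash-layer weights

# Supervised retrieval loss J_S: softmax neighborhood probabilities over squared Euclidean distances.
D = np.sum((H[:, None, :] - H[None, :, :]) ** 2, axis=-1)
P = np.exp(-D)
np.fill_diagonal(P, 0.0)
P = P / P.sum(axis=1, keepdims=True)
J_S = 1.0 - np.sum(P * S) / N**2

J_Q = np.sum(np.log(np.cosh(np.abs(H) - 1.0)))        # quantization loss
J_B = -np.trace(H @ H.T) / (2 * N)                    # bit balance loss
R_O = 0.5 * np.linalg.norm(W_h @ W_h.T - np.eye(K), "fro") ** 2   # orthogonality regularizer

J = J_S + 0.05 * J_Q + 0.025 * J_B + 0.01 * R_O       # lambda values as reported in Sect. 3
print(J_S, J_Q, J_B, R_O, J)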


where λ_q, λ_b, λ_o and λ_w are four parameters that balance the effect of the different contributing terms. To solve this optimization problem, we employ mini-batch stochastic gradient descent (SGD) with momentum to estimate Θ. The gradient of J_S with respect to the hash code of a single example (h_i) is derived as:

∂J_S/∂h_i = 2( Σ_{l: s_li>0} p_li d_li − Σ_{l≠i} ( Σ_{q: s_lq>0} p_lq ) p_li d_li ) − 2( Σ_{j: s_ij>0} p_ij d_ij − ( Σ_{j: s_ij>0} p_ij ) Σ_{z≠i} p_iz d_iz )        (2)

where d_ij = h_i − h_j. We suitably modify Eq. (2) for a multi-label setting, in contrast to the classical single-label scenario presented in [6,12]. The derivatives of the hashing-related loss functions (J_Q and J_B) are derived as ∂J_Q/∂h_i = tanh(|h_i| − 1) sgn(h_i) and ∂J_B/∂h_i = −h_i. The regularization function R_O acts on the convolutional weights of the hash layer (W_h) and its derivative w.r.t. W_h is derived as ∂R_O/∂W_h = (W_h W_hᵀ − I) W_h. Having computed the gradients of the loss function with respect to Θ, we apply a gradient descent-based learning rule to update Θ. For faster learning, we initialize the learning rate (LR) to the largest value that stably decreases the objective function (typically 10⁻² or 10⁻³). Upon convergence at a particular setting of the LR, we scale it multiplicatively by 0.1 and resume training. This is repeated until convergence or until reaching the maximum number of epochs (here, 150). Such a training approach scales as O(n²) with respect to the number of training examples.

3 Experiments

Database: We used 2,599 frontal view Chest X-rays (CXR) images from publicly available Indiana University (CXR) dataset (https://openi.nlm.nih.gov) [13]. Following the label generation strategy published in [14], we considered nine most frequently occurring unique patterns of Medical Subject Headings (MeSH) terms, listed in Fig. 4, related to cardiopulmonary diseases extracted from their radiology reports. We used non-overlapping subsets for training (80%) and testing (20%) with patient and disease-level splits. The semantic similarity matrix S is constructed using the MeSH terms i.e. a pair of images are considered similar if they share at least one MeSH term. Comparative Methods and Evaluation Metrics: In our comparative study, we used two unsupervised shallow-learning techniques: LSH [2], ITQ [3]; two supervised shallow-learning methods: KSH [4] and MHF [5], and four deep learning based methods: AlexNet - KSH (A - KSH) [15], VGGF - KSH (V - KSH) [16], SFLH [1] and DHN [10]. We also performed ablative testing and made baseline comparisons against DPH (Deep Plain Net Hashing) by removing the residual connections, and also DRH-NB, where continuous embeddings are used sans quantization, which may act as an upper bound on performance. We used the standard metrics for retrieval, as proposed by Lai et al. [1]: Mean Average Precision (MAP) and Precision - Recall Curves varying the code size (16, 32, 48 and 64 bits). For a fair comparison, all methods were trained and tested on identical


Fig. 3. Precision-recall curves at code size of 64 bits based on Hamming ranking of retrieved results on unseen test set.

data folds. To better understand the effect of network depth, we evaluated two variants of DRH with differing depths: (·)-18 and (·)-34. For shallow hashing, we utilized 512-D GIST vectors [17] as features. For deep hashing, the input image was resized to 224 × 224 and normalized to a dynamic range of [0–1] using the pre-processing steps discussed in [14]. For A-KSH and V-KSH, the image normalizations were identical to [15,16], and we fine-tuned the networks with a cross-entropy loss and binarized the fully connected layers with KSH. For SFLH and DHN, the loss functions were retained from the original works; however, we used our proposed 34-layer residual architecture for a fair comparison. We implemented all deep hashing methods on the open-source MatConvNet framework [18]. The hyper-parameters λ_q, λ_b, λ_o and λ_w were set at 0.05, 0.025, 0.01 and 0.001 after single-parameter tuning. The momentum, initial learning rate, and batch size were set to 0.9, 10⁻², and 128, respectively. For all deep hashing methods, the training data was augmented on-the-fly extensively through random rigid transformations, and intensity augmentation by matching histograms between images sharing similar co-morbidities. For shallow hashing methods, the two most significant hyper-parameters were tuned with grid search.
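As an illustration of the intensity augmentation mentioned above, histogram matching between two images can be performed, for example, with scikit-image; this is not the toolchain used in the paper, and the pairing of images by shared co-morbidity is only sketched.

import numpy as np
from skimage.exposure import match_histograms

# Hypothetical CXR arrays; in practice the reference would be another image
# sharing a similar co-morbidity label with the input image.
rng = np.random.default_rng(0)
image = rng.random((224, 224))
reference = rng.random((224, 224))

augmented = match_histograms(image, reference)   # intensity profile of `image` mapped onto `reference`
print(augmented.shape)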

4 Results and Discussion

Introduction of residual connections offers short-cuts, which act as zero-resistance paths for gradient flow, thus effectively mitigating the vanishing of gradients as network depth increases. This is strongly substantiated by comparing the performance of DRH-34 to DRH-18 vs. the plain net variants of the same depth, DPH-34 to DPH-18 (see Table 1). By increasing the layer depth, we observe a significant improvement in MAP for DRH (9.3%) whereas the DPH performance is degraded by 2.2% at 64 bits. In addition, we observe that the MAP performance of the majority of supervised hashing methods improves substantially as code size is increased. Comparing DRH to shallow hashing methods, it is evident from the significant gap in the PR curves (see Fig. 3a) that GIST features fail to capture the semantic concepts despite the introduction of supervised hashing (KSH and MHF).


Table 1. MAP of Hamming ranking w.r.t. varying code sizes and time for retrieval († GPU; ‡ CPU) (64 bits) for comparative methods.

Method        MAP 16 bits   MAP 32 bits   MAP 48 bits   MAP 64 bits   Time (in ms)
LSH [2]       22.77         23.93         23.99         24.85         193.7‡
ITQ [3]       25.06         25.19         25.87         26.23         194.1‡
MHF [5]       23.62         27.02         30.78         36.75         212.3‡
KSH [4]       26.46         32.49         32.01         30.42         198.5‡
A-KSH [15]    35.95         37.28         36.64         39.31         28.28†
V-KSH [16]    47.92         50.64         53.62         52.61         40.45†
SFLH [1]      62.94         63.27         70.48         73.37         13.11†
DHN [10]      62.31         50.64         62.38         70.47         13.48†
DPH-18        48.78         52.13         54.01         66.59         4.75†
DPH-34        46.64         44.43         51.39         64.38         5.08†
DRH-18        50.93         57.46         62.76         67.44         11.23†
DRH-34        56.79         65.80         75.81         76.72         13.17†

Fig. 4. Qualitative retrieval results for DRH-34.

However, this is mitigated with end-to-end learning in deep hashing methods as shown in Fig. 3b. Particularly, DRH outperforms the state-of-the-art methods (both SFLH and DHN) as well as A-KSH and V-KSH. This clearly demonstrates that simultaneous representation learning for hashing is preferred. Drawing comparisons from Table 1, we observe that at code sizes larger than 16 bits, DRH-34 consistently outperforms SFLH and DHN (despite the network architecture for all the methods being residual in nature and of the same depth). This singles out the proposed loss combinations as better than the triplet loss (SFLH) or pairwise cross entropy (DHN) for the problem at hand. Figure 4 demonstrates the first five retrieval ranking results for four randomly selected CXR images from the testing set. Case (d) is of particular interest, where we observe that the top neighbors (d 1–5) share at least one co-occurring pathology. For cases (a), (b), and (c), all top five retrieved neighbors share the same class. Computationally, DRH-34 takes about 13.17 ms for a single query, supporting fast retrieval performance without compromising accuracy in comparison with the rest of the methods.

Table 2. MAP of the Hamming ranking w.r.t. varying network depths for baseline variants of DRH at a code size of 64 bits (• = loss term active, ◦ = loss term set to zero).

Variant of DRH-(·)   λq   λb   λo   18-L    34-L
Ablative testing     ◦    ◦    ◦    48.97   55.07
                     ◦    ◦    •    56.03   60.03
                     ◦    •    ◦    52.51   55.80
                     •    ◦    ◦    62.69   66.23
                     ◦    •    •    53.43   62.66
                     •    ◦    •    66.42   71.17
                     •    •    ◦    64.39   72.46
NB                   •    •    •    69.21   77.45
Proposed             •    •    •    67.44   76.72

To investigate the contributions of the proposed loss terms, we performed extensive ablative testing by setting combinations of λ_q, λ_b and λ_o to zero individually. The weight decay term R_W is standard practice in training deep networks, hence not tested ablatively. The MAP and the PR curves for a code size of 64 bits are shown in Table 2 and Fig. 3c respectively. From Table 2, we observe that amongst the hashing related losses,


the quantization loss (J_Q) is of primordial importance (DRH-34 without J_Q under-performs by 14% relative to the variant with J_Q). Orthogonalization (R_O) boosts the performance by 5% over the baseline without any additional losses for DRH-34, implying that the mutual independence of the hash bits is improved. Interestingly, the bit balance loss (J_B) improves performance only marginally (0.7%) for DRH-34, but significantly for DRH-18 (3.5%). It must be noted that combinations with at least two losses improve consistently over the individual baselines, substantiating the effectiveness of the proposed loss function and its complementary nature. We also observe that DRH-18 and DRH-34 incur very small average MAP decreases of 1.8% and 0.7% against the non-binarized continuous embeddings in DRH-NB-18 and DRH-NB-34 respectively, implying minimal loss due to quantization.

5 Conclusions

In this paper, we presented a novel deep hashing approach leveraging upon residual learning, coined as Deep Residual Hashing. It integrates representation learning and hash coding into an end-to-end joint optimization framework with supervised retrieval and suitable hashing related loss functions. We demonstrated promising results on a challenging chest X-ray dataset with co-occurring morbidities. In the future, we would investigate potential extensions to include additional anatomical views (like the dorsal view for CXR) and study generalization for other unseen disease manifestations.

References
1. Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. In: CVPR 2015, pp. 3270–3278. IEEE (2015)
2. Slaney, M., Casey, M.: Locality-sensitive hashing for finding nearest neighbours. Sig. Proc. Mag. 5(2), 128–131 (2008)
3. Gong, Y., Lazebnik, S.: Iterative quantization: a procrustean approach to learning binary codes. In: CVPR 2011, pp. 817–824. IEEE (2011)
4. Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with kernels. In: CVPR 2012, pp. 2074–2081. IEEE (2012)
5. Conjeti, S., Katouzian, A., Kazi, A., Mesbah, S., Beymer, D., Syeda-Mahmood, T.F., Navab, N.: Metric hashing forests. MedIA 34, 13–29 (2016)
6. Torralba, A., Fergus, R., Weiss, Y.: Small codes and large image databases for recognition. In: CVPR 2008, pp. 1–8. IEEE (2008)
7. Zhang, X., Su, H., Yang, L., Zhang, S.: Weighted hashing with multiple cues for cell-level analysis of histopathological images. In: Ourselin, S., Alexander, D.C., Westin, C.-F., Cardoso, M.J. (eds.) IPMI 2015. LNCS, vol. 9123, pp. 303–314. Springer, Cham (2015). doi:10.1007/978-3-319-19992-4_23
8. Sze-To, A., Tizhoosh, H.R., Wong, A.K.: Binary codes for tagging x-ray images via deep de-noising autoencoders. arXiv preprint (2016). arXiv:1604.07060
9. Maaten, L.V., Hinton, G.: Visualizing data using t-SNE. JMLR 9, 2579–2605 (2008)
10. Zhu, H., Long, M., Wang, J., Cao, Y.: Deep hashing network for efficient similarity retrieval. In: AAAI 2016 (2016)


11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint (2015). arXiv:1512.03385
12. Goldberger, J., Hinton, G.E., Roweis, S.T., Salakhutdinov, R.: Neighbourhood components analysis. In: NIPS 2004, pp. 513–520 (2004)
13. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. JAMIA 23(2), 304–310 (2016)
14. Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J., Summers, R.M.: Learning to read chest x-rays: recurrent neural cascade model for automated image annotation. arXiv preprint (2016). arXiv:1603.08486
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS 2012, pp. 1097–1105 (2012)
16. Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: delving deep into convolutional nets. arXiv preprint (2014). arXiv:1405.3531
17. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42(3), 145–175 (2001)
18. Vedaldi, A., Lenc, K.: MatConvNet: convolutional neural networks for MATLAB. In: ACM International Conference on Multimedia 2015, pp. 689–692. ACM (2015)

Deep Multiple Instance Hashing for Scalable Medical Image Retrieval

Sailesh Conjeti1(B), Magdalini Paschali1, Amin Katouzian2, and Nassir Navab1,3

1 Computer Aided Medical Procedures, Technische Universität München, Munich, Germany
[email protected]
2 IBM Almaden Research Center, Almaden, USA
3 Computer Aided Medical Procedures, Johns Hopkins University, Baltimore, USA

Abstract. In this paper, for the first time, we introduce a multiple instance (MI) deep hashing technique for learning discriminative hash codes with weak bag-level supervision suited for large-scale retrieval. We learn such hash codes by aggregating deeply learnt hierarchical representations across bag members through an MI pool layer. For better trainability and retrieval quality, we propose a two-pronged approach that includes robust optimization and training with an auxiliary single instance hashing arm which is down-regulated gradually. We pose retrieval for tumor assessment as an MI problem because tumors often coexist with benign masses and could exhibit complementary signatures when scanned from different anatomical views. Experimental validations demonstrate improved retrieval performance over the state-of-the-art methods.

1 Introduction

In breast examinations, such as mammography, detected actionable tumors are further examined through invasive histology. Objective interpretation of these modalities is fraught with high inter-observer variability and limited reproducibility [1]. In this context, a reference-based assessment, such as presenting prior cases with similar disease manifestations (termed Content-Based Image Retrieval (CBIR)), could be used to circumvent discrepancies in cancer grading. With growing sizes of clinical databases, such a CBIR system ought to be both scalable and accurate. Towards this, hashing approaches for CBIR are being actively investigated for representing images as compact binary codes that can be used for fast and accurate retrieval [2–4]. Malignant carcinomas are often co-located with benign manifestations and suspect normal tissues [5]. In such cases, describing the whole image with a single label is inadequate for objective machine learning and alternatively requires expert annotations delineating the exact location of the region of interest. S. Conjeti and M. Paschali contributed equally.


Fig. 1. Overview of DMIH for end-to-end generation of bag-level hash codes. Breast anatomy image is attributed to Cancer Research UK/Wikimedia Commons.

This argument extends to screening modalities like mammograms, where multiple anatomical views are acquired. In such scenarios, the status of the tumor is best represented to a CBIR system by constituting a bag of all associated images, thus veritably becoming multiple instance (MI) in nature. With this as our premise we present, for the first time, a novel deep learning based MI hashing method, termed as Deep Multiple Instance Hashing (DMIH). Seminal works on shallow learning-based hashing include Iterative Quantization (ITQ) [6], Kernel Sensitive Hashing (KSH) [2] etc. that propose a two-stage framework involving extraction of hand-crafted features followed by binarization. Yang et al. extend these methods to MI learning scenarios with two variants: Instance Level MI Hashing (IMIH) and Bag Level MI Hashing (BMIH) [7]. However, these approaches are not end-to-end and are susceptible to semantic gap between features and associated concepts. Alternatively, deep hashing methods such as simultaneous feature learning and hashing (SFLH) [8], deep hashing networks (DHN) [9] and deep residual hashing (DRH) [3] propose the learning of representations and hash codes in an end-to-end fashion, in effect bridging this semantic gap. It must be noted that all the above deep hashing works targeted single instance (SI) hashing scenarios and an extension to MI hashing was not investigated. Earlier works on MI deep learning in computer vision include work by Wu et al. [10], where the concept of an MI pooling (MIPool) layer is introduced to aggregate representations for multi-label classification. Yan et al. leveraged MI deep learning for efficient body part recognition [11]. Unlike MI classification that potentially substitutes the decision of the clinician, retrieval aims at presenting them with richer contextual information to facilitate decision-making. DMIH effectively bridges the two concepts for CBIR systems by combining the representation learning strength of deep MI learning with the potential for scalability arising from hashing. Within CBIR for breast cancer, notable prior art includes work on mammogram image retrieval by Jiang et al. [12] and largescale histology retrieval by Zhang et al. [4]. Both these works pose CBIR as an SI retrieval problem. Contrasting with [4,12], within DMIH we create a bag of images to represent a particular pathological case and generate a bag-level hash code, as shown in Fig. 1. Our contributions in this paper include: (1) introduction of a robust supervised retrieval loss for learning in presence of weak labels and potential outliers; (2) training with an auxiliary SI arm with gradual loss


trade-off for improved trainability; and (3) incorporation of the MIPool layer to aggregate representations across variable number of instances within a bag, generating bag-level discriminative hash codes.

2 Methodology

Let us consider a database B = {B_1, ..., B_{N_B}} with N_B bags. Each bag B_i, with a varying number (n_i) of instances, is denoted as B_i = {I_1, ..., I_{n_i}}. We aim at learning H that maps each bag to a K-d Hamming space, H : B → {−1, 1}^K, such that bags with similar instances and labels are mapped to similar codes. For supervised learning of H, we define a bag-level pairwise similarity matrix S^MI = {s_ij}_{i,j=1}^{N_B}, such that s_ij = 1 if the bags are similar and zero otherwise. In applications, such as this one, where retrieval ground truth is unavailable, we can use classification labels as a surrogate for generating S^MI.

Architecture: As shown in Fig. 2, the proposed DMIH framework consists of a deep CNN terminating in a fully connected layer (FCL). Its outputs {z_ij}_{j=1}^{n_i} are fed into the MIPool layer to generate the aggregated representation ẑ_i that is pooled (max_{∀j}{z_ij}_{j=1}^{n_i}, mean(·), etc.) across the instances within the bag. ẑ_i is an embedding in the space of the bags and is the input of a fully connected MI hashing layer. The output of this layer is squashed to [−1, 1] by passing it through a tanh{·} function to generate h_i^MI, which is quantized to produce bag-level hash codes as b_i^MI = sgn(h_i^MI). The deep CNN mentioned earlier could be a pretrained network, such as VGGF [13], GoogleNet [14], ResNet50 (R50) [15] or an application-specific network. During training of DMIH, we introduce an auxiliary SI hashing (aux-SI) arm, as shown in Fig. 2. It taps off at the FCL layer and feeds directly into a fully connected SI hashing layer with tanh{·} activation to generate instance-level non-quantized hash codes, denoted as {h_ij^SI}_{j=1}^{n_i}. While training DMIH using backpropagation, the MIPool layer significantly sparsifies the gradients (analogous to using very high dropout while training CNNs), thus limiting the trainability of the preceding layers. The SI hashing arm helps to mitigate this by producing auxiliary instance-level gradients.

Fig. 2. DMIH architecture.
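A minimal numpy sketch of the MIPool aggregation described above (max or mean pooling over the instance representations of a bag, followed by a toy MI hashing layer) is given below; the feature dimensionality, bag size and weights are arbitrary placeholders.

import numpy as np

def mi_pool(instance_feats, mode="max"):
    """Aggregate a bag of instance representations {z_ij} into one bag-level vector z_hat_i."""
    if mode == "max":
        return instance_feats.max(axis=0)
    return instance_feats.mean(axis=0)

rng = np.random.default_rng(0)
bag = rng.normal(size=(5, 128))          # n_i = 5 instances, 128-d FCL outputs (hypothetical sizes)
z_hat = mi_pool(bag, mode="max")
h_mi = np.tanh(rng.normal(size=(64, 128)) @ z_hat)   # toy fully connected MI hashing layer, 64 bits
b_mi = np.sign(h_mi)                     # bag-level binary code, b_i^MI = sgn(h_i^MI)
print(z_hat.shape, b_mi[:8])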

Model Learning and Robust Optimization: To learn similarity preserving hash codes, we propose a robust version of supervised retrieval loss based on neighborhood component analysis (NCA) employed by [16]. The motivation to introduce robustness within the loss function is two-fold: (1) robustness induces immunity to potentially noisy labels due to high inter-observer variability and limited reproducibility for the applications at hand [1]; (2) it can


effectively counter ambiguous label assignment while training with the aux-SI hashing arm. Given S^MI, the robust supervised retrieval loss J_S^MI is defined as J_S^MI = 1 − (1/N_B²) Σ_{i,j=1}^{N_B} s_ij p_ij, where p_ij is the probability that any two bags (indexed as i and j) are neighbors. Given hash codes h_i = {h_i^k}_{k=1}^K and h_j, we define a bit-wise residual operation r_ij^k as r_ij^k = (h_i^k − h_j^k). We estimate p_ij as:

p_ij = exp(−L_Huber(h_i, h_j)) / Σ_{l≠i}^{N_B} exp(−L_Huber(h_i, h_l)),   where   L_Huber(h_i, h_j) = Σ_{∀k} ρ_k(r_ij^k).        (1)

L_Huber(h_i, h_j) is the Huber norm between the hash codes of bags i and j, while the robustness operation ρ_k is defined as:

ρ_k(r_ij^k) = (1/2)(r_ij^k)²                if |r_ij^k| ≤ c_k
ρ_k(r_ij^k) = c_k |r_ij^k| − (1/2)c_k²      if |r_ij^k| > c_k        (2)

In Eq. (2), the tuning factor c_k is estimated inherently from the data and is set to c_k = 1.345 × σ_k. The factor of 1.345 is chosen to provide approximately 95% asymptotic efficiency, and σ_k is a robust measure of the bit-wise variance of r_ij^k. Specifically, σ_k is estimated as 1.485 times the median absolute deviation of r_ij^k, as empirically suggested in [17]. This robust formulation provides immunity to outliers during training by clipping their gradients. For training with the aux-SI hashing arm, we employ a similar robust retrieval loss J_S^SI defined over single instances with bag labels assigned to member instances. To minimize the loss of retrieval quality due to quantization, we use a differentiable quantization loss J_Q = Σ_{i=1}^M log cosh(|h_i| − 1) proposed in [9]. This loss also counters the effect of using a continuous relaxation in the definition of p_ij over using the Hamming distance. As standard practice in deep learning, we also add an additional weight decay regularization term R_W, which is the Frobenius norm of the weights and biases, to regularize the cost function and avoid over-fitting. The following composite loss is used to train DMIH:

J = λ_MI^t J_S^MI + λ_SI^t J_S^SI + λ_q J_Q + λ_w R_W        (3)
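To make Eqs. (1) and (2) concrete, the sketch below computes the bit-wise robust residuals, the Huber distances and the neighborhood probabilities for a toy set of bag codes; it only illustrates the formulas and is not the training code.

import numpy as np

rng = np.random.default_rng(0)
H = np.tanh(rng.normal(size=(6, 16)))                 # non-quantized bag codes h_i, K = 16 bits

R = H[:, None, :] - H[None, :, :]                     # bit-wise residuals r_ij^k
sigma = 1.485 * np.median(np.abs(R - np.median(R, axis=(0, 1))), axis=(0, 1))  # robust bit-wise scale
c = 1.345 * sigma                                     # per-bit tuning factors c_k

quad = 0.5 * R**2                                     # |r| <= c_k branch of Eq. (2)
lin = c * np.abs(R) - 0.5 * c**2                      # |r| >  c_k branch of Eq. (2)
rho = np.where(np.abs(R) <= c, quad, lin)
L_huber = rho.sum(axis=-1)                            # L_Huber(h_i, h_j)

P = np.exp(-L_huber)
np.fill_diagonal(P, 0.0)
P = P / P.sum(axis=1, keepdims=True)                  # p_ij as in Eq. (1)
print(P.round(3))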

where λ_MI^t, λ_SI^t, λ_q and λ_w are hyper-parameters that control the contribution of each of the loss terms. Specifically, λ_MI^t and λ_SI^t control the trade-off between the MI and SI hashing losses. The SI arm plays a significant role only in the early stages of training and can be traded off eventually to avoid sub-optimal MI hashing. For this we introduce a weight trade-off formulation that gradually down-regulates λ_SI^t, while simultaneously up-regulating λ_MI^t. Here, we use λ_SI^t = 1 − 0.5(1 − t/t_max)² and λ_MI^t = 1 − λ_SI^t, where t is the current epoch and t_max is the maximum number of epochs (see Fig. 3). We train DMIH with mini-batch stochastic gradient descent (SGD) with momentum. Due to potential outliers

Fig. 3. Weight trade-off.


that can occur at the beginning of training, we scale ck up by a factor of 7 for t = 1 to allow a stable state to be reached.

3 Experiments

Databases: Clinical applicability of DMIH has been validated on two large-scale datasets, namely, the Digital Database for Screening Mammography (DDSM) [12,18] and a retrospectively acquired histology dataset from the Indiana University Health Pathology Lab (IUPHL) [4,19]. The DDSM dataset comprises 11,617 expert-selected regions of interest (ROIs) curated from 1861 patients. Multiple ROIs associated with a single breast from two anatomical views constitute a bag (size: 1–12; median: 2), which has been annotated as normal, benign or malignant by expert radiologists. A bag labeled malignant could potentially contain multiple suspect normal and benign masses, which have not been individually identified. The IUPHL dataset is a collection of 653 ROIs from histology slides from 40 patients (20 with precancerous ductal hyperplasia (UDH) and the rest with ductal carcinoma in situ (DCIS)) with ROI-level annotations done by expert histopathologists. Due to the high variability in sizes of these ROIs (up to 9K × 8K pixels), we extract multiple patches and populate an ROI-level bag (size: 1–15; median: 8). From both datasets, we use patient-level non-overlapping splits to constitute the training (80%) and testing (20%) sets. Model Settings and Validations: To validate the proposed contributions, namely robustness within the NCA loss and the trade-off from the aux-SI arm, we perform ablative testing with combinations of their baseline variants by fine-tuning multiple network architectures. Additionally, we compare DMIH against four state-of-the-art methods: ITQ [6], KSH [2], SFLH [8] and DHN [9]. For a fair comparison, we use R50 for both SFLH and DHN since, as discussed later, it performs the best. Since SFLH and DHN were originally proposed for SI hashing, we introduce additional MI variants by hashing through the MIPool layer. For ITQ and KSH, we further create two comparative settings: (1) using IMIH [7], which learns instance-level hash codes followed by bag-level distance computation, and (2) utilizing BMIH [7], which uses bag-level kernelized representations followed by binarization. For IMIH and the SI variants of SFLH, DHN and DMIH, given two bags B_p and B_q with SI hash codes, say H(B_q) = {h_q1, ..., h_qM} and H(B_p) = {h_p1, ..., h_pN}, the bag-level distance is computed as:

d(B_p, B_q) = (1/M) Σ_{i=1}^{M} min_{∀j} Hamming(h_pi, h_qj).        (4)
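A small sketch of Eq. (4) on toy instance-level binary codes:

import numpy as np

def bag_distance(codes_p, codes_q):
    """Eq. (4): average, over the codes of one bag, of the minimum Hamming distance to any code in the other bag."""
    dists = (codes_p[:, None, :] != codes_q[None, :, :]).sum(axis=-1)  # pairwise Hamming distances
    return dists.min(axis=1).mean()

rng = np.random.default_rng(0)
bag_p = rng.integers(0, 2, (4, 16))   # 4 instance codes of 16 bits (hypothetical sizes)
bag_q = rng.integers(0, 2, (6, 16))
print(bag_distance(bag_p, bag_q))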

All images were resized to 224 × 224 and training data were augmented with random rigid transformations to create equally balanced classes. λtMI and λtSI were set assuming tmax as 150 epoch; λq and λw were set at 0.05 and 0.001 respectively. The momentum term within SGD was set to 0.9 and batch size to 128 for DDSM and 32 for IUPHL. For efficient learning, we use an exponentially decaying learning rate initialized at 0.01. The DMIH framework was

Fig. 4. Retrieval results for DMIH at code size 16 bits.

Fig. 5. PR curves (precision vs. recall) for the DDSM and IUPHL datasets at a code size of 32.

Table 1. Performance of ablative testing at code size of 16 bits. We report the nearest neighbor classification accuracy (nnCA) estimated over unseen test data. Letters A–E are introduced for easier comparisons, discussed in Sect. 4.

                                    DDSM                       IUPHL
Method               R    T        VGGF    R50     GN         VGGF    R50     GN
Ablative testing A   ◦    ◦        68.65   72.76   71.70      83.85   85.42   82.29
                 B   ◦    •        75.38   77.34   72.92      85.94   90.10   88.02
                 C   •    ◦        70.65   76.63   70.02      83.33   85.94   86.46
                 D   ◦            66.65   69.67   68.26      83.33   88.54   84.90
                 E   •            67.05   76.59   72.84      84.38   89.58   85.42
DMIH-mean            •    •        78.67   82.31   76.83      87.50   89.58   89.06
DMIH-max             •    •        81.21   85.68   78.67      91.67   95.83   88.02
DMIH (λq = 0)        •    •        75.34   79.88   73.06      87.50   89.58   88.51
DMIH-NB              •    •        83.25   88.02   79.06      94.79   96.35   92.71

Legend: R (robustness): ◦ = L2, • = LHuber; T (trade-off): ◦ = equal weights, • = decaying SIL weights,  = no SIL branch; Networks: R50 = ResNet50, GN = GoogleNet.

implemented in MatConvNet [20]. We use standard retrieval quality metrics: nearest neighbor classification accuracy (nnCA) and precision-recall (PR) curves to perform the aforementioned comparisons. The results (nnCA) from ablative testing and comparative methods are tabulated in Tables 1 and 2 respectively. Within Table 2, methods were evaluated at two different code sizes (16 bits and 32 bits). We also present the PR curves of select bag-level methods (32 bits) in Fig. 5.

4 Results and Discussion

Effect of aux-SI Loss: To justify using the aux-SI loss, we introduce a variant of DMIH without it (E in Table 1), which leads to a significant decline of 3% to 14% in contrast to DMIH. This could be potentially attributed to the prevention of the gradient sparsification caused by the MIPool layer. From Table 1, we observe a 3%–10% increase in performance, comparing cases with gradual decaying tradeoff (B) against baseline setting (λtMI = λtSI = 0.5, A, C). Effect of Robustness: For robust-NCA, we compared against the original NCA formulation proposed in [16] (A, B, D in Table 1). Robustness helps handle potentially noisy MI labels, inconsistencies within a bag and the ambiguity in


assigning SI labels. Comparing the effect of robustness for baselines without the SI hashing arm (D vs. E), we observe a marginally positive improvement across the architectures and datasets, with a substantial 7% in ResNet50 for DDSM. Robustness contributes more with the addition of the aux-SI hash arm (proposed vs. E), with improved performance in the range of 4%–5% across all settings. This observation further validates our prior argument. Effect of Quantization: To assess the effect of quantization, we define two baselines: (1) setting λq = 0 and (2) using non-quantized hash codes for retrieval (DMIH-NB). The latter potentially acts as an upper bound for performance evaluation. From Table 1, we observe a consistent increase in performance by margins of 3%–5% if DMIH is learnt with an explicit quantization loss to limit the associated error. It must also be noted that, compared with DMIH-NB, there is only a marginal fall in performance (2%–4%), which is desired. As a whole, the proposed two-pronged approach, including robustness and trade-off, along with the quantization loss delivers the highest performance, proving that DMIH is able to learn effectively despite the ambiguity induced by the SI hashing arm. Figure 4 demonstrates the retrieval performance of DMIH on the target databases. For IUPHL, the retrieved images are semantically similar to the query, as consistent anatomical signatures are evident in the retrieved neighbors. For DDSM, in the cancer and normal cases the retrieved neighbors are consistent; however, it is hard to distinguish between benign and malignant. The retrieval time for a single query for DMIH was observed at 31.62 ms (for IUPHL) and 17.48 ms (for DDSM), showing potential for fast and scalable search.

Table 2. Results of comparison with state-of-the-art hashing methods.

A/F

L DDSM

IUPHL

16-bit 32-bit 16-bit 32-bit Shallow ITQ [6]

KSH [2]

Deep

SFLH [8]

R50

◦ 66.35 67.71 78.58 80.28

R50

• 64.56 71.98 89.58 79.69

G

◦ 65.22 66.55 51.79 51.42

G

• 59.73 61.03 57.29 58.85

R50

◦ 61.88 64.81 87.74 86.51

R50

• 59.81 72.17 70.83 80.21

G

◦ 60.50 61.91 57.36 57.83

G

• 55.34 55.67 60.94 58.85

R50

◦ 73.54 77.46 83.33 85.94

R50M  71.98 75.93 85.42 88.54 DHN [9]

R50

◦ 65.64 74.79 82.29 86.46

R50M  72.88 80.43 88.02 90.62 DMIH-SIL R50 DMIH Legend A/F:

L:

◦ 76.02 78.37 87.92 88.58

R50M  85.68 89.47 95.83 93.23 A: Architecture, F: Features R50: ResNet50, R50M: ResNet50+MIPool, G: GIST ◦ = IMIH, • = BMIH,  = End-to-end

Comparative Methods: In the contrastive experiments against ITQ and KSH, handcrafted GIST [21] features underperformed significantly, while the improvement with the R50 features ranged from 5%–30%. However, DMIH still performed 10%–25% better. Comparing the SI with the MI variations of DHN, SFLH and DMIH, it is observed that the performance improved in the range of 3%–11%, suggesting that end-to-end learning of MI hash codes is preferred over two-stage hashing, i.e. hashing at the SI level and comparing at the bag level with Eq. (4). However, DMIH fares comparably better than both the SI and MI versions of SFLH and DHN, owing to the robustness of the proposed


retrieval loss function. As also seen from the associated PR curves in Fig. 5, the performance gap between shallow and deep hashing methods remains significant despite using R50 features. Comparative results strongly support our premise that end-to-end learning of MI hash codes is preferred over conventional two-stage approaches.

5 Conclusion

In this paper, for the first time, we propose an end-to-end deep robust hashing framework, termed DMIH, for retrieval under a multiple instance setting. We incorporate the MIPool layer to aggregate representations across instances to generate a bag-level discriminative hash code. We introduce the notion of robustness into our supervised retrieval loss and improve the trainability of DMIH by utilizing an aux-SI hashing arm regulated by a trade-off. Extensive validations and ablative testing on two public breast cancer datasets demonstrate the superiority of DMIH and its potential for future extension to other MI applications.

Acknowledgements. The authors would like to warmly thank Dr. Shaoting Zhang for generously sharing the datasets used in this paper. We would also like to thank Abhijit Guha Roy and Andrei Costinescu for their insightful comments about this work.

References
1. Duijm, L.E.M., Louwman, M.W.J., Groenewoud, J.H., van de Poll-Franse, L.V., Fracheboud, J., Coebergh, J.W.: Inter-observer variability in mammography screening and effect of type and number of readers on screening outcome. BJC 24, 901–907 (2009)
2. Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with kernels. In: CVPR 2012, pp. 2074–2081. IEEE (2012)
3. Conjeti, S., Guha Roy, A., Katouzian, A., Navab, N.: Hashing with residual networks for image retrieval. In: 20th International Conference on Medical Image Computing and Computer Assisted Intervention, Canada (2017)
4. Zhang, X., Liu, W., Dundar, M., Badve, S., Zhang, S.: Towards large-scale histopathological image analysis: hashing-based image retrieval. In: TMI 2015. IEEE (2015)
5. Veta, M., Pluim, J.P., van Diest, P.J., Viergever, M.A.: Breast cancer histopathology image analysis: a review. Trans. Biomed. Eng. 61, 1400–1411 (2014). IEEE
6. Gong, Y., Lazebnik, S.: Iterative quantization: a procrustean approach to learning binary codes. In: CVPR 2011, pp. 817–824. IEEE (2011)
7. Yang, Y., Xu, X.-S., Wang, X., Guo, S., Cui, L.: Hashing multi-instance data from bag and instance level. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds.) APWeb 2015. LNCS, vol. 9313, pp. 437–448. Springer, Cham (2015). doi:10.1007/978-3-319-25255-1_36
8. Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. In: CVPR 2015, pp. 3270–3278 (2015)


9. Zhu, H., Long, M., Wang, J., Cao, Y.: Deep hashing network for efficient similarity retrieval. In: AAAI 2016 (2016)
10. Wu, J., Yu, Y., Huang, C., Yu, K.: Multiple instance learning for image classification and auto-annotation. In: CVPR 2015, pp. 3460–3469 (2015)
11. Zhennan, Y., Yiqiang, Z., Zhigang, P., Shu, L., Shinagawa, Y., Shaoting, Z., Metaxas, D.N., Xiang, S.Z.: Multi-instance deep learning: discover discriminative local anatomies for bodypart recognition. Trans. Med. Imaging 35, 1332–1343 (2016). IEEE
12. Jiang, M., Zhang, S., Li, H., Metaxas, D.N.: Computer-aided diagnosis of mammographic masses using scalable image retrieval. TBME 62, 783–792 (2015)
13. Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: delving deep into convolutional nets (2014). arXiv:1405.3531
14. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR 2015, pp. 1–9 (2015)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR 2016, pp. 770–778. IEEE Computer Society (2016)
16. Torralba, A., Fergus, R., Weiss, Y.: Small codes and large image databases for recognition. In: CVPR 2008, pp. 1–8. IEEE (2008)
17. Huber, P.J.: Robust statistics. In: Lovric, M. (ed.) International Encyclopedia of Statistical Science, pp. 1248–1251. Springer, Heidelberg (2011)
18. Heath, M., Bowyer, K., Kopans, D., Kegelmeyer Jr., W.P., Moore, R., Chang, K., Munishkumaran, S.: Current status of the digital database for screening mammography. In: Karssemeijer, N., Thijssen, M., Hendriks, J., van Erning, L. (eds.) Digital Mammography, pp. 457–460. Springer, Dordrecht (1998). doi:10.1007/978-94-011-5318-8_75
19. Badve, S., Bilgin, G., Dundar, M., Gurcan, M.N., Jain, R.K., Raykar, V.C., Sertel, O.: Computerized classification of intraductal breast lesions using histopathological images. Biomed. Eng. 58, 1977–1984 (2011). IEEE
20. Vedaldi, A., Lenc, K.: MatConvNet: convolutional neural networks for MATLAB. In: ACM International Conference on Multimedia 2015, pp. 689–692. ACM (2015)
21. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42, 145–175 (2001)

Accurate Pulmonary Nodule Detection in Computed Tomography Images Using Deep Convolutional Neural Networks

Jia Ding, Aoxue Li, Zhiqiang Hu, and Liwei Wang(B)

The Key Laboratory of Machine Perception (MOE), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
{dingjia,lax,huzq}@pku.edu.cn, [email protected]

Abstract. Early detection of pulmonary cancer is the most promising way to enhance a patient's chance for survival. Accurate pulmonary nodule detection in computed tomography (CT) images is a crucial step in diagnosing pulmonary cancer. In this paper, inspired by the successful use of deep convolutional neural networks (DCNNs) in natural image recognition, we propose a novel pulmonary nodule detection approach based on DCNNs. We first introduce a deconvolutional structure to Faster Region-based Convolutional Neural Network (Faster R-CNN) for candidate detection on axial slices. Then, a three-dimensional DCNN is presented for the subsequent false positive reduction. Experimental results of the LUng Nodule Analysis 2016 (LUNA16) Challenge demonstrate the superior detection performance of the proposed approach on nodule detection (average FROC-score of 0.893, ranking the 1st place over all submitted results), which outperforms the best result on the leaderboard of the LUNA16 Challenge (average FROC-score of 0.864).

1 Introduction

Pulmonary cancer, causing 1.3 million deaths annually, is a leading cause of cancer death worldwide [8]. Detection and treatment at an early stage are required to effectively overcome this burden. Computed tomography (CT) was recently adopted as a mass-screening tool for pulmonary cancer diagnosis, enabling rapid improvement in the ability to detect tumors early. Due to the development of CT scanning technologies and rapidly increasing demand, radiologists are overwhelmed with the amount of data they are required to analyze. Computer-Aided Detection (CAD) systems have been developed to assist radiologists in the reading process, thereby potentially making pulmonary cancer screening more effective. The architecture of a CAD system for pulmonary nodule detection typically consists of two stages: nodule candidate detection and false positive reduction. Many CAD systems have been proposed for nodule detection [7,10]. Torres et al. detect candidates with a dedicated dot-enhancement filter, and then a feed-forward neural network based on a small set of hand-crafted features is used to reduce false positives [10]. Although conventional CAD systems have yielded promising results, they still have two distinct drawbacks as follows.


Fig. 1. The framework of the proposed CAD system.

– Traditional CAD systems detect candidates based on some simple assumptions (e.g. nodules look like a sphere) and propose some low-level descriptors [10]. Due to the high variability of nodule shape, size, and texture, low-level descriptors fail to capture discriminative features, resulting in inferior detection results.
– Since CT images are inherently 3D, 3D contexts play an important role in recognizing nodules. However, while several 2D/2.5D deep neural networks have achieved promising performance in false positive reduction [6,11], few works focus on introducing 3D contexts for nodule detection directly.

In this paper, to address the aforementioned two issues, we propose a novel CAD system based on DCNNs for accurate pulmonary nodule detection. In the proposed CAD system, we first introduce a deconvolutional structure to Faster Region-based Convolutional Neural Network (Faster R-CNN), the state-of-the-art general object detection model, for candidate detection on axial slices. Then, a three-dimensional DCNN (3D DCNN) is presented for false positive reduction. The framework of our CAD system is illustrated in Fig. 1. To evaluate the effectiveness of our CAD system, we test it on the LUng Nodule Analysis 2016 (LUNA16) challenge [7], and yield the 1st place of the Nodule Detection Track (NDET) with an average FROC-score of 0.893, which outperforms the best


result on the leaderboard (average FROC-score of 0.864). Our system achieves high detection sensitivities of 92.2% and 94.4% at 1 and 4 false positives per scan, respectively.

2 The Proposed CAD System

In this section, we propose a CAD system based on DCNNs for accurate pulmonary nodule detection, which incorporates two main stages: (1) candidate detection, by introducing a deconvolutional structure into Faster R-CNN, and (2) false positive reduction, using a three-dimensional DCNN.

2.1 Candidate Detection Using Improved Faster R-CNN

Candidate detection, a crucial step in CAD systems, aims to restrict the total number of nodule candidates while retaining high sensitivity. Inspired by the successful use of DCNNs in object recognition [5], we propose a DCNN model for detecting nodule candidates from CT images, in which a deconvolutional structure is introduced into the state-of-the-art general object detection model, Faster R-CNN, to fit the size of nodules. The details of our candidate detection model are given as follows.
We first describe how the inputs for our candidate detection network are generated. Since using the 3D volume of the original CT scan as the DCNN input gives rise to a high computational cost, we use axial slices as inputs instead. For each axial slice in a CT image, we concatenate its two neighboring slices in the axial direction, and then rescale it to 600 × 600 × 3 pixels (as shown in Fig. 1).
In the following, we describe the architecture of the proposed candidate detection network. The network is composed of two modules: a region proposal network (RPN), which proposes potential regions of nodules (also called Regions-of-Interest, ROIs), and a ROI classifier, which recognizes whether ROIs are nodules or not. In order to save the computational cost of training DCNNs, these two modules share the same feature extraction layers.
Region Proposal Network. The region proposal network takes an image as input and outputs a set of rectangular object proposals (i.e., ROIs), each with an objectness score [5]. The structure of the network is given as follows. Owing to the much smaller size of pulmonary nodules compared with common objects in natural images, the original Faster R-CNN, which utilizes the five groups of convolutional layers of VGG-16Net [9] for feature extraction, cannot explicitly depict the features of nodules and yields limited performance in detecting nodule ROIs. To address this problem, we add a deconvolutional layer, whose kernel size, stride, padding and number of kernels are 4, 4, 2 and 512 respectively, after the last layer of the original feature extractor. Because the added deconvolutional layer recovers finer-grained features than the original feature maps, our model yields much better detection results than the original Faster R-CNN. To generate ROIs, we slide a small network over the feature map of the deconvolutional layer. This small network takes a 3 × 3


spatial window of the deconvolutional feature map as input and maps each sliding window to a 512-dimensional feature. The feature is finally fed into two sibling fully-connected layers for regressing the bounding box of regions (i.e., the Reg Layer in Fig. 1) and predicting the objectness score (i.e., the Cls Layer in Fig. 1), respectively. At each sliding-window location, we simultaneously predict multiple ROIs. The multiple ROIs are parameterized relative to corresponding reference boxes, which we call anchors. To fit the size of nodules, we design six anchors of different sizes for each sliding window: 4 × 4, 6 × 6, 10 × 10, 16 × 16, 22 × 22, and 32 × 32 (see Fig. 2). A detailed description of the RPN is given in [5].
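The sketch below illustrates, in PyTorch, how a deconvolutional layer with the stated hyper-parameters (kernel 4, stride 4, padding 2, 512 output channels) can be appended to a VGG-16 feature extractor; the module and variable names are our own illustrative choices and not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class DeconvBackbone(nn.Module):
    """VGG-16 convolutional backbone followed by one deconvolution layer,
    as described for the improved Faster R-CNN (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        # Five convolutional groups of VGG-16 (all layers before the classifier).
        self.features = vgg16().features
        # Deconvolution recovering finer-grained feature maps:
        # kernel 4, stride 4, padding 2, 512 output channels.
        self.deconv = nn.ConvTranspose2d(512, 512, kernel_size=4,
                                         stride=4, padding=2)

    def forward(self, x):
        return torch.relu(self.deconv(self.features(x)))

# Six square anchor sizes (in pixels) used at every sliding-window position.
ANCHOR_SIZES = [4, 6, 10, 16, 22, 32]

if __name__ == "__main__":
    feats = DeconvBackbone()(torch.randn(1, 3, 600, 600))
    print(feats.shape)  # torch.Size([1, 512, 68, 68]) for a 600 x 600 input
```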

Fig. 2. Illustration of anchors in the improved Faster R-CNN

ROI Classification Using a Deep Convolutional Neural Network. With the ROIs extracted by the RPN, a DCNN is developed to decide whether each ROI is a nodule or not. A ROI pooling layer is first exploited to map each ROI to a small feature map with a fixed spatial extent W × H (7 × 7 in this paper). ROI pooling works by dividing the ROI into a W × H grid of sub-windows and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. After the ROI pooling layer, a fully-connected network, composed of two 4096-way fully-connected layers, maps the fixed-size feature map into a feature vector. A regressor and a classifier based on this feature vector (i.e., BBox Reg and BBox Cls in Fig. 1) then respectively regress the bounding boxes of candidates and predict candidate confidence scores. In the training process, by merging the RPN and the ROI classifier into one network, we define the loss function for an image as

L_t = (1/N_c) \sum_i L_c(\hat{p}_i, p_i^*) + (1/N_r) \sum_i L_r(\hat{t}_i, t_i^*) + (1/N_c') \sum_j L_c(\tilde{p}_j, p_j^*) + (1/N_r') \sum_j L_r(\tilde{t}_j, t_j^*)    (1)

where N_c, N_r, N_c' and N_r' denote the total numbers of inputs in the Cls Layer, Reg Layer, BBox Cls and BBox Reg, respectively. \hat{p}_i and p_i^* respectively denote the predicted and true probability of anchor i being a nodule. \hat{t}_i is a vector representing the 4 parameterized coordinates of the predicted bounding box of


the RPN, and t_i^* is that of the ground-truth box associated with a positive anchor. In the same way, \tilde{p}_j, p_j^*, \tilde{t}_j and t_j^* denote the corresponding quantities in the ROI classifier. The detailed definitions of the classification loss L_c and the regression loss L_r are the same as in [5].
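As a concrete illustration of Eq. (1), the sketch below combines the four terms using PyTorch's built-in cross-entropy and smooth-L1 losses; the function and tensor names are illustrative assumptions, since the paper does not publish reference code.

```python
import torch.nn.functional as F

def detection_loss(rpn_cls_logits, rpn_labels, rpn_box_pred, rpn_box_target,
                   roi_cls_logits, roi_labels, roi_box_pred, roi_box_target):
    """Joint RPN + ROI-classifier loss of Eq. (1) (illustrative sketch).

    Classification terms are averaged over all classified inputs (N_c, N'_c);
    regression terms are averaged over the regressed boxes (N_r, N'_r) and,
    in practice, are computed only for positive samples."""
    l_cls_rpn = F.cross_entropy(rpn_cls_logits, rpn_labels)     # (1/N_c)  sum L_c
    l_reg_rpn = F.smooth_l1_loss(rpn_box_pred, rpn_box_target)  # (1/N_r)  sum L_r
    l_cls_roi = F.cross_entropy(roi_cls_logits, roi_labels)     # (1/N'_c) sum L_c
    l_reg_roi = F.smooth_l1_loss(roi_box_pred, roi_box_target)  # (1/N'_r) sum L_r
    return l_cls_rpn + l_reg_rpn + l_cls_roi + l_reg_roi
```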

2.2 False Positive Reduction Using 3D DCNN

In consideration of time and space costs, we propose a two-dimensional (2D) DCNN (i.e., the improved Faster R-CNN) to detect nodule candidates (see Sect. 2.1). With the extracted nodule candidates, a 3D DCNN, which captures the full range of spatial context around each candidate and generates more discriminative features than 2D DCNNs, is utilized for false positive reduction. This network contains six 3D convolutional layers, each followed by a Rectified Linear Unit (ReLU) activation layer, three 3D max-pooling layers, three fully-connected layers, and a final 2-way softmax layer that classifies candidates as nodules or non-nodules. Moreover, dropout layers are added after the max-pooling and fully-connected layers to avoid overfitting. We initialize the parameters of the proposed 3D DCNN with the same strategy as in [3]. The detailed architecture of the proposed 3D DCNN is illustrated in Fig. 3.

Fig. 3. The architecture of the proposed three-dimensional deep convolutional neural network. In this figure, ‘Conv’, ‘Pool’, and ‘FC’ denote the convolutional layer, pooling layer and fully-connected layer, respectively.
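The sketch below outlines a 3D CNN with the stated layer counts (six 3D convolutions with ReLU, three 3D max-poolings, three fully-connected layers, dropout, and a 2-way output); the channel counts, kernel sizes and dropout rates are assumptions for illustration, since the exact widths shown in Fig. 3 are not reproduced here.

```python
import torch.nn as nn

class FalsePositiveReduction3DCNN(nn.Module):
    """3D DCNN for nodule / non-nodule classification (illustrative sketch).
    Channel counts, kernel sizes and dropout rates are assumed, not taken
    from the paper."""
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(cout, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool3d(2), nn.Dropout3d(0.3))
        # Three blocks of two 3D convolutions each = six convolutional layers.
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        # Input patches of 20 x 36 x 36 voxels shrink to 2 x 4 x 4 after pooling.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 2 * 4 * 4, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2))  # 2-way output; softmax is applied in the loss

    def forward(self, x):  # x: (batch, 1, 20, 36, 36)
        return self.classifier(self.features(x))
```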

As for the inputs of the proposed 3D DCNN, we first normalize each CT scan with a mean of −600 and a standard deviation of −300. After that, for each candidate, we use the center of the candidate as the centroid and crop a 40 × 40 × 24 patch. The strategy for data augmentation is given as follows (a code sketch follows this list).
– Crop. For each 40 × 40 × 24 patch, we crop smaller patches of size 36 × 36 × 20 from it, thereby augmenting each candidate 125 times.
– Flip. Each cropped 36 × 36 × 20 patch is flipped along the three orthogonal axes (coronal, sagittal and axial), finally augmenting each candidate 8 × 125 = 1000 times.
– Duplicate. In the training process, whether a candidate is positive or negative is decided by whether its geometric center lies inside a nodule or not. To balance the numbers of positive and negative patches in the training set, we duplicate positive patches 8 times.
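A NumPy sketch of the crop-and-flip augmentation described above; the array axis order and function name are our own illustrative choices.

```python
import itertools
import numpy as np

def augment_candidate(patch):
    """Yield the 8 x 125 = 1000 augmented views of one candidate patch.

    patch: np.ndarray of shape (40, 40, 24), ordered (x, y, z).  Illustrative sketch."""
    assert patch.shape == (40, 40, 24)
    # 125 crops: every 36 x 36 x 20 sub-volume offset by 0..4 voxels per axis.
    for ox, oy, oz in itertools.product(range(5), range(5), range(5)):
        crop = patch[ox:ox + 36, oy:oy + 36, oz:oz + 20]
        # 8 flips: every combination of flipping the three orthogonal axes.
        for fx, fy, fz in itertools.product((False, True), repeat=3):
            view = crop
            if fx: view = view[::-1, :, :]
            if fy: view = view[:, ::-1, :]
            if fz: view = view[:, :, ::-1]
            yield np.ascontiguousarray(view)
```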


Note that the 3D context of a candidate plays an important role in recognizing nodules, owing to the inherently 3D structure of CT images. Our 3D convolutional filters, which integrate 3D local units of the previous feature maps, can 'see' the 3D context of candidates, whereas traditional 2D convolutional filters only capture 2D local features. Hence, the proposed 3D DCNN outperforms traditional 2D DCNNs.

3 Experimental Results and Discussions

In this section, we evaluate the performance of our CAD system on the LUNA16 Challenge [7]. Its dataset was collected from the largest publicly available reference database for pulmonary nodules, LIDC-IDRI [2], which contains a total of 1018 CT scans. For the purpose of pulmonary nodule detection, CT scans with a slice thickness greater than 3 mm, inconsistent slice spacing or missing slices were excluded, leading to a final list of 888 scans. The goal of this challenge is to automatically detect nodules in these volumetric CT images. In the LUNA16 challenge, the performance of CAD systems is evaluated using Free-Response Receiver Operating Characteristic (FROC) analysis [7]. Sensitivity is defined as the number of detected true positives divided by the total number of nodules. In the FROC curve, sensitivity is plotted as a function of the average number of false positives per scan (FPs/scan). The average FROC-score is defined as the average of the sensitivity at seven false positive rates: 1/8, 1/4, 1/2, 1, 2, 4, and 8 FPs per scan.
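The average FROC-score can be computed from a FROC curve as sketched below; the linear interpolation between operating points is an assumption for illustration, since the challenge's exact evaluation script is not reproduced here.

```python
import numpy as np

def average_froc_score(fps_per_scan, sensitivity):
    """Average of the sensitivities at 1/8, 1/4, 1/2, 1, 2, 4 and 8 FPs/scan.

    fps_per_scan, sensitivity: 1-D arrays describing the FROC curve,
    sorted by increasing FPs/scan.  Illustrative sketch."""
    targets = [0.125, 0.25, 0.5, 1, 2, 4, 8]
    # Sensitivity at each target rate, interpolated between operating points.
    sens_at_targets = np.interp(targets, fps_per_scan, sensitivity)
    return float(np.mean(sens_at_targets))
```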

3.1 Candidate Detection Results

The candidate detection results of our CAD system, together with the other candidate detection methods submitted to LUNA16 [7], are shown in Table 1. From this table, we can observe that our CAD system achieves the highest sensitivity (94.6%) with the fewest candidates per scan (15.0) among these CAD systems, which verifies the superiority of the improved Faster R-CNN in the task of candidate detection. We also compare against two baseline versions of the improved Faster R-CNN (see Table 1). 'Baseline (w/o deconv)' omits the deconvolutional layer in feature extraction, and 'Baseline (4 anchors)' uses only four anchors (i.e., 4 × 4, 10 × 10, 16 × 16, and 32 × 32) in the improved Faster R-CNN. According to Table 1, the comparison between 'Ours' and 'Baseline (w/o deconv)' verifies the effectiveness of the deconvolutional layer in the improved Faster R-CNN, while the comparison between 'Ours' and 'Baseline (4 anchors)' indicates that the proposed six anchors are more suitable for candidate detection.

Table 1. The comparison of CAD systems in the task of candidate detection.

System                  Sensitivity   Candidates/scan
ISICAD                  0.856         335.9
SubsolidCAD             0.361         290.6
LargeCAD                0.318         47.6
M5L                     0.768         22.2
ETROCAD                 0.929         333.0
Baseline (w/o deconv)   0.817         22.7
Baseline (4 anchors)    0.895         25.8
Ours                    0.946         15.0

3.2 False Positive Reduction Results

To evaluate the performance of our 3D DCNN in the task of false positive reduction, we construct a baseline using NIN [4], a state-of-the-art 2D DCNN model for general image recognition. To fit NIN to the task of false positive reduction, we modify its input size from 32 × 32 × 3 to 36 × 36 × 7 and change the number of final softmax outputs to 2. For a fair comparison, we use the same candidates and data augmentation strategy as for the proposed 3D DCNN. The comparison of the two DCNNs in the task of false positive reduction is provided in Fig. 4. Experimental results demonstrate that our 3D DCNN significantly outperforms the 2D NIN, which verifies the superiority of the proposed 3D DCNN over 2D DCNNs in false positive reduction. We further present a comparison among the top results on the leaderboard of the LUNA16 Challenge¹, shown in Fig. 4. From this figure, we can observe that our model attains the best performance among the submitted CAD systems in the task of nodule detection. Although Ethan20161221 and resnet yield comparable performance when the number of FPs/scan is more than 2, they show a significant drop below 2 FPs/scan, which limits their practicability in

Fig. 4. Comparison of performance between our CAD system and other submitted approaches on the LUNA16 Challenge: (a) average FROC-scores, (b) FROC curves.

¹ As of the submission of this paper; https://luna16.grand-challenge.org/results/.


nodule detection. Moreover, Aidence trained its model using labeled data from the NLST dataset [1]; its result is therefore not directly comparable to ours. Note that, since most CAD systems used in clinical diagnosis have their internal threshold set to operate somewhere between 1 and 4 false positives per scan on average, our system is well suited to clinical use.

4 Conclusion

In this study, we propose a novel pulmonary nodule detection CAD system based on deep convolutional networks, in which a deconvolutional improved Faster R-CNN is developed to detect nodule candidates on axial slices and a 3D DCNN is then exploited for false positive reduction. Experimental results on the LUNA16 Nodule Detection Challenge demonstrate that the proposed CAD system ranks 1st in the Nodule Detection Track (NDET) with an average FROC-score of 0.893. We believe that our CAD system could be a powerful tool for the clinical diagnosis of pulmonary cancer.

Acknowledgements. This work was partially supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026) and the MOE-Microsoft Key Laboratory of Statistics and Machine Learning, Peking University. We would like to thank the anonymous reviewers for their valuable comments on our paper.

References
1. Alberts, D.S.: The National Lung Screening Trial Research Team: reduced lung-cancer mortality with low-dose computed tomographic screening. New Engl. J. Med. 365(5), 395–409 (2011)
2. Armato, S., McLennan, G., Bidaut, L., et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
3. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
4. Lin, M., Chen, Q., Yan, S.: Network in network (2013). arXiv:1312.4400
5. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
6. Setio, A.A.A., Ciompi, F., Litjens, G., et al.: Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 35(5), 1160–1169 (2016)
7. Setio, A.A.A., Traverso, A., Bel, T., et al.: Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge (2016). arXiv:1612.08012
8. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics, 2015. CA Cancer J. Clin. 65(1), 5–29 (2015)


9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations, pp. 1–9 (2015)
10. Torres, E.L., Fiorina, E., Pennazio, F., et al.: Large scale validation of the M5L lung CAD on heterogeneous CT datasets. Med. Phys. 42(4), 1477–1489 (2015)
11. Zagoruyko, S., Komodakis, N.: Wide residual networks (2016). arXiv:1605.07146

Discriminative Localization in CNNs for Weakly-Supervised Segmentation of Pulmonary Nodules

Xinyang Feng¹, Jie Yang¹, Andrew F. Laine¹, and Elsa D. Angelini¹,²(B)

¹ Department of Biomedical Engineering, Columbia University, New York, NY, USA
² ITMAT Data Science Group, NIHR Imperial BRC, Imperial College, London, UK
[email protected]

Abstract. Automated detection and segmentation of pulmonary nodules on lung computed tomography (CT) scans can facilitate early lung cancer diagnosis. Existing supervised approaches for automated nodule segmentation on CT scans require voxel-based annotations for training, which are labor- and time-consuming to obtain. In this work, we propose a weakly-supervised method that generates accurate voxel-level nodule segmentations while being trained with image-level labels only. By adapting a convolutional neural network (CNN) trained for image classification, our proposed method learns discriminative regions from the activation maps of convolution units at different scales, and identifies the true nodule location with a novel candidate-screening framework. Experimental results on the public LIDC-IDRI dataset demonstrate that our weakly-supervised nodule segmentation framework achieves competitive performance compared to a fully-supervised CNN-based segmentation method.

1 Introduction

Lung cancer is a major cause of cancer-related deaths worldwide. Pulmonary nodules refer to a range of lung abnormalities that are visible on lung computed tomography (CT) scans as roughly round opacities, and have been regarded as crucial indicators of primary lung cancers [1]. The detection and segmentation of pulmonary nodules in lung CT scans can facilitate early lung cancer diagnosis and timely surgical intervention, and thus increase the survival rate [2]. Automated detection systems that locate and segment nodules of various sizes can assist radiologists in cancer malignancy diagnosis. Existing supervised approaches for automated nodule segmentation require voxel-level annotations for training, which are labor-intensive and time-consuming to obtain. Alternatively, image-level labels, such as a binary label indicating the presence of nodules, can be obtained more efficiently. Recent work [3,4] studied nodule segmentation using weakly labeled data without dense voxel-level annotations. Their methods, however, still rely on user inputs for additional information, such as the exact nodule location and the estimated nodule size, during segmentation. Convolutional neural networks (CNNs) have been widely used for supervised image classification and segmentation tasks. It was very recently discovered in a


study [5] on natural images that CNNs trained with semantic labels for an image classification task ("what") have a remarkable capability to identify the discriminative regions ("where") when combined with a global average pooling (GAP) operation. This method utilizes the up-sampled weighted activation maps from the last convolutional layer of a CNN. It demonstrated the localization capability of CNNs for detecting relatively large targets within an image, which is not the general scenario in the medical imaging domain, where pathological changes are more varied in size and rather subtle to capture. Nevertheless, this work sheds light on weakly-supervised disease detection. In this work, we exploit a CNN for accurate and fully-automated segmentation of nodules in a weakly-supervised manner, using binary slice-level labels only. Specifically, we adapt a classic image classification CNN model to detect slices with nodules, and simultaneously learn the discriminative regions from the activation maps of convolution units at different scales for coarse segmentation. We then introduce a candidate-screening framework utilizing the same network to generate accurate localization and segmentation. Experimental results on the public LIDC-IDRI dataset [6,7] demonstrate that, despite the largely reduced amount of annotation required for training, our weakly-supervised nodule segmentation framework achieves competitive performance compared to a CNN-based fully-supervised segmentation method.

2 Method

The framework is overviewed in Fig. 1. There are two stages: a training stage and a segmentation stage. In the first stage, we train a CNN model to classify CT slices as with or without nodule. The CNN is composed of a fully convolutional component, a convolutional layer + global average pooling layer (Conv + GAP) structure, and a final fully-connected (FC) layer. Besides providing a binary classification, the CNN generates a nodule activation map (NAM) showing potential nodule localizations, using a weighted average of the activation maps with the weights learnt in the FC layer. In the second stage, a coarse segmentation of nodule candidates is generated within a spatial scope defined by the NAM. For fine segmentation, each nodule candidate is masked out from the image alternately. By feeding the masked image into the same network, a residual NAM (called R-NAM) is generated and used to select the true nodule. Shallower layers in the CNN can be concatenated into the classification task through a skip architecture and a Conv + GAP structure, extending the one-GAP CNN model to a multi-GAP CNN that is able to generate NAMs with higher resolution.

2.1 Nodule Activation Map

In a classification-oriented CNN, while the shallower layers represent general appearance information, the deep layers encode discriminative information that is specific to the classification task. Benefiting from the convolutional structure, spatial information can be retained in the activations of convolutional units.


Fig. 1. (A) Training: a CNN model is trained to classify CT slices and generate nodule activation maps (NAMs); (B) Segmentation: for test slices classified as “nodule slice”, nodule candidates are screened using a spatial scope defined by the NAM for coarse segmentation. Residual NAMs (R-NAMs) are generated from images with masked nodule candidates for fine segmentation.

Activation maps of deep convolutional layers therefore enable discriminative spatial localization of the class of interest. In our case, we locate nodules with a specially generated weighted activation map called the nodule activation map.
One-GAP CNN. For a given image I, we denote the activation of unit k at spatial location (x, y) in the last convolutional layer as a_k(x, y). The activation of each unit k is summarized through a spatially global average pooling operation as A_k = \sum_{(x,y)} a_k(x, y). The feature vector constituted of the A_k is followed by a FC layer, which generates the nodule classification score (i.e., the input to the softmax function for the nodule class) as

S_{nodule} = \sum_k w_{k,nodule} A_k = \sum_k w_{k,nodule} \sum_{(x,y)} a_k(x, y)    (1)

where the weights w_{k,nodule} learnt in the FC layer essentially measure the importance of unit k in the classification task. As spatial information is retained in the activation maps through a_k(x, y), a weighted average of the activation maps


results in a robust nodule activation map:

NAM(x, y) = \sum_k w_{k,nodule} a_k(x, y)    (2)

The nodule classification score can be directly linked with the NAM by

S_{nodule} = \sum_k \sum_{(x,y)} w_{k,nodule} a_k(x, y) = \sum_{(x,y)} NAM(x, y)    (3)

By simply up-sampling the NAM to the size of the input image I, we can identify the discriminative image region that is most relevant to the nodule.
Multi-GAP CNN. Although the activation maps of the last convolutional layer carry the most discriminative information, they are usually greatly down-sampled with respect to the original image resolution, due to pooling operations. We hereby introduce a multi-GAP CNN model that takes advantage of shallower layers with higher spatial resolution. Similar to the idea of the skip architecture proposed for the fully-convolutional network (FCN) [8], shallower layers can be directed to the final classification task, skipping the following layers. We also add a Conv + GAP structure following the shallow layers. The concatenation of the feature vectors generated by each GAP layer is fed into the final FC layer. The NAM generated by the multi-GAP CNN model (multi-GAP NAM) is a weighted activation map involving activations at multiple scales.
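The NAM of Eq. (2) can be computed from a trained one-GAP CNN as sketched below; the tensor shapes and variable names are illustrative assumptions.

```python
import numpy as np

def nodule_activation_map(activations, fc_weights, nodule_class=1):
    """Weighted average of last-layer activation maps (Eq. 2), illustrative sketch.

    activations: array of shape (K, H, W), activations a_k(x, y) of the last
                 convolutional layer for one image.
    fc_weights:  array of shape (num_classes, K), weights of the final FC layer
                 that follows the global average pooling.
    Returns the NAM of shape (H, W); up-sample it to the input image size
    before thresholding."""
    w = fc_weights[nodule_class]                 # w_{k, nodule}
    return np.tensordot(w, activations, axes=1)  # sum_k w_k * a_k(x, y)
```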

2.2 Segmentation

Coarse Segmentation. For slices classified as "nodule slice", nodule candidates are screened within a spatial scope C defined by the most prominent blob in the NAM, processed via watershed. They are then coarsely segmented using an iterated conditional mode (ICM) based multi-phase segmentation method [9], with the number of phases equal to four, as determined by the global intensity distribution.
Fine Segmentation. The NAM indicates a potential, but not exact, nodule location. To identify the true nodule among the coarse segmentation results, i.e., which nodule candidate triggered the activation, we generate residual NAMs (R-NAMs) by masking each nodule candidate R_j alternately and feeding the masked image I\R_j into the same network. A significant change of activations within C indicates the exclusion of a true nodule. Formally, we generate the fine segmentation by selecting the nodule candidate R_k following

R_k = argmax_{R_j} \sum_{(x,y) \in C} \left( NAM_I(x, y) - NAM_{I \setminus R_j}(x, y) \right)^2    (4)

where NAM_I is the original NAM, and NAM_{I\R_j} is the R-NAM generated by masking nodule candidate R_j. Our current implementation targets the segmentation of one nodule per NAM. The incidence of slices with two nodules is about 1% among slices with nodules, and no slice contains more than two nodules in our dataset.
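The candidate-selection rule of Eq. (4) is straightforward to express in code. The sketch below assumes a `model_nam(image)` helper that returns the NAM for an image (a hypothetical wrapper around the trained network) and NumPy boolean masks for the candidates.

```python
import numpy as np

def select_true_nodule(image, candidates, scope_mask, model_nam):
    """Pick the candidate whose masking changes the NAM most inside scope C (Eq. 4).

    image:      2-D slice (H, W)
    candidates: list of boolean masks R_j, each of shape (H, W)
    scope_mask: boolean mask of the screening scope C, shape (H, W)
    model_nam:  callable returning the NAM (H, W) of an image (assumed helper)."""
    nam_full = model_nam(image)
    best_j, best_change = None, -np.inf
    for j, region in enumerate(candidates):
        masked = image.copy()
        masked[region] = 0                 # mask out candidate R_j
        r_nam = model_nam(masked)          # residual NAM for I \ R_j
        change = np.sum((nam_full[scope_mask] - r_nam[scope_mask]) ** 2)
        if change > best_change:
            best_j, best_change = j, change
    return best_j
```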


Multi-GAP Segmentation. For the multi-GAP CNN model, we observed a slight drop in classification accuracy compared with the one-GAP CNN model (see Sect. 3.2), which is expected, since features from shallower layers are more general and less discriminative. In light of this, we further propose a multi-GAP segmentation method that trains both a one-GAP CNN model and a multi-GAP CNN model, so as to combine the discriminative capability of the one-GAP system with the finer localization of the multi-GAP system. Specifically, segmentation is performed on slices classified as "nodule slice" by the one-GAP CNN model, owing to its higher classification accuracy. To define the screening scope for coarse segmentation, we first use the one-GAP NAM to generate a baseline scope C1. If there is a prominent blob C_multi within C1 in the multi-GAP NAM, we define the final scope C as C_multi, to eliminate redundant nodule candidates with more localized spatial constraints. When the multi-GAP NAM fails to identify any discriminative region within C1, the final screening scope C remains C1. The R-NAM of the masked image is generated by the one-GAP CNN model and compared with the one-GAP NAM within C1. Figure 2 illustrates 1-/2-/3-GAP NAMs, the corresponding screening scopes C and the coarse segmentation results on a sample slice. While the multi-GAP NAM enables finer localization, the one-GAP NAM has better discriminative power.

Fig. 2. Illustration of 1-/2-/3-GAP NAMs, the screening scopes C and coarse segmentation results on a sample slice.

3 Experimental Results

3.1 Data and Experimental Setup

Data used in this study contain 1,010 thoracic CT scans from the public LIDC-IDRI database. Details about this database, such as acquisition protocols and quality evaluations, can be found in [6]. Lungs were segmented and each axial slice was cropped to 384 × 384 pixels, centered on the lung mask. Nodules were delineated by up to four experts. Voxel-level annotations are used to generate slice-level labels, and are used as ground truth for segmentation evaluation. Nodules with diameter

[...]

−950 and I_exp(ϕ(x_i)) ≤ −856. The airways and vasculature are segmented by considering only voxels with an HU between −500 HU and −1024 HU in both scans.

2.2 Local Disease and Deformation Distributions

We present the concept of local feature distributions (Fig. 1a and b). The aim is to quantify local abnormalities in lung physiology and pathology to define a signature unique to a patient's disease state. We introduce two models: (1) local disease distributions and (2) local deformation distributions. The disease distributions model the spread of emphysema and fSAD, whilst the deformation distribution characterises local volume change across the lung. They are created by locally sampling regions of Z and J on a Cartesian grid, using local regions of interest (ROIs) Ω_k, where k = 1 · · · K indexes the center voxel of the ROI. The size (r × r × r) of the ROI governs the scale of the sampling. We modelled two properties of disease spread: (1) locally diffuse/dense disease and (2) global homogeneity/heterogeneity. For each ROI centered at z_k, where z ∈ Ω_k, we computed the fractions of PRMemph and PRMfSAD voxels, defined as v_k(emph) and v_k(fSAD). Dense disease occurred when v_k(·) → 1, whilst diffuse disease was present when v_k(·) → 0. The deviation of diffuse and dense regions across the lung defined the heterogeneity/homogeneity of disease spread.


A distribution f(v(·)) for each feature was built by sampling K regions. The shape of the distribution is governed by the two disease properties (Fig. 1a). It provides information on the nature of local disease spread (diffuse or dense) and on whether it is homogeneous or heterogeneous. Expansion of the lung depends on local biomechanical properties (emphysema) and airway resistance (functional small airways disease), which affect lung deformation locally. To capture volume change on a local basis, the Jacobian map J was sampled by calculating the mean Jacobian μ(J)_k for every Ω_k. A distribution f(μ(J)) of these measurements was built to capture local volume change throughout the lung, using the same process as above (Fig. 1b).
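A NumPy sketch of the local sampling described above; the ROI size and grid spacing are given here in voxels (the paper uses 20 mm ROIs spaced every 5 mm), and the array layout and function name are illustrative assumptions.

```python
import numpy as np

def local_feature_distributions(prm_labels, jacobian, roi=20, step=5):
    """Sample ROIs on a Cartesian grid and build the three local distributions.

    prm_labels: 3-D int array Z with PRM classes (assumed: 1 = emph, 2 = fSAD)
    jacobian:   3-D float array J of Jacobian determinants
    roi, step:  ROI edge length and grid spacing in voxels (illustrative values).
    Returns arrays of v_k(emph), v_k(fSAD) and mu(J)_k over all ROIs."""
    v_emph, v_fsad, mu_j = [], [], []
    nx, ny, nz = prm_labels.shape
    h = roi // 2
    for x in range(h, nx - h, step):
        for y in range(h, ny - h, step):
            for z in range(h, nz - h, step):
                sl = (slice(x - h, x + h), slice(y - h, y + h), slice(z - h, z + h))
                block = prm_labels[sl]
                v_emph.append(np.mean(block == 1))   # fraction of PRM_emph voxels
                v_fsad.append(np.mean(block == 2))   # fraction of PRM_fSAD voxels
                mu_j.append(float(np.mean(jacobian[sl])))
    return np.array(v_emph), np.array(v_fsad), np.array(mu_j)
```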


Fig. 1. (a) Local disease and (b) local deformation distributions.

2.3 Manifold Learning of COPD Distributions

We hypothesised that the heterogeneity of COPD can be modelled by the local disease and deformation distributions. Manifold learning can be used to capture variability in the distributions and to learn separate embeddings for emphysema, fSAD and lung deformation. Fusion of these embeddings can then be performed to create various models of COPD.
Distribution Distance. Inter-patient differences are computed using the Earth Mover's Distance (L_EMD) [11]. It is a cross-bin distance metric, which measures the minimum amount of work needed to transform one distribution into another. The distributions are quantised into separate histograms h_{v(emph)}, h_{v(fSAD)} and h_J using N_b bins. They are normalised to sum to 1, such that they have equal mass. A closed-form solution of the L_EMD can be used for one-dimensional distributions with equal mass and bins [7]. It reduces to the L1 norm between the cumulative distributions (H) of two histograms h_{1,(·)} and h_{2,(·)}:

L_{EMD}(h_{1,(·)}, h_{2,(·)}) = \sum_{n}^{N_b} |H_{n,1,(·)} − H_{n,2,(·)}|

Manifold Learning and Fusion. Manifold learning is used to model the emphysema, fSAD and Jacobian distributions. The aim is to capture variations in the


distributions in a population of COPD patients. As emphysema and fSAD occur synchronously and both affect lung function, the manifold fusion framework of Aljabar et al. [1] is employed to create a single representation of these processes. For P subjects, the PRM-classified volumes are Z_1, · · ·, Z_P and their respective Jacobian determinant maps are J = J_1, · · ·, J_P. The distributions are quantised using N_b bins into their respective histograms h_{p,v(emph)}, h_{p,v(fSAD)} and h_{p,J}. Pairwise measures in the population are obtained with the L_EMD, yielding the pairwise matrices M_emph, M_fSAD and M_J. They can be visualised as connected graphs where each node represents a patient and the edge length is the L_EMD. Isomap¹ [12] is applied to each matrix. A K-nearest neighbour search is first performed to create a sparse representation of M_(·), where edges are restricted to the K-nearest neighbourhood of each node. A full pairwise geodesic distance matrix D_(·) is then estimated by analysis of the K-nearest graph of M_(·) using Dijkstra's shortest-path algorithm [3]. The low-dimensional embedding y_p^{(·)}, p = 1, · · ·, P is obtained by minimisation of

\min \sum_{p,j} \left( D_{p,j}^{(·)} − ||y_p^{(·)} − y_j^{(·)}|| \right)^2    (1)

using Multi-Dimensional Scaling. The coordinate embeddings for M_emph, M_fSAD and M_J are y^e, y^f and y^J, with selected dimensions d^e, d^f and d^J. Fusion of the coordinates y^(·) can be performed in any combination to investigate various processes. For simplicity, we consider all embeddings. The coordinates are uniformly scaled with the scale factors s^e, s^f and s^J such that the first component y_1^(·) of each embedding has unit variance. They are concatenated to yield Y = (s^e y^e, s^f y^f, s^J y^J), with dimension d^e + d^f + d^J. A distance matrix M_c is obtained by calculating pairwise Euclidean distances of Y. Isomap is then applied to yield the combined coordinate embedding y^c with dimension d^c.
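The closed-form 1-D EMD and the two-level Isomap fusion can be sketched as below; the sketch assumes a scikit-learn version whose Isomap accepts `metric='precomputed'`, and the function names are our own illustrative choices.

```python
import numpy as np
from sklearn.manifold import Isomap

def emd_1d(h1, h2):
    """Closed-form Earth Mover's Distance between two equal-mass 1-D histograms:
    the L1 norm between their cumulative distributions."""
    return np.abs(np.cumsum(h1) - np.cumsum(h2)).sum()

def pairwise_emd(histograms):
    """P x P matrix of EMD distances between per-patient histograms (P x N_b)."""
    cdf = np.cumsum(histograms, axis=1)
    return np.abs(cdf[:, None, :] - cdf[None, :, :]).sum(axis=2)

def fuse_embeddings(dist_matrices, dims, n_neighbors, d_c):
    """Embed each distance matrix with Isomap, scale, concatenate and re-embed.
    Assumes Isomap(metric='precomputed') is available (illustrative sketch)."""
    parts = []
    for M, d, k in zip(dist_matrices, dims, n_neighbors):
        y = Isomap(n_neighbors=k, n_components=d,
                   metric="precomputed").fit_transform(M)
        parts.append(y / y[:, 0].std())          # unit variance of first component
    Y = np.concatenate(parts, axis=1)
    D_c = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return Isomap(n_neighbors=max(n_neighbors), n_components=d_c,
                  metric="precomputed").fit_transform(D_c)
```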

3 Experiments

3.1 Data Processing

A total of 1,154 scans of COPD patients (GOLD ≥ 1) were downloaded from COPDGene [10]. They were acquired on various scanners (GE Medical Systems, Siemens and Philips) with the following reconstruction algorithms: STANDARD (GE), AS+ B31f and B31f (Siemens), and 64 B (Philips). The Pulmonary Toolkit² was used for lung segmentation. Breath-hold scans were registered with NiftyReg [9] using a modified version of the EMPIRE10 pipeline [8]. The transformation was a stationary velocity field parameterised by a cubic B-spline, and the similarity measure was MIND [6]. The constraint term was the bending energy of the velocity field, weighted at 1% for all stages of the pipeline.

¹ lvdmaaten.github.io/drtoolbox/
² github.com/tomdoel/pulmonarytoolkit


After manual inspection of the registrations, 743 patients were selected. Scans were rejected if there were major errors close to the fissures or the lung boundary. The sampling size of the ROIs was r = 20 mm, consistent with the size of the secondary pulmonary lobule. Sampling was performed on a Cartesian grid of center voxels spaced every 5 mm. We chose a value of N_b = 60, as its effect on pairwise distances was minimal when increasing N_b beyond 50. The dimensionality d of y and the parameter K for each embedding were determined by estimating the reconstruction quality of the lower-dimensional coordinates. The residual variance 1 − ρ²_{M,y} between the distances in M_(·) and the pairwise distances of y^(·) was considered. For each embedding step (y^e, y^f and y^J), we determined the combination of K and d that minimised the residual variance. Grid-search parameters were set to d* ∈ [1, 5] and K* ∈ [5, 100]. Final parameters were K = [50, 30, 45] and d = [5, 5, 4] for y^e, y^f and y^J. We considered a model of the disease distributions (y^e, y^f → y^c1) and a model also including the deformation (y^e, y^f, y^J → y^c2). Parameters for both models were K_c1 = 55 and K_c2 = 60, with d_c1 = 4 and d_c2 = 4.

Table 1. Pearson correlation coefficients between the first three embedding coordinates and the distributions, characterised by the median (ϕ), median absolute deviation (ρ), skewness (γ1) and kurtosis (γ2). [∗ = p < 0.05, † = p < 10−3]
(Rows: ϕ, ρ, γ1, γ2; column groups: PRMemph (y1e, y2e, y3e), PRMfSAD (y1f, y2f, y3f), J (y1J, y2J, y3J). The individual correlation entries are not reliably recoverable from this extraction.)

3.2 Associations with Disease Severity

Correlations between the embeddings and the distribution moments were computed (Table 1). The first and second components of the embeddings had strong to moderate correlations with the distribution parameters, demonstrating that manifold learning of the distributions modelled the variation in the population. We considered several models to predict COPD severity using FEV1 %predicted and FEV1/FVC (Table 2). We considered three simple models (mean PRMemph, mean PRMfSAD and mean Jacobian μ(J)) and compared them to univariate and multivariate models of the embedding coordinates (y). The univariate models (y_1^(e,f)) showed moderate improvement over the simple mean models. However, the combined models (y_1^c1 and y_1^c2) improved model prediction. The multivariate models demonstrated the best performance, with model 2 (y^c2 = y^e + y^f + y^J) performing best, even after adjusting for the increase in the number of variables. It had a Bayesian Information Criterion (BIC) of 620, compared to 625 (y^c1) and 633, 650 and 648 for


PRMemph, PRMfSAD and μ(J), respectively. The increase in explanatory power was also seen when correlating the first component of the combined models (y_1^{c1,c2}) with FEV1 %predicted. The first components of the combined models had Pearson coefficients of r = 0.67, p < 0.001 and r = 0.70, p < 0.001, respectively. Coefficients for the mean models were r = −0.63, p < 0.001, r = −0.50, p < 0.001 and r = 0.52, p < 0.001, respectively. We also used manifold fusion to create one joint model between the mean values of PRMemph and PRMfSAD, and a second with PRMemph, PRMfSAD and μ(J). Pairwise mean differences were used to create M_(·). Correlation of the first component was r = 0.60, p < 0.001 and r = −0.65, p < 0.001, respectively. This corroborates the utility of combining embeddings based on the local distributions (y_1^c2 → r = 0.70, p < 0.001) (Fig. 2).


Fig. 2. Projection of embeddings (a) y^c1 and (b) y^c2, with FEV1 %predicted overlaid.

Table 2. Regression of models versus various clinical measures of COPD severity. Model performance quoted as adjusted-r². [† = p < 10−3]

Y           | Mean features         | Univariate                          | Multivariate
            | PRMe   PRMf   µ(J)    | y1c1   y1c2   y1e    y1f    y1J     | yc1    yc2    ye     yf     yJ
FEV1 %p     | 0.40†  0.25†  0.26†   | 0.45†  0.49†  0.42†  0.29†  0.13†   | 0.48†  0.51†  0.43†  0.34†  0.14†
FEV1/FVC    | 0.51†  0.30†  0.22†   | 0.54†  0.53†  0.54†  0.32†  0.09†   | 0.59†  0.60†  0.55†  0.38†  0.10†

3.3 Trajectories of Emphysema and fSAD Progression

It is likely that trajectories of disease progression in COPD vary depending on the dominant disease phenotype. We assessed whether these can be modelled in the tissue disease model (y^c1). We parameterised y^c1 using the emphysema and fSAD distributions as covariates l with kernel regression, y^c(l_{(·)}) = (1/v) \sum_i K(l_i − l) y_i^c, where K is a Gaussian kernel and v is a normalisation constant. The covariate was the L_EMD between the distributions and an idealised healthy distribution


(distribution peak at v = 0). The outcome is two trajectories in the manifold space (Fig. 3a). The emphysema trajectory can be considered as the path taken when emphysema progression is dominant, and vice-versa for fSAD. We classified patients based on these trajectories: a patient is considered to follow the emphysema progression trajectory if it lies closest to y^c(l(emph)). At baseline, patients are classified into both emphysema and fSAD subtypes. When considering the two sets of patients stratified by trajectory, the explanatory power of the embeddings improved in comparison to y^c1 (Table 2). The emphysema regression produced an adjusted-r² of 0.52 and 0.63 when predicting FEV1 %predicted and FEV1/FVC respectively, whilst fSAD gave 0.45 and 0.62.
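The trajectory parameterisation is a Nadaraya-Watson style kernel regression over the embedding coordinates. A minimal sketch follows, with the kernel bandwidth as an assumed free parameter (it is not specified in the text):

```python
import numpy as np

def trajectory(covariate, embedding, l_grid, bandwidth=1.0):
    """Kernel-regressed trajectory y^c(l) = (1/v) * sum_i K(l_i - l) * y_i^c.

    covariate:  (P,) per-patient covariate l_i (e.g. EMD to a healthy distribution)
    embedding:  (P, d) combined embedding coordinates y_i^c
    l_grid:     (G,) covariate values at which to evaluate the trajectory
    bandwidth:  Gaussian kernel width (assumed).  Illustrative sketch."""
    out = np.zeros((len(l_grid), embedding.shape[1]))
    for g, l in enumerate(l_grid):
        k = np.exp(-0.5 * ((covariate - l) / bandwidth) ** 2)    # Gaussian kernel K
        out[g] = (k[:, None] * embedding).sum(axis=0) / k.sum()  # normalised by v
    return out
```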


Fig. 3. (a) Three-dimensional projection of y^c1 and (b) classified trajectories of y^c1.

4 Discussion and Conclusion

We have presented a method to parameterise distributions of various local features implicated in COPD progression. The disease distributions model local aspects of tissue destruction whilst modelling global properties of heterogeneity and homogeneity. The deformation distribution quantifies the local effect of disease on lung function. Patients exhibiting different mechanisms of tissue destruction can have identical global averages yet can display different disease distributions. These differences are likely to cause differences in local biomechanical properties, which are captured by the deformation distribution. We have shown that models of the proposed distributions better predict COPD severity than conventional metrics (Table 2). We have shown that embeddings based on distribution dissimilarities have stronger correlations with FEV1 %predicted than those learned from mean differences. Both these results suggest that the position of a patient in the manifold space of y c1 or y c2 is critical for assessing COPD. This was observed in the trajectory classification (Fig. 3). Determining the trajectory that a patient is following may help inform therapeutic decisions and improve our understanding of COPD progression.


The complexity of the modelling may be increased to capture more specific information about lung pathophysiology. Separate manifolds can be produced on a lobar basis. This is likely to further increase the explanatory power of the models, since inter-lobar disease metrics correlate with different aspects of physiology. The detection of regional differences in local deformation may add further important information regarding the pathophysiology of a patient.

Acknowledgements. This work was supported by the EPSRC under Grants EP/H046410/1 and EP/K502959/1, and the UCLH NIHR RCF Senior Investigator Award under Grant RCF107/DH/2014. It used data (phs000179.v3.p2) from the COPDGene study, supported by NIH Grants U01HL089856 and U01HL089897.

References
1. Aljabar, P., Wolz, R., Srinivasan, L., Counsell, S.J., Rutherford, M.A., Edwards, A.D., Hajnal, J.V., Rueckert, D.: A combined manifold learning analysis of shape and appearance to characterize neonatal brain development. IEEE Trans. Med. Imaging 30(12), 2072–2086 (2011)
2. Bragman, F.J.S., McClelland, J.R., Modat, M., Ourselin, S., Hurst, J.R., Hawkes, D.J.: Multi-scale analysis of imaging features and its use in the study of COPD exacerbation susceptible phenotypes. In: Golland, P., Hata, N., Barillot, C., Hornegger, J., Howe, R. (eds.) MICCAI 2014. LNCS, vol. 8675, pp. 417–424. Springer, Cham (2014). doi:10.1007/978-3-319-10443-0_53
3. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische Mathematik 1(1), 269–271 (1959)
4. Galbán, C.J., Han, M.K., Boes, J.L., Chughtai, K.A., Charles, R., Johnson, T.D., Galbán, S., Rehemtulla, A., Kazerooni, E.A., Martinez, F.J., Ross, B.D.: CT-based biomarker provides unique signature for diagnosis of COPD phenotypes and disease progression. Nat. Med. 18(11), 1711–1715 (2013)
5. Harmouche, R., Ross, J.C., Diaz, A.A., Washko, G.R., Estepar, R.S.J.: A robust emphysema severity measure based on disease subtypes. Acad. Radiol. 23(4), 421–428 (2016)
6. Heinrich, M.P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, F.V., Brady, M., Schnabel, J.A.: MIND: modality independent neighbourhood descriptor for multi-modal deformable registration. Med. Image Anal. 16(7), 1423–1435 (2012)
7. Levina, E., Bickel, P.: The earth mover's distance is the Mallows distance: some insights from statistics. Eighth IEEE Int. Conf. Comput. Vis. 2, 251–256 (2001)
8. Modat, M., McClelland, J., Ourselin, S.: Lung registration using the NiftyReg package. In: Medical Image Analysis for the Clinic: A Grand Challenge (EMPIRE10), pp. 33–42 (2010)
9. Modat, M., Ridgway, G.R., Taylor, Z.A., Lehmann, M., Barnes, J., Hawkes, D.J., Fox, N.C., Ourselin, S.: Fast free-form deformation using graphics processing units. Comput. Methods Programs Biomed. 98(3), 278–284 (2010)
10. Regan, E.A., Hokanson, J.E., Murphy, J.R., Make, B., Lynch, D.A., Beaty, T.H., Curran-Everett, D., Silverman, E.K., Crapo, J.D.: Genetic epidemiology of COPD (COPDGene) study design. COPD 7(1), 32–43 (2010)
11. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vis. 40(2), 99–121 (2000)
12. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)

Hybrid Mass Detection in Breast MRI Combining Unsupervised Saliency Analysis and Deep Learning

Guy Amit(✉), Omer Hadad, Sharon Alpert, Tal Tlusty, Yaniv Gur, Rami Ben-Ari, and Sharbell Hashoul

Haifa University Campus, 3498825 Mount Carmel, Haifa, Israel
[email protected]

Abstract. To interpret a breast MRI study, a radiologist has to examine over 1000 images, and integrate spatial and temporal information from multiple sequences. The automated detection and classification of suspicious lesions can help reduce the workload and improve accuracy. We describe a hybrid mass-detection algorithm that combines unsupervised candidate detection with deep learning-based classification. The detection algorithm first identifies image-salient regions, as well as regions that are cross-salient with respect to the contralateral breast image. We then use a convolutional neural network (CNN) to classify the detected candidates into true-positive and false-positive masses. The network uses a novel multi-channel image representation; this representation encompasses information from the anatomical and kinetic image features, as well as saliency maps. We evaluated our algorithm on a dataset of MRI studies from 171 patients, with 1957 annotated slices of malignant (59%) and benign (41%) masses. Unsupervised saliency-based detection provided a sensitivity of 0.96 with 9.7 false-positive detections per slice. Combined with CNN classification, the number of false-positive detections dropped to 0.7 per slice, with 0.85 sensitivity. The multi-channel representation achieved higher classification performance compared to single-channel images. The combination of domain-specific unsupervised methods and general-purpose supervised learning offers advantages for medical imaging applications, and may improve the ability of automated algorithms to assist radiologists.

Keywords: Breast MRI · Lesion detection · Saliency · Deep learning

1 Introduction

Magnetic Resonance Imaging (MRI) of the breast is widely used as a screening examination for women at high risk of breast cancer. A typical breast MRI study consists of 1000 to 1500 images, which require a lot of time for interpretation and reporting. Computer-assisted interpretation can potentially reduce the radiologist's workload by automating some of the diagnostic tasks, such as lesion detection. Previous studies addressed automatic lesion detection in breast MRI using a variety of methods. These can be broadly categorized into image processing approaches [1–3], machine learning approaches [4, 5], or a combination of both [6–8]. Deep convolutional neural networks (CNN) have been previously applied to breast MRI images for mass classification [9],


as well as parts of an automated lesion segmentation pipeline [10]. Cross-saliency analysis has been shown to be advantageous for unsupervised asymmetry detection, and was applied to lesion detection in breast mammograms and brain MRI [11]. We describe a new hybrid framework for the automatic detection of breast lesions in dynamic contrast-enhanced (DCE) MRI studies. The framework combines unsupervised candidate proposals, obtained by analyzing salient image areas, with a CNN classifier that filters out false detections using multiple image channels. Our work comprises four major contributions: (1) an unsupervised lesion detection algorithm, using patch-based distinctiveness within the entire breast and between the left and right breasts; (2) a new multi-channel representation for DCE MRI images, which compactly captures anatomical, kinetic, and saliency features in a single image that can be fed to a deep neural network; (3) a hybrid lesion detection framework that provides high detection accuracy and a low false positive rate by combining the unsupervised detection and the multi-channel representation with a deep neural network classifier; (4) an evaluation of the proposed methods on a large dataset of MRI studies, including publicly available data.

2 Methods

2.1 Datasets
To train and evaluate our system, we used a dataset of 193 breast MRI studies from 171 female patients, acquired with a variety of acquisition devices and protocols. Two publicly available resources [12–14] provided the data of 78 patients (46%). Each study included axial DCE T1 sequences with one pre-contrast series and at least 3 post-contrast series. Three breast radiologists interpreted the studies using image information alone, without considering any clinical data. Each identified lesion was assigned a BI-RADS score. The boundaries of the lesions were manually delineated on each relevant slice, with an average of 11 ± 10 annotated slices per patient. Overall, there were 1957 annotated lesion contours in 1845 slices; 59% of them were labeled as malignant (BI-RADS 4/5) and 41% were labeled as benign (BI-RADS 2/3). The average lesion size was 319 ± 594 mm². We partitioned the patients into training (75%, 128 patients, 1326 slices) and testing (25%, 43 patients, 519 slices) subsets. The partitioning was random, while ensuring a similar distribution of benign and malignant lesions in each of the subsets.
2.2 Image Representation
We used the multi-channel image representation previously described in [9]. The rationale of this representation is to capture both the anatomical and the metabolic characteristics of the lesion in a single multi-channel image. Figure 1 shows the three image channels that represent the DCE study: (1) peak enhancement intensity channel; (2) contrast uptake channel: the difference between the peak enhancement and baseline images; (3) contrast washout channel: the difference between the early and the delayed contrast images.
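A minimal sketch of assembling the three-channel DCE representation from registered temporal series; which post-contrast series represent the "peak", "early" and "delayed" time points is an assumption for illustration.

```python
import numpy as np

def dce_three_channel(pre, peak, early, delayed):
    """Stack peak enhancement, contrast uptake and contrast washout into one image.

    pre, peak, early, delayed: 2-D slices (H, W) from the pre-contrast,
    peak-enhancement, early post-contrast and delayed post-contrast series,
    assumed co-registered.  Returns an (H, W, 3) float image.  Illustrative sketch."""
    uptake = peak - pre          # contrast uptake channel
    washout = early - delayed    # contrast washout channel
    return np.stack([peak, uptake, washout], axis=-1).astype(np.float32)
```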


Fig. 1. Axial DCE T1 sequences acquired before contrast injection (a), at peak enhancement (b), and after contrast washout (c). The kinetic graph (d) shows the pattern of contrast uptake and the temporal location of each sequence. The multi-channel image representation combines the peak-enhancement image (b) with the contrast uptake image (e) and the contrast washout image (f).

2.3 Lesion Detection Framework
Figure 2 shows the components of the lesion detection framework.


Fig. 2. Analysis steps of the lesion detection framework (top) and their outputs (bottom)

Image Preprocessing and Segmentation. The two-dimensional slice images were normalized to reduce the data variability due to the mixture of studies from different sources. For each of the data subsets, the global 1% and 99% percentiles of pixel intensity were calculated for each channel, and contrast stretching was applied to convert all images to the same dynamic range. The breast area was segmented using U-Net, a fully convolutional network designed for medical image


segmentation [15]. The network was implemented using Lasagne, a Python framework built on top of Theano. We trained the network on a subset of slices with manually delineated breast contours. The training process of 20 epochs with a batch size of 4 images required about 1 h on a Titan X NVIDIA GPU. The remainder of the detection pipeline was applied only to the region within the segmented breast.
Saliency Analysis. For each MRI slice image, two patch-based saliency maps were created [11]: (1) patch distinctiveness and (2) contralateral breast patch flow. The patch distinctiveness saliency map is generated by computing the L1 distance between each patch and the average patch along the principal components of all image patches [16]. For a given vectorized patch p_{x,y} around the point (x, y), this measure is given by:

PD(p_{x,y}) = \sum_{k=1}^{n} |p_{x,y} \omega_k^T|    (1)

where \omega_k^T is the k-th principal component of the entire image patch distribution.
Contralateral patch flow calculates the flow field between patches of the left and right breasts, using the PatchMatch algorithm [17]. The algorithm uses the smooth motion field assumption to compute a dense flow field for each pixel, by considering a k × k patch around it. Initially, for each pixel location (x, y), it assigns a random displacement vector T that marks the location of its corresponding patch in the other image. The quality of the displacement T is then measured by computing the L2 distance between the corresponding patches:

D(p_{x,y}, p_{x+T_x, y+T_y}) = \sum_{i=x-k/2}^{x+k/2} \sum_{j=y-k/2}^{y+k/2} \| I(i, j) - I(i + T_x, j + T_y) \|_2    (2)

The algorithm attempts to improve the displacement for each location by testing new hypotheses generated from the displacement vectors of neighboring patches in the same image. The algorithm progresses by iterating between the random and propagation steps, keeping the location of the best estimate according to the L2 distance. We applied the algorithm to find, for each patch in the source image, the corresponding nearest-neighbor patch in the target image. The nearest-neighbor error (NNE) was used to estimate the cross-image distinctiveness of the patch:

NNE(p_{x,y}) = \min_T D(p_{x,y}, p_{x+T_x, y+T_y})    (3)
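A compact sketch of the patch-distinctiveness saliency of Eq. (1), using PCA over all image patches; the patch size, the number of components and the use of scikit-learn's PCA are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.image import extract_patches_2d

def patch_distinctiveness(image, patch_size=7, n_components=25):
    """Per-patch saliency PD(p): L1 distance of each patch to the average patch
    along the principal components of the patch distribution.  Illustrative sketch."""
    patches = extract_patches_2d(image, (patch_size, patch_size))
    flat = patches.reshape(len(patches), -1)
    pca = PCA(n_components=n_components).fit(flat)
    # |projection onto each principal axis|, summed over components.
    scores = np.abs(pca.transform(flat)).sum(axis=1)
    h = image.shape[0] - patch_size + 1
    w = image.shape[1] - patch_size + 1
    return scores.reshape(h, w)
```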

Candidate Region Detection. Candidate regions were detected on the saliency maps using a scale-invariant algorithm that searches for regions with a high density of saliency values. For a given range of window sizes (w_i, h_j) and a set of threshold values {t_1, t_2, ..., t_n}, the algorithm efficiently computes, for each pixel (x, y) and a region s_{x,y} of size w_i × h_j around it:

Score(x, y) = \max_{w_i, h_j, t_k} \frac{\sum_{s_{x,y}} (s_{x,y} > t_k)}{w_i \cdot h_j}    (4)

Non-maximal suppression was then applied to the Score image to obtain the locations of all local maxima. We used window sizes in the range of 5 to 50 pixels and normalized threshold values from 0.3 to 0.9. The region detection algorithm was applied to each of the two saliency maps, producing two binary detection masks, which were combined with an 'or' operator to generate the candidate detections per slice.
CNN-Based Candidate Classification. The detected candidates were cropped from their slice images using square bounding boxes, extended by 20% to ensure that the entire lesion was included in the cropped image. The extracted lesion images were resized to fit the CNN input, and 32 × 32 × 5 multi-channel images were created with 3 channels of the DCE image and 2 channels of the corresponding saliency maps. The CNN architecture consisted of 9 convolutional layers in 3 consecutive blocks, similar to [18]. The first block had two 5 × 5 × 32 filters with ReLU layers followed by a max-pooling layer, the second block had four 5 × 5 × 32 filters with ReLU layers followed by an average-pooling layer, and the final block had three convolutional layers of size 5 × 5 × 64, 6 × 6 × 64, and 3 × 3 × 64, respectively, each followed by a ReLU layer. The network was terminated by a fully-connected layer with 128 neurons and a softmax loss layer. The network output assigned either a 'mass' or a 'non-mass' label to each bounding box. As the training data were unbalanced, with many more examples of 'non-mass' regions, we trained an ensemble of 10 networks, each with a different random sample of 'non-mass' regions. Majority voting of the ensemble determined the final classification.
For each slice, the output of the framework was a binary detection map containing the regions that were proposed by the saliency analysis and classified as 'mass' by the CNNs. The detection output per study was generated by summing the slice detection maps along the longitudinal axis. This produced a projected heatmap showing the spatial concentration of detected regions. Thresholding this heatmap was used to further reject false detections.
2.4 Experiments
We trained the ensemble of convolutional networks on a set of 1564 bounding boxes of masses and 11,286 bounding boxes of non-masses, detected by the saliency analysis. The training set was augmented by adding three rotated and two flipped variants of each image. The networks were trained using MatConvNet, with a stochastic gradient descent solver and a momentum of 0.9. The average training time for 100 epochs was 20 min on an NVIDIA Titan X Black GPU. To evaluate the performance of the detection framework on the test set of 43 patients, all DCE slices of the test studies were processed by the entire pipeline. Overall, there were 5420 test slices, an average of 126 slices per patient. We compared the detection maps per slice and per study to the annotated ground truth.
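The density score of Eq. (4) can be computed efficiently with one integral image per threshold; the sketch below is an illustrative implementation restricted to square windows.

```python
import numpy as np

def saliency_density_score(saliency, window_sizes=(5, 15, 30, 50),
                           thresholds=np.linspace(0.3, 0.9, 7)):
    """Scale-invariant density score of Eq. (4): for every pixel, the best fraction
    of above-threshold saliency values over all window sizes and thresholds.
    Illustrative sketch using an integral image per threshold."""
    h, w = saliency.shape
    score = np.zeros((h, w))
    for t in thresholds:
        mask = (saliency > t).astype(np.float64)
        integral = np.pad(mask.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
        for size in window_sizes:
            r = size // 2
            ys = np.clip(np.arange(h) - r, 0, h)
            ye = np.clip(np.arange(h) + r + 1, 0, h)
            xs = np.clip(np.arange(w) - r, 0, w)
            xe = np.clip(np.arange(w) + r + 1, 0, w)
            # Sum of the thresholded mask inside each window via the integral image.
            s = (integral[ye[:, None], xe[None, :]] - integral[ys[:, None], xe[None, :]]
                 - integral[ye[:, None], xs[None, :]] + integral[ys[:, None], xs[None, :]])
            score = np.maximum(score, s / float(size * size))
    return score
```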


3 Results

The unsupervised saliency analysis correctly detected 0.96 of true lesions in the entire dataset, with an average of 9.7 false positive detections per slice. The detection rates on the training and testing sets were similar. The average accuracy of mass/non-mass classification, obtained by the CNN on the validation set during the training process, was 0.86 ± 0.02, with an area under the receiver operating characteristic curve (AUC) of 0.94 ± 0.01.

Table 1. Classification performance using single and multi-channel image representations

Channels | Accuracy    | AUC
1        | 0.79 ± 0.05 | 0.88 ± 0.05
3        | 0.84 ± 0.03 | 0.92 ± 0.03
5        | 0.86 ± 0.02 | 0.94 ± 0.01

Table 2. Evaluation results on the test set

Method                      | Patients/studies | Slices/lesions | Sensitivity | False positives
Saliency-based detection    | 43/50            | 5420/625       | 0.96        | 9.7/slice
Slice-based CNN detection^a | 43/50            | 5420/625       | 0.85        | 0.7/slice
Study-based CNN detection^a | 43/50            | 5420/625       | 0.98        | 7/study
^a End-to-end analysis, which includes the unsupervised stage.

Fig. 3. An example of true-positive and false-negative detection. A BI-RADS 5 invasive ductal carcinoma lesion in the right breast, shown in two peak-enhancement slices of the same sequence (a, c), along with the corresponding cross-saliency maps (b, d) and the ground-truth contour (red). In (b), the detection algorithm identified the region of the lesion, and correctly classified it (green), while rejecting false detections (yellow). The same lesion at a consecutive slice was missed by the saliency analysis (d).


Table 1 shows that training with the 5-channel image representation achieved the highest accuracy. The evaluation of the entire detection framework on the test set slices yielded a sensitivity of 0.85 with 0.7 false-positives per slice. The CNN was able to reject 89% of the false candidate regions detected by the saliency analysis. Comparing the detection heatmaps per study with the projected ground-truth showed an improved sensitivity of 0.98 with an average of 7 false-positive detections per study (Table 2). Figure 3 shows a representative example of true-positive and false-negative detections.

4 Discussion

Despite the potential of computer-assisted algorithms to improve the reading process of breast imaging studies, and although existing systems have been shown to improve sensitivity, they have not significantly affected the diagnostic accuracy (AUC) or the interpretation times of both novice and experienced radiologists [19]. Recent advancements in deep learning technologies provide an opportunity to develop learning-based solutions that will effectively support the diagnostic process. However, as most deep learning research has focused on natural images, using deep networks as a 'black-box' solution for detection and classification problems in the medical imaging domain may not provide optimal results. Our suggested hybrid approach, which combines tailored computer vision algorithms of cross-saliency analysis with deep network classifiers, allows the incorporation of domain-specific insights about the data to improve performance. In our framework, both the multi-channel representation of DCE images and the cross-saliency analysis of left and right breasts are examples of such insights.

The great majority of published work on lesion detection in breast MRI uses proprietary datasets, typically small in size, which could not be used as a common benchmark for comparison. The reported results show a large variability in sensitivity and false positive rate, ranging from 1 to 26 false positives per study at a sensitivity range of 0.89 to 1.0 [1, 2, 10]. An objective performance comparison between methods requires publicly available large datasets with ground-truth annotations. In this work, we used MRI studies from the TCIA repository [12], enriched by additional proprietary studies. Additionally, sharing domain-specific deep learning models that have been trained on proprietary data may be an alternative mechanism for collaboration among medical imaging researchers. Our current work is missing a comparison to state-of-the-art detection methods such as Faster R-CNN, which provides both region proposals and classification within the same convolutional network. The plans for our future work include such a comparison to test our hypothesis on the advantages of the hybrid approach.

In conclusion, we propose a combination of multi-channel image representation, unsupervised candidate proposals by saliency analysis, and deep network classification to provide automatic lesion detection in breast MRI with high sensitivity and a low false-positive rate. When evaluated on a large set of studies, this method could facilitate the incorporation of cognitive technologies into the radiology workflow.


References

1. Vignati, A., Giannini, V., et al.: A fully automatic lesion detection method for DCE-MRI fat-suppressed breast images, 26 February 2009
2. Ertas, G., Doran, S., Leach, M.O.: Computerized detection of breast lesions in multi-centre and multi-instrument DCE-MR data using 3D principal component maps and template matching. Phys. Med. Biol. 56, 7795–7811 (2011)
3. McClymont, D., Mehnert, A., et al.: Fully automatic lesion segmentation in breast MRI using mean-shift and graph-cuts on a region adjacency graph. J. Magn. Reson. Imaging 39, 795–804 (2014)
4. Gallego-Ortiz, C., Martel, A.L.: Improving the accuracy of computer-aided diagnosis for breast MR imaging by differentiating between mass and nonmass lesions. Radiology 278, 679–688 (2016)
5. Agner, S.C., Soman, S., et al.: Textural kinetics: a novel dynamic contrast-enhanced (DCE)-MRI feature for breast lesion classification. J. Digit. Imaging 24, 446–463 (2011)
6. Ertaş, G., Gülçür, H.Ö., et al.: Breast MR segmentation and lesion detection with cellular neural networks and 3D template matching. Comput. Biol. Med. 38, 116–126 (2008)
7. Renz, D.M., Böttcher, J., et al.: Detection and classification of contrast-enhancing masses by a fully automatic computer-assisted diagnosis system for breast MRI. J. Magn. Reson. Imaging 35, 1077–1088 (2012)
8. Pang, Z., Zhu, D., Chen, D., Li, L., Shao, Y.: A computer-aided diagnosis system for dynamic contrast-enhanced MR images based on level set segmentation and ReliefF feature selection. Comput. Math. Methods Med. 2015, Article ID 450531 (2015). doi:10.1155/2015/450531
9. Amit, G., Ben-Ari, R., Hadad, O., Monovich, E., Granot, N., Hashoul, S.: Classification of breast MRI lesions using small-size training sets: comparison of deep learning approaches. In: Proceedings of SPIE Medical Imaging, vol. 10134 (2017)
10. Wu, H., Gallego-Ortiz, C., Martel, A.L.: Deep artificial neural network approach to automated lesion segmentation in breast DCE-MRI. In: Proceedings of the 3rd MICCAI Workshop on Breast Image Analysis, pp. 73–80 (2015)
11. Erihov, M., Alpert, S., Kisilev, P., Hashoul, S.: A cross saliency approach to asymmetry-based tumor detection (2015)
12. Clark, K., Vendt, B., et al.: The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J. Digit. Imaging 26, 1045–1057 (2013)
13. Lingle, W., Erickson, B.J., et al.: Radiology data from The Cancer Genome Atlas Breast Invasive Carcinoma [TCGA-BRCA] collection. http://doi.org/10.7937/K9/TCIA.2016.AB2NAZRP
14. Bloch, B.N., Jain, A., Jaffe, C.C.: Data from breast-diagnosis. http://doi.org/10.7937/K9/TCIA.2015.SDNRQXXR
15. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10.1007/978-3-319-24574-4_28
16. Margolin, R., Tal, A., Zelnik-Manor, L.: What makes a patch distinct? In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1139–1146. IEEE (2013)
17. Barnes, C., Shechtman, E., Goldman, D.B., Finkelstein, A.: The generalized PatchMatch correspondence algorithm. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6313, pp. 29–43. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15558-1_3


18. Hadad, O., Bakalo, R., Ben-Ari, R., Hashoul, S., Amit, G.: Classification of breast lesions using cross-modal deep learning. In: IEEE International Symposium on Biomedical Imaging (ISBI) (2017)
19. Lehman, C.D., Blume, J.D., et al.: Accuracy and interpretation time of computer-aided detection among novice and experienced breast MRI readers. Am. J. Roentgenol. 200, W683–W689 (2013)

Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram Classification

Wentao Zhu, Qi Lou, Yeeleng Scott Vang, and Xiaohui Xie

Department of Computer Science, University of California, Irvine, Irvine, USA
{wentaoz1,xhx}@ics.uci.edu, {qlou,ysvang}@uci.edu

Abstract. Mammogram classification is directly related to computer-aided diagnosis of breast cancer. Traditional methods rely on regions of interest (ROIs), which require great effort to annotate. Inspired by the success of using deep convolutional features for natural image analysis and multi-instance learning (MIL) for labeling a set of instances/patches, we propose end-to-end trained deep multi-instance networks for mass classification based on the whole mammogram without the aforementioned ROIs. We explore three different schemes to construct deep multi-instance networks for whole mammogram classification. Experimental results on the INbreast dataset demonstrate the robustness of the proposed networks compared to previous work using segmentation and detection annotations. (Code: https://github.com/wentaozhu/deep-mil-for-whole-mammogram-classification.git)

Keywords: Deep multi-instance learning · Whole mammogram classification · Max pooling-based MIL · Label assignment-based MIL · Sparse MIL

1 Introduction

According to the American Cancer Society, breast cancer is the most frequently diagnosed solid cancer and the second leading cause of cancer death among U.S. women [1]. Mammogram screening has been demonstrated to be an effective way for early detection and diagnosis, which can significantly decrease breast cancer mortality [15]. Traditional mammogram classification requires extra annotations such as bounding boxes for detection or mask ground truth for segmentation [5,11,17]. Other work has employed different deep networks to detect ROIs and obtain mass boundaries in different stages [6]. However, these methods require hand-crafted features to complement the system [12], and training data annotated with bounding boxes and segmentation ground truths, which require expert domain knowledge and costly effort to obtain. In addition, multi-stage training cannot fully explore the power of deep networks. Due to the high cost of annotation, we intend to perform classification based on a raw whole mammogram.


Each patch of a mammogram can be treated as an instance, and a whole mammogram is treated as a bag of instances. The whole mammogram classification problem can then be thought of as a standard MIL problem. Due to the great representation power of deep features [9,19–21], combining MIL with deep neural networks is an emerging topic. Yan et al. used deep MIL to find discriminative patches for body part recognition [18]. Patch-based CNN added a new layer after the last layer of deep MIL to learn the fusion model for multi-instance predictions [10]. Shen et al. employed two-stage training to learn deep multi-instance networks for pre-detected lung nodule classification [16]. The above approaches used max pooling to model the general multi-instance assumption, which only considers the patch of max probability. In this paper, more effective task-related deep multi-instance models with end-to-end training are explored for whole mammogram classification. We investigate three different schemes, i.e., max pooling, label assignment, and sparsity, to perform deep MIL for the whole mammogram classification task.

The framework for our proposed end-to-end trained deep MIL for mammogram classification is shown in Fig. 1. To fully explore the power of deep MIL, we convert the traditional MIL assumption into a label assignment problem. As a mass typically composes only 2% of a whole mammogram (see Fig. 2), we further propose sparse deep MIL. The proposed deep multi-instance networks are shown to provide robust performance for whole mammogram classification on the INbreast dataset [14].

Fig. 1. The framework of whole mammogram classification. First, we use Otsu’s segmentation to remove the background and resize the mammogram to 227 × 227. Second, the deep MIL accepts the resized mammogram as input to the convolutional layers. Here we use the convolutional layers in AlexNet [13]. Third, logistic regression with weight sharing over different patches is employed for the malignant probability of each position from the convolutional neural network (CNN) feature maps of high channel dimensions. Then the responses of the instances/patches are ranked. Lastly, the learning loss is calculated using max pooling loss, label assignment, or sparsity loss for the three different schemes.


Fig. 2. Histograms of mass width (a) and height (b), mammogram width (c) and height (d). Compared to the size of the whole mammogram (1,474 × 3,086 on average after cropping), the mass of average size (329 × 325) is tiny, and takes about 2% of a whole mammogram.

2 Deep MIL for Whole Mammogram Mass Classification

Unlike other deep multi-instance networks [10,18], we use a CNN to efficiently obtain features of all patches (instances) at the same time. Given an image I, we obtain a much smaller feature map F with N_c channels after multiple convolutional layers and max pooling layers. F_{i,j,:} represents the deep CNN features for a patch Q_{i,j} in I, where i, j denote the pixel row and column index respectively, and : denotes the channel dimension. The goal of our work is to predict whether a whole mammogram contains a malignant mass (BI-RADS ∈ {4, 5, 6} as positive) or not, which is a standard binary classification problem. We add a logistic regression with weights shared across all the pixel positions following F, and an element-wise sigmoid activation function is applied to the output. The malignant probability of feature space pixel (i, j) is

r_{i,j} = \mathrm{sigmoid}(a \cdot F_{i,j,:} + b),    (1)

where a is the weight vector of the logistic regression, b is the bias, and · is the inner product of the two vectors a and F_{i,j,:}. The a and b are shared across different pixel positions i, j. We can combine the r_{i,j} into a matrix r = (r_{i,j}) of range [0, 1] denoting the probabilities of patches being malignant masses. The r can be flattened into a one-dimensional vector as r = (r_1, r_2, ..., r_m) corresponding to flattened patches (Q_1, Q_2, ..., Q_m), where m is the number of patches.
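As a minimal NumPy sketch of Eq. 1 and the subsequent ranking (variable names are illustrative; in the actual model these operations sit inside the network and are trained end-to-end):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def patch_probabilities(F, a, b):
    """Shared logistic regression over a CNN feature map (Eq. 1).
    F: feature map of shape (H, W, Nc); a: weight vector of length Nc; b: scalar.
    Returns the flattened, descending-sorted patch probabilities r'."""
    r = sigmoid(F @ a + b)          # (H, W): malignant probability per patch
    r_flat = r.ravel()              # r = (r_1, ..., r_m)
    return np.sort(r_flat)[::-1]    # ranked r' used by all three MIL schemes
```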

2.1 Max Pooling-Based Multi-instance Learning

The general multi-instance assumption is that if there exists an instance that is positive, the bag is positive [7]. The bag is negative if and only if all instances are negative. For whole mammogram classification, the equivalent scenario is that if there exists a malignant mass, the mammogram I should be classified as positive. Likewise, a negative mammogram I should not contain any malignant masses. If we treat each patch Q_i of I as an instance, whole mammogram classification is a standard multi-instance task.


For negative mammograms, we expect all the r_i to be close to 0. For positive mammograms, at least one r_i should be close to 1. Thus, it is natural to use the maximum component of r as the malignant probability of the mammogram I:

p(y = 1 \mid I, \theta) = \max\{r_1, r_2, \ldots, r_m\},    (2)

where θ is the weights in the deep networks. If we sort r first in descending order, as illustrated in Fig. 1, the malignant probability of the whole mammogram I is the first element of the ranked r:

\{r'_1, r'_2, \ldots, r'_m\} = \mathrm{sort}(\{r_1, r_2, \ldots, r_m\}), \quad p(y = 1 \mid I, \theta) = r'_1, \quad p(y = 0 \mid I, \theta) = 1 - r'_1,    (3)

where r' = (r'_1, r'_2, \ldots, r'_m) is the descending-ranked r. The cross entropy-based cost function can be defined as

L_{\mathrm{maxpooling}} = -\frac{1}{N} \sum_{n=1}^{N} \log\bigl(p(y_n \mid I_n, \theta)\bigr) + \frac{\lambda}{2} \|\theta\|^2,    (4)

where N is the total number of mammograms, y_n ∈ {0, 1} is the true label of malignancy for mammogram I_n, and λ is the regularizer that controls model complexity. One disadvantage of max pooling-based MIL is that it only considers the patch Q'_1 (the patch of maximum malignant probability), and does not exploit information from other patches. A more powerful framework should add task-related priors, such as the sparsity of masses in the whole mammogram, into the general multi-instance assumption and explore more patches for training.
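A sketch of the max pooling-based loss of Eq. 4 in NumPy, assuming the ranked probabilities r' and the labels are already available (in the real network this loss is back-propagated through the CNN):

```python
import numpy as np

def max_pooling_mil_loss(r_sorted_batch, labels, theta, lam):
    """Max pooling-based MIL loss (Eq. 4). r_sorted_batch: list of descending-
    sorted patch probabilities r' (one array per mammogram); labels: y_n in
    {0, 1}; theta: flattened parameter vector; lam: regularization weight."""
    eps = 1e-12
    nll = 0.0
    for r, y in zip(r_sorted_batch, labels):
        p_pos = r[0]                          # p(y=1|I, theta) = r'_1
        p = p_pos if y == 1 else 1.0 - p_pos  # p(y_n | I_n, theta)
        nll += -np.log(p + eps)
    return nll / len(labels) + 0.5 * lam * np.sum(theta ** 2)
```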

2.2 Label Assignment-Based Multi-instance Learning

For conventional classification tasks, we assign a label to each data point. In the MIL scheme, if we consider each instance (patch) Q_i as a data point for classification, we can convert the multi-instance learning problem into a label assignment problem. After we rank the malignant probabilities r = (r_1, r_2, ..., r_m) for all the instances (patches) in a whole mammogram I using the first equation in Eq. 3, the first few r'_i should be consistent with the label of the whole mammogram, as previously mentioned, while the remaining patches (instances) should be negative. Instead of adopting the general MIL assumption that only considers Q'_1 (the patch of malignant probability r'_1), we assume that (1) the patches with the first k largest malignant probabilities {r'_1, r'_2, ..., r'_k} should be assigned the same class label as that of the whole mammogram, and (2) the remaining patches should be labeled as negative in the label assignment-based MIL. After the ranking/sorting layer using the first equation in Eq. 3, we can obtain the malignant probability for each patch:

p(y = 1 \mid Q'_i, \theta) = r'_i, \quad \text{and} \quad p(y = 0 \mid Q'_i, \theta) = 1 - r'_i.    (5)


The cross entropy loss function of the label assignment-based MIL can be defined as

L_{\mathrm{labelassign.}} = -\frac{1}{mN} \sum_{n=1}^{N} \Bigl( \sum_{j=1}^{k} \log\bigl(p(y_n \mid Q'_j, \theta)\bigr) + \sum_{j=k+1}^{m} \log\bigl(p(y = 0 \mid Q'_j, \theta)\bigr) \Bigr) + \frac{\lambda}{2} \|\theta\|^2.    (6)

One advantage of the label assignment-based MIL is that it explores all the patches to train the model. Essentially it acts as a kind of data augmentation, which is an effective technique for training deep networks when the training data is scarce. From the sparsity perspective, the optimization problem of label assignment-based MIL is exactly a k-sparse problem for the positive data points, where we expect {r'_1, r'_2, ..., r'_k} to be 1 and {r'_{k+1}, r'_{k+2}, ..., r'_m} to be 0. The disadvantage of label assignment-based MIL is that it is hard to estimate the hyper-parameter k. Thus, a relaxed assumption for the MIL or an adaptive way to estimate the hyper-parameter k is preferred.
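The corresponding sketch for Eq. 6, under the same assumptions as the previous snippet (precomputed ranked probabilities, plain NumPy, illustrative names):

```python
import numpy as np

def label_assignment_mil_loss(r_sorted_batch, labels, k, theta, lam):
    """Label assignment-based MIL loss (Eq. 6): the top-k ranked patches take
    the mammogram label, the remaining m-k patches are labelled negative."""
    eps = 1e-12
    total = 0.0
    m = len(r_sorted_batch[0])   # number of patches per mammogram (fixed)
    for r, y in zip(r_sorted_batch, labels):
        top = r[:k] if y == 1 else 1.0 - r[:k]   # p(y_n | Q'_j) for j <= k
        rest = 1.0 - r[k:]                       # p(y=0 | Q'_j) for j > k
        total += -(np.log(top + eps).sum() + np.log(rest + eps).sum())
    return total / (m * len(labels)) + 0.5 * lam * np.sum(theta ** 2)
```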

2.3 Sparse Multi-instance Learning

From the mass distribution, a mass typically comprises about 2% of the whole mammogram on average (Fig. 2), which means the mass region is quite sparse in the whole mammogram. It is straightforward to convert the mass sparsity into malignant mass sparsity, which implies that {r'_1, r'_2, ..., r'_m} is sparse in the whole mammogram classification problem. The sparsity constraint means we expect the malignant probabilities r'_i of a portion of the patches to be 0 or close to 0, which is equivalent to the second assumption in the label assignment-based MIL. Analogously, we expect r'_1 to be indicative of the true label of mammogram I. Following the above discussion, the loss function of the sparse MIL problem can be defined as

L_{\mathrm{sparse}} = \frac{1}{N} \sum_{n=1}^{N} \Bigl( -\log\bigl(p(y_n \mid I_n, \theta)\bigr) + \mu \|r_n\|_1 \Bigr) + \frac{\lambda}{2} \|\theta\|^2,    (7)

where p(y_n | I_n, θ) can be calculated by Eq. 3, r_n = (r'_1, r'_2, ..., r'_m) for mammogram I_n, ‖·‖_1 denotes the L1 norm, and μ is the sparsity factor, which is a trade-off between the sparsity assumption and the importance of patch Q'_1. From the discussion of label assignment-based MIL, that learning is a kind of exact k-sparse problem, which can be relaxed into an L1 constraint. One advantage of sparse MIL over label assignment-based MIL is that it does not require assigning a label to each patch, which is hard to do for patches whose probabilities are neither large nor small. The sparse MIL considers the overall statistical property of r. Another advantage of sparse MIL is that it has different weights for the general MIL assumption (the first part of the loss) and the label distribution within the mammogram (the second part of the loss), which can be considered as a trade-off between max pooling-based MIL (a slack assumption) and label assignment-based MIL (a hard assumption).
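And a sketch of the sparse MIL loss of Eq. 7, again assuming precomputed ranked probabilities and illustrative variable names:

```python
import numpy as np

def sparse_mil_loss(r_sorted_batch, labels, mu, theta, lam):
    """Sparse MIL loss (Eq. 7): a max-pooling term on r'_1 plus an L1 penalty
    on the whole ranked probability vector r_n, trading off both assumptions."""
    eps = 1e-12
    total = 0.0
    for r, y in zip(r_sorted_batch, labels):
        p = r[0] if y == 1 else 1.0 - r[0]                 # p(y_n | I_n), Eq. 3
        total += -np.log(p + eps) + mu * np.sum(np.abs(r))  # mu * ||r_n||_1
    return total / len(labels) + 0.5 * lam * np.sum(theta ** 2)
```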


3 Experiments

We validate the proposed models on the most frequently used mammographic mass classification dataset, the INbreast dataset [14], as the mammograms in other datasets, such as the DDSM dataset [4], are of low quality. The INbreast dataset contains 410 mammograms, of which 100 contain malignant masses. These 100 mammograms with malignant masses are defined as positive. For fair comparison, we also use 5-fold cross-validation to evaluate model performance, as in [6]. For each testing fold, we use three folds for training and one fold for validation to tune hyper-parameters. The performance is reported as the average of the five testing results obtained from cross-validation. We employ techniques to augment our data. For each training epoch, we randomly flip the mammograms horizontally, shift within 0.1 proportion of the mammogram size horizontally and vertically, rotate within 45 degrees, and set a 50 × 50 square box to 0. In our experiments, this data augmentation is essential for training the deep networks. For the CNN network structure, we use AlexNet and remove the fully connected layers [13]. Through the CNN, the mammogram of size 227 × 227 becomes 256 feature maps of size 6 × 6. Then we use the steps in Sect. 2 to do MIL. Here we employ weights pretrained on ImageNet due to the scarcity of data. We use Adam optimization with learning rate 5 × 10−5 for training the models [2]. The λ for max pooling-based and label assignment-based MIL is 1 × 10−5. The λ and μ for sparse MIL are 5 × 10−6 and 1 × 10−5, respectively. For the label assignment-based MIL, we select k from {1, 2, 4, 6, 8} based on the validation set.

We first compare our methods to previous models validated on the DDSM and INbreast datasets in Table 1. Previous hand-crafted feature-based methods require manually annotated detection bounding boxes or segmentation ground truth even at test time, denoted as manual [3,8,17]; feat. denotes the use of hand-crafted features.

Table 1. Accuracy comparisons of the proposed deep MILs and related methods on test sets

Methodology                      | Dataset | Set-up      | Accu        | AUC
Ball et al. [3]                  | DDSM    | Manual+feat | 0.87        | N/A
Varela et al. [17]               | DDSM    | Manual+feat | 0.81        | N/A
Domingues et al. [8]             | INbr    | Manual+feat | 0.89        | N/A
Pretrained CNN [6]               | INbr    | Auto.+feat  | 0.84 ± 0.04 | 0.69 ± 0.10
Pretrained CNN+Random Forest [6] | INbr    | Auto.+feat  | 0.91 ± 0.02 | 0.76 ± 0.23
AlexNet                          | INbr    | Auto.       | 0.81 ± 0.02 | 0.79 ± 0.03
AlexNet+Max Pooling MIL          | INbr    | Auto.       | 0.85 ± 0.03 | 0.83 ± 0.05
AlexNet+Label Assign. MIL        | INbr    | Auto.       | 0.86 ± 0.02 | 0.84 ± 0.04
AlexNet+Sparse MIL               | INbr    | Auto.       | 0.90 ± 0.02 | 0.89 ± 0.04


Pretrained CNN uses two CNNs to detect the mass region and segment the mass, followed by a third CNN to do mass classification on the detected ROI region, which requires hand-crafted features to pretrain the network and needs multi-stage training [6]. Pretrained CNN+Random Forest further employs a random forest and obtained a 7% improvement. These methods either rely on manual intervention, need hand-crafted features, or require multi-stage training, while our methods are fully automated, do not require hand-crafted features or extra annotations even on the training set, and can be trained in an end-to-end manner. The max pooling-based deep MIL obtains better performance than the pretrained CNN that uses three different CNNs and detection/segmentation annotations in the training set. This shows the superiority of our end-to-end trained deep MIL for whole mammogram classification. According to the accuracy metric, the sparse deep MIL is better than the label assignment-based MIL, which is better than the max pooling-based MIL. This result is consistent with the previous discussion: the sparsity assumption benefits from not having the hard constraints of the label assignment assumption, employs all the patches, and is more effective than the max pooling assumption.


Fig. 3. The visualization of predicted malignant probabilities for instances/patches in four resized mammograms. The first row is the resized mammogram. The red rectangle boxes are mass regions from the annotations on the dataset. The color images from the second row to the last row are the predicted malignant probability from logistic regression layer for (a) to (d) respectively, which are the malignant probabilities of patches/instances. Max pooling-based, label assignment-based, sparse deep MIL are in the second row, third row, fourth row respectively.


Our sparse deep MIL achieves accuracy competitive with the random forest-based pretrained CNN, while attaining a much higher AUC than previous work, which shows our method is more robust. The main reasons for the robust results using our models are as follows. Firstly, data augmentation is an important technique for increasing scarce training data and proves useful here. Secondly, the transfer learning that employs pretrained weights from ImageNet is effective for the INbreast dataset. Thirdly, our models fully explore all the patches to train our deep networks, thereby eliminating any possibility of overlooking malignant patches by only considering a subset of patches. This is a distinct advantage over previous networks that employ several stages consisting of detection and segmentation. To further understand our deep MIL, we visualize the responses of the logistic regression layer, which represent the malignant probability of each patch, for four mammograms from the test set in Fig. 3. We can see that the deep MIL learns not only the prediction of the whole mammogram, but also the prediction of malignant patches within the whole mammogram. Our models are able to learn the mass region of the whole mammogram without any explicit bounding box or segmentation ground truth annotation of the training data. The max pooling-based deep multi-instance network misses some malignant patches in (a), (c) and (d). The possible reason is that it only considers the patch of max malignant probability in training and the model is not well learned for all patches. The label assignment-based deep MIL mis-classifies some patches in (d). The possible reason is that the model sets a constant k for all the mammograms, which causes some mis-classifications for small masses. One of the potential applications of our work is that these deep MIL networks could be used to do weak mass annotation automatically, which provides evidence for the diagnosis.

4 Conclusion

In this paper, we propose end-to-end trained deep MIL for whole mammogram classification. Different from previous work using segmentation or detection annotations, we conduct mass classification based on the whole mammogram directly. We convert the general MIL assumption into a label assignment problem after ranking. Due to the sparsity of masses, sparse MIL is used for whole mammogram classification. Experimental results demonstrate more robust performance than previous work, even without detection or segmentation annotation in the training. In future work, we plan to extend the current work by: (1) incorporating multi-scale modeling such as spatial pyramids to further improve whole mammogram classification, (2) employing the deep MIL to do annotation or provide potential malignant patches to assist diagnosis, and (3) applying the method to larger datasets, where further improvement is expected.

References

1. American Cancer Society: What are the key statistics about breast cancer?
2. Ba, J., Kingma, D.: Adam: a method for stochastic optimization. In: ICLR (2015)


3. Ball, J.E., Bruce, L.M.: Digital mammographic computer aided diagnosis (CAD) using adaptive level set segmentation. In: EMBS (2007)
4. Bowyer, K., Kopans, D., Kegelmeyer, W., et al.: The digital database for screening mammography. In: IWDM (1996)
5. Carneiro, G., Nascimento, J., Bradley, A.P.: Unregistered multiview mammogram analysis with pre-trained deep learning models. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 652–660. Springer, Cham (2015). doi:10.1007/978-3-319-24574-4_78
6. Dhungel, N., Carneiro, G., Bradley, A.P.: The automated learning of deep features for breast mass classification from mammograms. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 106–114. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_13
7. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89(1), 31–71 (1997)
8. Domingues, I., Sales, E., Cardoso, J., Pereira, W.: INbreast-database masses characterization. In: XXIII CBEB (2012)
9. Greenspan, H., van Ginneken, B., Summers, R.M.: Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE TMI 35(5), 1153–1159 (2016)
10. Hou, L., Samaras, D., Kurc, T.M., et al.: Patch-based convolutional neural network for whole slide tissue image classification. arXiv:1504.07947 (2015)
11. Jiao, Z., Gao, X., Wang, Y., Li, J.: A deep feature based framework for breast masses classification. Neurocomputing 197, 221–231 (2016)
12. Kooi, T., Litjens, G., van Ginneken, B., et al.: Large scale deep learning for computer aided detection of mammographic lesions. Med. Image Anal. 35, 303–312 (2017)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
14. Moreira, I.C., Amaral, I., Domingues, I., et al.: INbreast: toward a full-field digital mammographic database. Academic Radiology (2012)
15. Oeffinger, K.C., Fontham, E.T., Etzioni, R., et al.: Breast cancer screening for women at average risk: 2015 guideline update from the American Cancer Society. JAMA (2015)
16. Shen, W., Zhou, M., Yang, F., Dong, D., Yang, C., Zang, Y., Tian, J.: Learning from experts: developing transferable deep features for patient-level lung cancer prediction. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 124–131. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_15
17. Varela, C., Timp, S., Karssemeijer, N.: Use of border information in the classification of mammographic masses. Phys. Med. Biol. 51(2), 425 (2006)
18. Yan, Z., Zhan, Y., Peng, Z., et al.: Multi-instance deep learning: discover discriminative local anatomies for bodypart recognition. IEEE Trans. Med. Imaging 35(5), 1332–1343 (2016)
19. Zhu, W., Lan, C., Xing, J., et al.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: AAAI (2016)
20. Zhu, W., Miao, J., Qing, L., Huang, G.B.: Hierarchical extreme learning machine for unsupervised representation learning. In: IJCNN, pp. 1–8. IEEE (2015)
21. Zhu, W., Xie, X.: Adversarial deep structural networks for mammographic mass segmentation. arXiv:1612.05970 (2016)

Segmentation-Free Kidney Localization and Volume Estimation Using Aggregated Orthogonal Decision CNNs

Mohammad Arafat Hussain1, Alborz Amir-Khalili1, Ghassan Hamarneh2, and Rafeef Abugharbieh1

1 BiSICL, University of British Columbia, Vancouver, BC, Canada
{arafat,alborza,rafeef}@ece.ubc.ca
2 Medical Image Analysis Lab, Simon Fraser University, Burnaby, BC, Canada
[email protected]

Abstract. Kidney volume is an important bio-marker in the clinical diagnosis of various renal diseases. For example, it plays an essential role in follow-up evaluation of kidney transplants. Most existing methods for volume estimation rely on kidney segmentation as a prerequisite step, which has various limitations such as initialization-sensitivity and computationally-expensive optimization. In this paper, we propose a hybrid localization-volume estimation deep learning approach capable of (i) localizing kidneys in abdominal CT images, and (ii) estimating renal volume without requiring segmentation. Our approach involves multiple levels of self-learning of image representation using convolutional neural layers, which we show better capture the rich and complex variability in kidney data, demonstrably outperforming hand-crafted feature representations. We validate our method on clinical data of 100 patients with a total of 200 kidney samples (left and right). Our results demonstrate a 55% increase in kidney boundary localization accuracy, and a 30% increase in volume estimation accuracy compared to recent state-of-the-art methods deploying regression-forest-based learning for the same tasks.

1 Introduction

Chronic kidney disease (CKD) refers to the reduced or absent functionality of kidneys for more than 3 months, which has been identified as a major risk factor for death worldwide [1]. It is reported that about 3 million adults in Canada [1] and about 26 million adults in the United States [2] live with CKD. Detection of CKD is difficult. Biochemical tests like the 'estimated glomerular filtration rate' and 'serum albumin-to-creatinine ratio' have been shown to be unreliable in detecting disease and tracking its progression [3]. However, CKD is most often associated with an abnormal change in kidney volume and thus, quantitative 'kidney volume' has emerged as a potential surrogate marker for renal function and has become useful for predicting and tracking the progression of CKD [4].


In addition, kidney volume has been used in evaluating the split renal function in kidney donors as well as in follow-up evaluation of kidney transplants [4]. Kidney volume measurement from 3D CT volumes in clinical environments typically requires a two-step procedure: (i) localizing kidneys, and then (ii) estimating volumes of the localized kidneys. For years, kidneys were typically localized manually in clinical settings. To automate this process, Criminisi et al. [5,6] proposed regression-forest (RF)-based anatomy localization methods that predict the boundary wall locations of a tight region-of-interest (ROI) encompassing a particular organ. However, their reported boundary localization error for typical healthy kidneys (size ∼13 × 7.5 × 2.5 cm3) was ∼16 mm, which could significantly affect the subsequent volume estimation, depending on the volume estimation process. Cuingnet et al. [7] fine-tuned the method in [5] by using an additional RF, which improved the kidney localization accuracy by ∼60%. However, the authors mentioned (in [7]) that their method was designed and validated with only non-pathological kidneys. Recently, Lu et al. [8] proposed a right-kidney localization method using a cross-sectional fusion of convolutional neural networks (CNN) and fully convolutional networks (FCN) and reported a kidney centroid localization error of ∼8 mm. However, the robustness of this right-kidney-based deep learning model in localizing both kidneys is yet to be tested, as the surrounding anatomy, and often the locations, shapes and sizes, of left and right kidneys are completely different and non-symmetric.

Subsequent to localization, kidney volumes are typically estimated using different segmentation methods, a strategy that has various limitations. For example, graph cuts and active contours/level sets-based methods are sensitive to the choice of parameters, prone to leaking through weak anatomical boundaries, and require considerable computations [4]. Recently, Cuingnet et al. [7] used a combination of RF and template deformation to segment kidneys, while Yang et al. [9] used multi-atlas image registration; these methods rely extensively on prior knowledge of kidney shapes. However, building a realistic model of kidney shape variability and deciding the balance between trusting the model vs. the data are non-trivial tasks. In addition, these prior-shape-based methods are likely to fail for pathological cases, e.g., presence of large exophytic tumors. To overcome these segmentation-based limitations and associated computational overhead, Hussain et al. [4] recently proposed a segmentation-free kidney volume estimation approach using a dual RF, which bypassed the segmentation step altogether. Although promising, their approach relied on manual localization of kidneys in abdominal CT. In addition, they used hand-engineered features that may be difficult to optimally design.

In this paper, we propose a hybrid deep learning approach that simultaneously addresses the combined challenges of (i) automatic kidney localization and (ii) segmentation-free volume estimation. Our method uses an effective deep CNN-based approach for tight kidney ROI localization, where orthogonal 2D slice-based probabilities of containing kidney cross-sections are aggregated into a voxel-based decision that ultimately predicts whether an interrogated voxel sits inside or outside of a kidney ROI.


In addition, the second novel module in our hybrid localization-segmentation approach is the direct estimation of kidney volume using a deep CNN that skips the segmentation procedure. To the best of our knowledge, our segmentation-free method is the first that uses a deep CNN for kidney volume estimation. We estimate slice-based cross-sectional kidney areas followed by integration of these values across the axial kidney span to produce the volume estimate. Our hybrid approach involves multi-level learning of image representations using convolutional layers, which demonstrably better capture the complex variability in the kidney data, outperforming the hand-crafted feature representations used in [4,6,7].

2 Materials and Methods

2.1 Data

Our clinical dataset consisted of 100 patient scans accessed from Vancouver General Hospital (VGH) records with all the required ethics approvals in place. Of the 100 scans, 45 had involved the use of contrast agents. We accumulated a total of 200 kidney samples (both left and right kidneys), among which 140 samples (from 70 randomly chosen patients) were used for training and the rest for testing. Our dataset included 12 pathological kidney samples, and our training and test data contained 6 cases each. The in-plane pixel size ranged from 0.58 to 0.98 mm and the slice thickness ranged from 1.5 to 3 mm. Ground truth kidney volumes were calculated from kidney delineations performed by expert radiologists. Both CNNs were implemented using Caffe [10] and were trained on a workstation with 2 Intel 3 GHz Xeon processors and an Nvidia GeForce GTX 460 GPU with 1 GB of VRAM. The base learning rate for training both CNNs was set to 0.01 and was decreased by a factor of 0.1 to 0.0001 over 25000 iterations. For basic pre-processing of the data, we programmed an automatic routine for abdominal CT that separates the left and right kidneys. Since the left and right kidneys always fall in separate half volumes, the routine simply divided the abdominal CT volume medially along the left-right direction. Relying on the DICOM attributes (i.e., slice thickness and total number of axial slices), the routine also discarded a few slices in the pelvic region from an image (where applicable). However, this step was optional and only carried out on slices beyond ∼52 cm (4x the typical kidney length) from the chest side of the image. Finally, our data pre-processing routine re-sized the medially separated CT volumes to generate fixed-resolution cubic volumes (e.g., 256 × 256 × 256 voxels in our case) using either interpolation or decimation, as needed. Augmentation of training samples was also done by flipping and rotating the 2D axial slices.
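A minimal sketch of this pre-processing routine, assuming a NumPy CT volume ordered as (slices, rows, columns); the left/right assignment depends on patient orientation, and the DICOM-based slice discarding is omitted:

```python
import numpy as np
from scipy.ndimage import zoom

def split_and_resize(ct_volume, out_size=256):
    """Split the abdominal CT medially into two half volumes (one per kidney)
    and rescale each to a fixed cubic grid, by interpolation or decimation.
    Which half is 'left' vs. 'right' depends on the scan orientation."""
    depth, height, width = ct_volume.shape
    halves = {'half_a': ct_volume[:, :, :width // 2],
              'half_b': ct_volume[:, :, width // 2:]}
    resized = {}
    for side, half in halves.items():
        factors = [out_size / s for s in half.shape]
        resized[side] = zoom(half, factors, order=1)   # ~256^3 voxel volume
    return resized
```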

2.2 Kidney Localization

We use a deep CNN to predict the locations of the six walls of the tight ROI boundary around a kidney by aggregating individual probabilities associated with three intersecting orthogonal (axial, coronal and sagittal) image slices (Fig. 1(b)).


Fig. 1. Example kidney data from our patient pool demonstrating the data variability present (ranging from normal to pathological), and our hybrid kidney localization-volume estimation approach. (a) Some CT snapshots showing variations in normal and pathological kidney shape and size, (b) orthogonal decision aggregated CNN for kidney localization, and (c) segmentation-free kidney volume estimation using a deep CNN.

The CNN has eleven layers excluding the input. It has five convolutional layers, three fully connected layers, two softmax layers and one additive layer. All but the last three layers contain trainable weights. The input is a 256 × 256 pixel image slice, either from the axial, coronal or sagittal direction, sampled from the initially generated local kidney-containing volumes. We train this single CNN (from layer 1 to layer 8) using a dataset containing a mix of equal numbers of image slices from the three orthogonal directions. Convolutional layers are typically used for sequentially learning high-level non-linear spatial image features (e.g., object edges, intensity variations, orientations of objects, etc.).


Subsequent fully connected layers prepare these features for optimal classification of the object (e.g., kidney cross-section) present in the image. In our case, five convolutional layers followed by three fully connected layers make a reasonable decision on whether an orthogonal image slice includes a kidney cross-section or not. During testing, three different orthogonal image slices are fed in parallel to this CNN and the probabilities of each slice being a kidney slice or not are acquired at the softmax layer S1 (Fig. 1(b)). These individual probabilities from the three orthogonal image slices are added class-wise (i.e., containing a kidney cross-section [class: 1] or not [class: 0]) in the additive layer, shown as P1 and P0 in Fig. 1(b). We then use a second softmax layer S2 with P1 and P0 as inputs. This layer decides whether the voxel where the three orthogonal input slices intersect is inside or outside the tight kidney ROI. The second softmax layer S2 is included to nullify any potential mis-classification by the first softmax layer S1. The voxels having probability (∈[0,1]) >0.5 at S2 are considered to be inside the kidney ROI. Finally, the locations of the six boundary walls (two in each orthogonal direction) of this ROI are recorded from the maximum span of the distribution of these voxels (with probability >0.5) along the three orthogonal directions.
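A sketch of the orthogonal decision aggregation, assuming the per-slice class probabilities from S1 have already been computed for all 256 slices in each direction (array names are ours, and at least one voxel is assumed to be classified as kidney):

```python
import numpy as np

def kidney_roi_from_orthogonal_cnn(prob_axial, prob_coronal, prob_sagittal):
    """Aggregate per-slice class probabilities into per-voxel decisions and
    derive the tight kidney ROI. Each prob_* has shape (256, 2) holding
    [p(no kidney), p(kidney)] for every slice of the 256^3 half volume.
    Summing the three slice probabilities that intersect at a voxel mimics the
    additive layer; comparing the aggregated class scores plays the role of
    the second softmax S2 (probability > 0.5)."""
    p1 = (prob_axial[:, 1][:, None, None] +
          prob_coronal[:, 1][None, :, None] +
          prob_sagittal[:, 1][None, None, :])
    p0 = (prob_axial[:, 0][:, None, None] +
          prob_coronal[:, 0][None, :, None] +
          prob_sagittal[:, 0][None, None, :])
    inside = p1 > p0
    walls = [(int(ax.min()), int(ax.max())) for ax in np.nonzero(inside)]
    return inside, walls   # three (min, max) pairs = six ROI boundary walls
```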

2.3 Segmentation-Free Volume Estimation

In Sect. 2.2, we estimated the tight ROI encompassing the kidney. Typically, kidney shape and appearance vary across patients (Fig. 1(a)). Training our CNN requires 2D image patches of consistent size. In addition, the patch size needs to be universal so that kidney cross-sections are always contained in it regardless of their size and shape. Therefore, to generate training data, we choose a patch size of 120 × 120 pixels, making sure that the cross-section of the initially estimated kidney ROI is at its centre. We also ensure that there is enough free space around a kidney cross-section as well as at the top and bottom in the axial direction. The ratio between the number of pixels falling inside a kidney cross-section and the total number of pixels in the image patch (120 × 120 pixels) is used as the output variable (label) for that particular image patch. We estimate the cross-sectional area of a kidney in each slice using the deep CNN shown in Fig. 1(c). The CNN performs regression and has seven layers excluding the input. It has four convolutional layers, three fully connected layers, and one Euclidean loss layer. We also use dropout layers along with the first two fully connected layers in order to avoid over-fitting. As mentioned earlier, the input is a 120 × 120 pixel image patch and the output is the ratio of kidney pixels to the total image size. The CNN is trained by minimizing the Euclidean loss between the desired and predicted values. Once the CNN model is trained, we deploy the model to predict the kidney area in a particular image patch. Finally, the volume of a particular kidney is estimated by integrating the predicted areas over all of its image patches in the axial direction.
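A sketch of the final integration step; the conversion of the predicted area ratios into physical units is our assumption about how the per-slice outputs are combined, and the spacing parameters are hypothetical placeholders:

```python
import numpy as np

def kidney_volume(area_ratios, pixel_spacing_mm, slice_thickness_mm, patch_px=120):
    """Integrate per-slice CNN outputs into a volume estimate. Each entry of
    area_ratios is the predicted fraction of kidney pixels in the 120x120 patch
    of one axial slice; converting fractions to physical areas and summing
    across the axial direction gives the volume (returned in ml)."""
    ratios = np.asarray(area_ratios, dtype=float)
    patch_area_mm2 = (patch_px * pixel_spacing_mm) ** 2
    volume_mm3 = np.sum(ratios * patch_area_mm2) * slice_thickness_mm
    return volume_mm3 / 1000.0   # mm^3 -> ml
```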


3 Results

We provide results of our proposed kidney localization and volume estimation stages separately to enable direct comparisons with those obtained by recently reported kidney localization [5–8] and volume estimation methods [4,7,9,11,12]. Since the recently reported methods we use for comparison are mostly either RF-based or deep CNN-based, reproducing their results is impossible without access to the code and data on which they were trained. However, since the type of data on which these methods were validated is similar to ours in terms of resolution and imaging modality, we conservatively use their reported accuracy values for comparison, rather than using our own implementation of their models.

Table 1. Comparison of mean kidney ROI boundary localization error (mm) and mean kidney ROI centroid localization error (mm) in terms of Euclidean distance. Values not reported are shown with (-).

Methods                      | Boundary error (mm), Left | Boundary error (mm), Right | Centroid error (mm), Left | Centroid error (mm), Right
Cascaded RF (MICCAI'12) [7]  | 7.00 ± 10.0               | 7.00 ± 6.00                | 11.0 ± 18.0               | 10.0 ± 12.0
RF1 (MCV'10) [5]             | 17.3 ± 16.5               | 18.5 ± 18.0                | -                         | -
RF2 (MedIA'13) [6]           | 13.6 ± 12.5               | 16.1 ± 15.5                | -                         | -
CS-(CNN+FCN) (LABELS'16) [8] | -                         | -                          | -                         | 7.80 ± 9.40
Proposed Method              | 6.19 ± 6.02               | 5.86 ± 6.40                | 7.71 ± 4.91               | 7.56 ± 4.10

In Table 1, we present kidney boundary and centroid localization performance comparisons of the cascaded RF-based [7], single RF-based (RF1 [5] and RF2 [6]), cross-sectional (CS) fusion of CNN and FCN-based [8], and our proposed method. The cascaded RF method used RF1 [5] for coarse localization of both left and right kidneys, then fine-tuned these locations using an additional RF per left/right kidney. Even then, its centroid localization errors and boundary localization errors were higher than those of the proposed method. RF2 [6] was an incremental work over RF1 [5], and both use regression forests for different anatomy localization. Both methods exhibited higher boundary localization errors than those of the cascaded RF and proposed methods, and did not report any centroid localization accuracy. The recently proposed CS-(CNN+FCN) [8] method reported significantly better kidney centroid localization performance with respect to the cascaded RF [7]. However, this method was only validated on right kidneys, and did not report the kidney boundary localization accuracy. As evident from the quantitative results, compared to all these recent methods, the proposed method demonstrates better performance in both kidney boundary and centroid localization by producing the lowest localization errors in all categories.


Table 2. Volume estimation accuracies compared to state-of-the-art competing methods. Values not reported are shown with (-).

Method types | Methods                         | Kidney samples | Mean volume error (%) | Mean Dice index
Manual       | Ellipsoid Fit (Urology'14) [11] | 44             | 14.20 ± 13.56         | -
Seg          | RF+Template (MICCAI'12) [7]     | 358            | -                     | 0.752 ± 0.222^a
Seg          | Atlas-based (EMBS'14) [9]       | 22             | -                     | 0.952 ± 0.018
Seg-free     | Single RF (MICCAI'14) [12]      | 44             | 36.14 ± 20.86         | -
Seg-free     | Dual RF (MLMI'16) [4]           | 44             | 9.97 ± 8.69           | -
Seg-free     | Proposed Method                 | 60             | 7.01 ± 8.63           | -
^a Estimated from reported Dice quartile values (in [7]) using the method in [13].

Table 2 shows quantitative comparative results of our direct volume estimation module (including the localization step) with those obtained by a manual ellipsoid fitting method, two segmentation-based methods, and two segmentation-free regression-forest-based methods. We use the mean volume errors by [11,12] reported in [4] for comparison. For the manual approach [11], we see in Table 2 that the estimated mean volume error for this approach is approximately 14% with a high standard deviation. Then we consider two segmentation-based methods [7,9]. These methods reported their volume estimation accuracy in terms of the Dice similarity coefficient (DSC), which does not relate linearly to the percentage of volume error. Since segmentation-free methods do not perform any voxel classification, DSC cannot be calculated for these methods. Therefore, it is difficult to directly compare DSC performance to the percentage of volume estimation error. However, [7] used 2 RFs, an ellipsoid fitting and subsequent template deformation for kidney segmentation. Even then, the authors in [7] admitted that this method did not correctly detect/segment about 20% of left and 20% of right kidneys (DSC 3 and >0.5 respectively for the human readers and the machine classifier. CT-based diagnosis of the two human readers, including on the validation set, were found to be 0.69 and 0.73 respectively. Interestingly, Ipris outperformed both human readers.


5 Conclusion

In this work, we presented Ipris, a novel radiomic method, to automatically distinguish between benign and malignant nodules on routine lung CT scans. Ipris attempts to capture the transitional heterogeneity from the intra- to the perinodular space and exploits the fact that the transitional patterns may be substantially different between benign and malignant nodules on CT scans. On an independent validation set, Ipris was compared against well-established radiomic features and the interpretations of two human readers. Ipris yielded a better performance compared to established radiomic features in terms of both classification AUC and computational efficiency. Significantly, Ipris was also found to perform substantially better compared to two human expert readers, a pulmonologist and a thoracic radiologist with 7 years of experience reading chest CT scans. Additionally, even though this was not explicitly evaluated, Ipris appears to be robust to the slice thickness of the CT scans, since the datasets considered in this work involved 1–5 mm thickness. Future work will involve integrating Ipris with established radiomic features to further improve classification performance.

Acknowledgments. Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under award numbers 1U24CA199374-01, R01CA202752-01A1, R01CA208236-01A1, R21CA179327-01, R21CA195152-01; the National Institute of Diabetes and Digestive and Kidney Diseases under award number R01DK098503-02; the National Center for Research Resources under award number 1 C06 RR12463-01; the DOD Prostate Cancer Synergistic Idea Development Award (PC120857); the DOD Lung Cancer Idea Development New Investigator Award (LC130463); the DOD Prostate Cancer Idea Development Award; the DOD Peer Reviewed Cancer Research Program W81XWH-16-1-0329, W81XWH-15-1-0613; the Case Comprehensive Cancer Center Pilot Grant; the VelaSano Grant from the Cleveland Clinic; and the Wallace H. Coulter Foundation Program in the Department of Biomedical Engineering at Case Western Reserve University. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


Transferable Multi-model Ensemble for Benign-Malignant Lung Nodule Classification on Chest CT

Yutong Xie1, Yong Xia1, Jianpeng Zhang1, David Dagan Feng2,5, Michael Fulham2,3,4, and Weidong Cai2

1 Shaanxi Key Lab of Speech & Image Information Processing (SAIIP), School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, People's Republic of China
[email protected]
2 Biomedical and Multimedia Information Technology (BMIT) Research Group, School of Information Technologies, University of Sydney, Sydney, NSW 2006, Australia
3 Department of Molecular Imaging, Royal Prince Alfred Hospital, Camperdown, NSW 2050, Australia
4 Sydney Medical School, University of Sydney, Sydney, NSW 2006, Australia
5 Med-X Research Institute, Shanghai Jiaotong University, Shanghai 200030, China

Abstract. The classification of benign versus malignant lung nodules using chest CT plays a pivotal role in the early detection of lung cancer, and this early detection has the best chance of cure. Although deep learning is now the most successful solution for image classification problems, it requires a myriad of training data, which are not usually readily available for most routine medical imaging applications. In this paper, we propose the transferable multi-model ensemble (TMME) algorithm to separate malignant from benign lung nodules using limited chest CT data. This algorithm transfers the image representation abilities of three ResNet-50 models, which were pre-trained on the ImageNet database, to characterize the overall appearance, heterogeneity of voxel values and heterogeneity of shape of lung nodules, respectively, and jointly utilizes them to classify lung nodules with an adaptive weighting scheme learned during the error back propagation. Experimental results on the benchmark LIDC-IDRI dataset show that our proposed TMME algorithm achieves a lung nodule classification accuracy of 93.40%, which is markedly higher than the accuracy of seven state-of-the-art approaches.

Keywords: Lung nodule classification · Computed tomography (CT) · Deep learning · Ensemble learning

Electronic supplementary material The online version of this chapter (doi:10.1007/978-3-319-66179-7_75) contains supplementary material, which is available to authorized users.


1 Introduction

The 2015 global cancer statistics show that lung cancer accounts for approximately 13% of the 14.1 million new cancer cases and 19.5% of cancer-related deaths each year [1]. The 5-year survival for patients with an early diagnosis is approximately 54%, as compared to 4% if the diagnosis is made late, when the patient has stage IV disease [2]. Hence, early diagnosis and treatment are the most effective means to improve lung cancer survival. The National Lung Screening Trial [3] showed that screening with CT will result in a 20% reduction in lung cancer deaths. On chest CT scans, a "spot" on the lung, less than 3 cm in diameter, is defined as a lung nodule, which can be benign or malignant [4]. Malignant nodules may be primary lung tumors or metastases, and so the classification of lung nodules is critical for best patient care.

Radiologists typically read chest CT scans on a slice-by-slice basis, which is time-consuming, expensive and prone to operator bias, and requires a high degree of skill and concentration. Computer-aided lung nodule classification avoids many of these issues and has attracted a lot of research attention. Most solutions in the literature are based on using hand-crafted image features to train a classifier, such as the support vector machine (SVM), artificial neural network and so on. For instance, Han et al. [5] extracted Haralick and Gabor features and local binary patterns to train a SVM for lung nodule classification. Dhara et al. [6], meanwhile, used computed shape-based, margin-based and texture-based features for the same purpose. More recently, deep learning, particularly the deep convolutional neural network (DCNN), has become the most successful image classification technique and it provides a unified framework for joint feature extraction and classification [7]. Hua et al. applied the DCNN and deep belief network (DBN) to separate benign from malignant lung nodules. Shen et al. [7] proposed a multi-crop convolutional neural network (MC-CNN) for lung nodule classification. Despite improved accuracy, these deep models have not achieved the same performance on routine lung nodule classification as they have in the famous ImageNet Challenge. The suboptimal performance is attributed mainly to the overfitting of deep models due to inadequate training data, as there is usually a small dataset in medical image analysis, which relates to the work required in acquiring the image data and then in image annotation. Hu et al. [8] proposed a deep transfer metric learning method to transfer discriminative knowledge from a labeled source domain to an unlabeled target domain to overcome this limitation.

A major difference between traditional and deep learning methods is that traditional methods rely more on domain knowledge, such as the high correspondence between nodule malignancy and heterogeneity in voxel values (HVV) and heterogeneity in shapes (HS) [9], whereas deep learning relies on access to massive datasets. Ideally, the advantages of both should be employed. Chen et al. [10] fused heterogeneous Haralick features, histogram of oriented gradient (HOG) features and features derived from the deep stacked denoising autoencoder and DCNN at the decision level to predict nine semantic labels of lung nodules.
In our previous work [11], we used a texture descriptor and a shape descriptor to explore the heterogeneity of nodules in voxel values and shape, respectively, and combined both descriptors with the features learned by a nine-layer DCNN for nodule classification. Although improved accuracy was


reported, this method still relies on hand-crafted features to characterize the heterogeneity of nodules, which are less effective. Recently, Hu et al. [8] reported that the image representation ability of DCNNs, learned from large-scale datasets, can be transferred to solving generic small-data visual recognition problems. Hence, we suggest transferring the DCNN's image representation ability to characterize the overall appearance of lung nodule images, as well as the nodule heterogeneity in voxel values and shape.

In this paper, we propose a transferable multi-model ensemble (TMME) algorithm for benign-malignant lung nodule classification on chest CT. The main contributions of this algorithm are: (1) three types of image patches are designed to fine-tune three pre-trained ResNet-50 models, aiming to characterize the overall appearance (OA), HVV and HS of each nodule slice, respectively; and (2) these three ResNet-50 models are used jointly to classify nodules with an adaptive weighting scheme learned during error back-propagation, which enables our model to be trained in an end-to-end manner. We compared our algorithm to seven state-of-the-art lung nodule classification approaches on the benchmark LIDC-IDRI dataset. Our results suggest that the proposed algorithm provides a substantial performance improvement.

2 Data and Materials

The benchmark LIDC-IDRI database [12] was used for this study, in which the nodules were evaluated on five levels, from benign to malignant, by up to four experienced thoracic radiologists. The mode of the levels given by the radiologists was defined as the composite level of malignancy. Following [5–7, 13–15], we only considered nodules ≥ 3 mm in diameter and treated the 873 nodules with a composite level of 1 or 2 as benign and the 484 nodules with a composite level of 4 or 5 as malignant.

3 Algorithm

We have summarized our proposed TMME algorithm in Fig. 1. The algorithm has three steps: (1) extracting the region of interest (ROI) for preprocessing and data augmentation, (2) building a TMME model for slice-based nodule classification, and (3) classifying each nodule based on the labels of its slices.

3.1 Preprocessing and Data Augmentation

A lung nodule is presented in multiple slices. On each slice, a square ROI encapsulating the nodule is identified using the tool developed in [16] to represent the nodule's OA. To characterize the nodule's HVV, non-nodule voxels within the ROI are set to 0 and, if the ROI is larger than 16 × 16, the 16 × 16 patch that contains the maximum number of nodule voxels is extracted. To describe the nodule's HS, nodule voxels inside the ROI are set to 255. The OA patch, HVV patch and HS patch are then resized to 200 × 200 using bicubic interpolation. Four augmented copies of each training sample are


Fig. 1. Framework of our proposed TMME algorithm (input lung nodule slice → nodule ROIs → OA, HVV and HS patches → three ResNet-50 models → adaptive weights W → benign/malignant prediction).

generated using rotation, shear, horizontal or vertical flipping, and translation with random parameters, and are added to the enlarged training set.
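As an illustration, the sketch below shows how the three patch types described above could be built from a square nodule ROI and its binary nodule mask; this is a minimal NumPy/OpenCV sketch under our own assumptions, not the authors' code, and the function name make_patches and the use of cv2 for bicubic resizing are illustrative choices.

```python
import numpy as np
import cv2  # used here only for bicubic resizing (assumption, not from the paper)

def make_patches(roi, mask, hvv_size=16, out_size=200):
    """Build OA, HVV and HS patches for one nodule slice (illustrative sketch).
    roi  : 2D array, square ROI around the nodule on a CT slice
    mask : 2D binary array of the same size, 1 inside the nodule
    """
    # OA patch: the ROI as-is, capturing the overall appearance.
    oa = roi.copy()

    # HVV patch: zero out non-nodule voxels, then keep the 16x16 window
    # containing the most nodule voxels if the ROI is larger than 16x16.
    hvv = np.where(mask > 0, roi, 0)
    h, w = hvv.shape
    if h > hvv_size and w > hvv_size:
        best, best_rc = -1, (0, 0)
        for r in range(h - hvv_size + 1):
            for c in range(w - hvv_size + 1):
                n = mask[r:r + hvv_size, c:c + hvv_size].sum()
                if n > best:
                    best, best_rc = n, (r, c)
        r, c = best_rc
        hvv = hvv[r:r + hvv_size, c:c + hvv_size]

    # HS patch: saturate nodule voxels to 255 so that only the shape stands out.
    hs = roi.copy()
    hs[mask > 0] = 255

    # Resize every patch to 200x200 with bicubic interpolation.
    def resize(p):
        return cv2.resize(p.astype(np.float32), (out_size, out_size),
                          interpolation=cv2.INTER_CUBIC)

    return resize(oa), resize(hvv), resize(hs)
```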

3.2 TMME for Nodule Slice Classification

The ResNet-50 model [17], pre-trained on the ImageNet dataset, is adopted. Two neurons in the last fully connected layer are randomly selected, and the other neurons, together with the weights attached to them, are removed. Three copies of this ResNet-50 are fine-tuned using all OA, HVV and HS patches in the enlarged training set, respectively, to adapt them to characterizing nodule slices. Denoting the prediction vector produced by each ResNet-50 by $X_i = (x_{i1}, x_{i2})$ $(i = 1, 2, 3)$, the ultimate prediction vector of the ensemble model is calculated as:

$$P_k = \sum_{i=1}^{3} \sum_{j=1}^{2} \omega_{ijk}\, x_{ij}, \quad k = 1, 2 \qquad (1)$$

where $P_k$ is the predicted likelihood of the input belonging to category $k$, and $\omega_{ijk}$ is the weight that connects $x_{ij}$ and $P_k$. Thus, the integrated loss of this ensemble model can be formulated as:

$$L(y, P) = \ln\Big(\sum_{j=1}^{2} e^{P_j}\Big) - P_y, \qquad (2)$$

where $y \in \{1, 2\}$ is the input's true label and $P = (P_1, P_2)$. The change of the weight $\omega_{ijk}$ in the ensemble model is proportional to the descent along the gradient:

$$\Delta\omega_{ijk} = -\eta\, \frac{\partial L(y, P)}{\partial \omega_{ijk}} = -\eta\, x_{ij} \left( \frac{e^{P_k}}{\sum_{m=1}^{2} e^{P_m}} - \delta_{ky} \right), \qquad (3)$$

where $\eta$ represents the learning rate, and $\delta_{ky} = 1$ if $k = y$, otherwise $\delta_{ky} = 0$.


Since our training dataset is small, the learning rate is set to 0.00001 and stochastic gradient descent with a batch size of 100 is adopted. Moreover, 10% of the training patches are set aside to form a validation set, and training is terminated, even before reaching the maximum of 50 iterations, if the error on the other 90% of training images continues to decline while the error on the validation set stops decreasing.
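The following minimal PyTorch sketch illustrates the adaptive weighting of Eqs. (1)-(3): the weights ω_ijk form a learnable 3 × 2 × 2 tensor that fuses the three per-model prediction vectors, and since the loss in Eq. (2) is exactly the softmax cross-entropy on the fused scores, automatic differentiation reproduces the update of Eq. (3) up to the learning rate. Class labels are zero-indexed here, and all names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaptiveEnsemble(nn.Module):
    """Learnable fusion of three 2-dim prediction vectors: P_k = sum_ij w_ijk * x_ij (Eq. 1)."""
    def __init__(self, n_models=3, n_classes=2):
        super().__init__()
        # omega[i, j, k] connects component x_ij to the fused score P_k.
        self.omega = nn.Parameter(torch.full((n_models, n_classes, n_classes),
                                             1.0 / (n_models * n_classes)))

    def forward(self, x):
        # x: (batch, n_models, n_classes) predictions from the three ResNet-50 models.
        return torch.einsum('bij,ijk->bk', x, self.omega)   # (batch, n_classes)

# Eq. (2) equals ln(sum_j exp(P_j)) - P_y, i.e. the usual softmax cross-entropy,
# so its gradient w.r.t. omega matches Eq. (3).
fuse = AdaptiveEnsemble()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(fuse.parameters(), lr=1e-5)

x = torch.rand(4, 3, 2)          # toy batch of per-model prediction vectors
y = torch.tensor([0, 1, 1, 0])   # true labels (0 = benign, 1 = malignant)
loss = criterion(fuse(x), y)     # integrated loss of the ensemble
loss.backward()
optimizer.step()
```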

3.3 Nodule Classification

Let a lung nodule $W$ be contained in $S$ slices, denoted by $W = \{W_1, W_2, \ldots, W_S\}$. Feeding the $i$-th slice $W_i$ into the TMME model, we obtain a two-dimensional prediction vector $H(W_i)$. The class label of nodule $W$ is assigned based on the sum of the predictions made on each slice:

$$\hat{y}(W) = \arg\max_{k \in \{1,2\}} \sum_{i=1}^{S} H_k(W_i) \qquad (4)$$
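A one-function NumPy sketch of this slice-to-nodule rule (Eq. 4) is given below: the per-slice prediction vectors are summed and the class with the largest total is returned. The function name is hypothetical.

```python
import numpy as np

def classify_nodule(slice_predictions):
    """slice_predictions: (S, 2) array of per-slice prediction vectors H(W_i).
    Returns 0 (benign) or 1 (malignant), following Eq. (4)."""
    return int(np.argmax(np.sum(slice_predictions, axis=0)))
```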

4 Results

The proposed TMME algorithm was applied to the LIDC-IDRI dataset 10 times independently, with 10-fold cross-validation. The mean and standard deviation of the obtained accuracy, sensitivity, specificity and area under the receiver operating characteristic curve (AUC), together with the performance of seven state-of-the-art methods on this dataset, are given in Table 1. The table shows that our algorithm not only outperformed hand-crafted-feature-based traditional methods but also substantially improved upon Xie et al.'s method [11]. Our results indicate that the pre-trained and fine-tuned ResNet-50 model can effectively transfer the image representation ability learned on the ImageNet dataset to characterizing the OA, HVV and HS of lung nodules, and that an adaptive ensemble of these three models has a superior ability to differentiate malignant from benign lung nodules.

Table 1. Performance of eight lung nodule classification methods on the LIDC-IDRI dataset

Algorithms                             Accuracy (%)     Sensitivity (%)   Specificity (%)   AUC
Shen et al. 2017 [7]                   87.14            77.00             93.00             0.9300
Dhara et al. 2016 [6]                  –                89.73             86.36             0.9505
Han et al. 2015 [5]                    –                89.35             86.02             0.9405
Anand 2010 [15]                        86.3             89.6              86.7              –
Hua et al. 2015 [13]                   –                73.4              82.2              –
Han et al. 2013 [14]                   –                –                 –                 0.9441
Xie et al. 2016 [11]                   86.79            60.26             95.42             –
Proposed (mean ± standard deviation)   93.40 ± 0.01     91.43 ± 0.02      94.09 ± 0.02      0.9778 ± 0.0001


5 Discussion

5.1 Data Augmentation

The number of training samples generated by data augmentation plays an important role when applying a deep model to small-sample learning problems. On one hand, training a deep model requires as much data as possible; on the other hand, more data always leads to a higher time cost. We re-performed the experiment using different numbers of augmented images and list the obtained performance in Table 2.

Table 2. Performance of the proposed algorithm with different numbers of augmented images per training sample

Augmented data per image   Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC      Time for training
0                          89.84          83.85             93.43             0.9451   3 h
2                          92.24          88.74             93.99             0.9724   7 h
4                          93.40          91.43             94.09             0.9778   12 h
6                          93.66          91.65             94.90             0.9788   17.5 h
8                          93.73          91.90             94.90             0.9794   24 h

Table 2 reveals that using four augmented images for each training sample achieves a good trade-off between accuracy and time cost, since further increasing the number of augmented images only improves the accuracy slightly but costs much more training time. Meanwhile, it should be noted that our algorithm, even without data augmentation, achieves an accuracy of 89.84%, which is still superior to the accuracies of the methods listed in Table 1.

5.2 Ensemble Learning

To demonstrate the performance improvement that results from the adaptive ensemble of three ResNet-50 models, we compared the performance of our algorithm to that of its three component models, which characterize lung nodules from the perspectives of OA, HVV and HS, respectively. As shown in Table 3, although each ResNet-50 model achieves a relatively good performance, an adaptive ensemble of them brings a further performance gain.

Table 3. Performance of each component ResNet-50 model and the proposed ensemble model

Models               Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
ResNet-50 for HS     91.65          88.35             93.34             0.9685
ResNet-50 for HVV    91.66          88.89             93.32             0.9736
ResNet-50 for OA     91.73          89.07             93.28             0.9740
Proposed TMME        93.40          91.43             94.09             0.9778


5.3 Other Pre-trained DCNN Models

Besides ResNet-50, GoogLeNet [18] and VGGNet [19] are two of the most successful DCNN models. Using each of these three models to characterize lung nodules from each of the three perspectives, i.e. OA, HVV and HS, yields 27 different configurations. To evaluate the performance of using other DCNN models, we tested all 27 configurations and report the accuracy and AUC of the top five in Table 4. The results show that ResNet-50 is very powerful and that using three ResNet-50 models yields the highest accuracy and AUC. Nevertheless, they also suggest that GoogLeNet and VGGNet are good choices as well, and replacing ResNet-50 with them may produce very similar accuracy in some configurations.

Table 4. Performance of the top five out of 27 ensemble models

DCNN for HS   DCNN for HVV   DCNN for OA   Accuracy (%)   AUC
ResNet-50     ResNet-50      ResNet-50     93.40          0.9778
ResNet-50     GoogLeNet      ResNet-50     93.30          0.9760
VGGNet        ResNet-50      ResNet-50     93.28          0.9767
GoogLeNet     ResNet-50      ResNet-50     93.21          0.9759
ResNet-50     ResNet-50      GoogLeNet     93.14          0.9765

5.4 Hybrid Ensemble of 27 TMME Models

Using all possible combinations of VGGNet, GoogLeNet and ResNet-50 to characterize the OA, HVV and HS of lung nodules, we obtain a total of 27 TMME models, which can be further combined using an adaptive weighting scheme learned in the same way. Table 5 shows that the ensemble of 27 TMME models only slightly improves the classification accuracy, at the cost of a major increase in the computational complexity of training the model.

Table 5. Performance of TMME and the ensemble of 27 TMME models

Models    Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC      Runtime
TMME      93.40          91.43             94.09             0.9778   12 h
27 TMME   94.04          92.04             94.92             0.9793   5 days

5.5 Computational Complexity

In our experiments, it took about 12 h to train the proposed model and less than 0.5 s to classify each lung nodule (Intel Xeon E5-2678 V3 2.50 GHz × 2, NVIDIA Tesla K40c GPU × 2, 128 GB RAM, 120 GB SSD and Matlab 2016). This suggests that the proposed algorithm, though computationally expensive during the training process, which can be performed offline, is very efficient for online testing and could be used in a routine clinical workflow.


6 Conclusion

We propose the TMME algorithm for benign-malignant lung nodule classification on chest CT. We used three pre-trained and fine-tuned ResNet-50 models to characterize the OA, HVV and HS of lung nodules, and combined these models using an adaptive weighting scheme learned during the back-propagation process. Our results on the benchmark LIDC-IDRI dataset suggest that our algorithm produces more accurate results than seven state-of-the-art approaches.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China under Grant 61471297, in part by the Seed Foundation of Innovation and Creation for Graduate Students in Northwestern Polytechnical University under Grant Z2017041, and in part by Australian Research Council (ARC) grants. We acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free publicly available LIDC/IDRI database used in this work.

References
1. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics. CA Cancer J. Clin. 65(1), 5–29 (2015)
2. Bach, P.B., Mirkin, J.N., Oliver, T.K., Azzoli, C.G., Berry, D.A., Brawley, O.W., Byers, T., Colditz, G.A., Gould, M.K., Jett, J.R.: Benefits and harms of CT screening for lung cancer: a systematic review. JAMA 307, 2418–2429 (2012)
3. Abraham, J.: Reduced lung-cancer mortality with low-dose computed tomographic screening. New Engl. J. Med. 365, 395–409 (2011)
4. American Thoracic Society: What is a lung nodule? Am. J. Respir. Crit. Care Med. 193, 11–12 (2016)
5. Han, F., Wang, H., Zhang, G., Han, H., Song, B., Li, L., Moore, W., Lu, H., Zhao, H., Liang, Z.: Texture feature analysis for computer-aided diagnosis on pulmonary nodules. J. Digit. Imaging 28(1), 99–115 (2015)
6. Dhara, A.K., Mukhopadhyay, S., Dutta, A., Garg, M., Khandelwal, N.: A combination of shape and texture features for classification of pulmonary nodules in lung CT images. J. Digit. Imaging 29(4), 466–475 (2016)
7. Shen, W., Zhou, M., Yang, F., Yu, D., Dong, D., Yang, C., Zang, Y., Tian, J.: Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recogn. 61, 663–673 (2017)
8. Hu, J., Lu, J., Tan, Y.P.: Deep transfer metric learning. In: CVPR 2015, pp. 325–333. IEEE Press, New York (2015)
9. Metz, S., Ganter, C., Lorenzen, S., Marwick, S.V., Holzapfel, K., Herrmann, K., Rummeny, E.J., Wester, H.J., Schwaiger, M., Nekolla, S.G.: Multiparametric MR and PET imaging of intratumoral biological heterogeneity in patients with metastatic lung cancer using voxel-by-voxel analysis. PLoS ONE 10(7), e0132386 (2014)
10. Chen, S., Qin, J., Ji, X., Lei, B., Wang, T., Ni, D., Cheng, J.Z.: Automatic scoring of multiple semantic attributes with multi-task feature leverage: a study on pulmonary nodules in CT images. IEEE Trans. Med. Imaging 99, 1 (2016)


11. Xie, Y., Zhang, J., Liu, S., Cai, W., Xia, Y.: Lung nodule classification by jointly using visual descriptors and deep features. In: Müller, H., et al. (eds.) MCV 2016, BAMBI 2016. LNCS, vol. 10081. Springer, Cham (2017). doi:10.1007/978-3-319-61188-4_11
12. Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38, 915–931 (2011)
13. Hua, K.L., Hsu, C.H., Hidayati, S.C., Cheng, W.H., Chen, Y.J.: Computer-aided classification of lung nodules on computed tomography images via deep learning technique. Onco Targets Ther. 8, 2015–2022 (2015)
14. Han, F., Zhang, G., Wang, H., Song, B.: A texture feature analysis for diagnosis of pulmonary nodules using LIDC-IDRI database. In: ICMIPE 2013, pp. 14–18. IEEE Press (2013)
15. Anand, S.K.V.: Segmentation coupled textural feature classification for lung tumor prediction. In: 2010 IEEE International Conference on Communication Control and Computing Technologies, pp. 518–524. IEEE Press, New York (2010)
16. Lampert, T.A., Stumpf, A., Gançarski, P.: An empirical study into annotator agreement, ground truth estimation, and algorithm evaluation. IEEE Trans. Image Process. 25(6), 2557–2572 (2016)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR 2016, pp. 770–778. IEEE Press, New York (2016)
18. Szegedy, C., Liu, W., Jia, Y., Sermanet, P.: Going deeper with convolutions. In: CVPR 2015, pp. 1–9. IEEE Press, New York (2015)
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

Deep Reinforcement Learning for Active Breast Lesion Detection from DCE-MRI

Gabriel Maicas¹, Gustavo Carneiro¹, Andrew P. Bradley², Jacinto C. Nascimento³, and Ian Reid¹

¹ School of Computer Science, ACVT, The University of Adelaide, Adelaide, Australia
  [email protected]
² School of ITEE, The University of Queensland, Brisbane, Australia
³ Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal

Abstract. We present a novel methodology for the automated detection of breast lesions from dynamic contrast-enhanced magnetic resonance volumes (DCE-MRI). Our method, based on deep reinforcement learning, significantly reduces the inference time for lesion detection compared to an exhaustive search, while retaining state-of-the-art accuracy. This speed-up is achieved via an attention mechanism that progressively focuses the search for a lesion (or lesions) on the appropriate region(s) of the input volume. The attention mechanism is implemented by training an artificial agent to learn a search policy, which is then exploited during inference. Specifically, we extend the deep Q-network approach, previously demonstrated on simpler problems such as anatomical landmark detection, in order to detect lesions that have a significant variation in shape, appearance, location and size. We demonstrate our results on a dataset containing 117 DCE-MRI volumes, validating runtime and accuracy of lesion detection.

Keywords: Deep Q-learning · Q-net · Reinforcement learning · Breast lesion detection · Magnetic resonance imaging

1 Introduction

Breast cancer is amongst the most commonly diagnosed cancers in women [1,2]. Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) represents one of the most effective imaging techniques for monitoring younger, high-risk women, who typically have dense breasts that show poor contrast in mammography [3]. DCE-MRI is also useful during surgical planning once a suspicious lesion is found on a mammogram [3]. The first stage in the analysis of these 4D (3D over time) DCE-MRI volumes consists of the localisation of breast lesions.

Supported by the Australian Research Council through grants DP140102794, CE140100016 and FL130100102.
© Springer International Publishing AG 2017
M. Descoteaux et al. (Eds.): MICCAI 2017, Part III, LNCS 10435, pp. 665–673, 2017. DOI: 10.1007/978-3-319-66179-7_76


Fig. 1. Example of the detection process of breast lesions from DCE-MRI with the DQN. Depth transformations are not shown for simplicity.

This is a challenging task given the high dimensionality of the data (4 volumes, each containing 512 × 512 × 128 voxels), the low signal-to-noise ratio of the dynamic sequence and the variable size and shape of breast lesions (see Fig. 1). Therefore, a computer-aided detection (CAD) system that automatically localises breast lesions in DCE-MRI data would be a useful tool to assist radiologists. However, the high dimensionality of DCE-MRI requires computationally efficient lesion detection methods in order to be viable for practical use.

Current approaches to lesion detection in DCE-MRI rely on extracting hand-crafted features [4,5] and exhaustive search mechanisms [4–6] in order to handle the variability in lesion appearance, shape, location and size. These methods are both computationally complex and potentially sub-optimal, resulting in false alarms and missed detections. Similar issues in the detection of visual objects have motivated the computer vision community to develop efficient detectors [7,8], like the Faster R-CNN [7]. However, these models need large annotated training sets, making their application in medical image analysis (MIA) challenging [9]. Alternatively, Caicedo and Lazebnik [8] have recently proposed the use of a deep Q-network (DQN) [10] for efficient object detection, which allows us to deal with the limited amount of data. Its adaptation to MIA applications has to overcome two additional obstacles: (1) the extension from visual object classes (e.g., animals, cars, etc.) to objects in medical images, such as tumours, which tend to have weaker consistency in terms of shape, appearance, location, background and size; and (2) the high dimensionality of medical images, which presents practical challenges with respect to the DQN training process [10]. Ghesu et al. [11] have recently adapted the DQN [10] to anatomical landmark detection, but did not address the obstacles mentioned above because the visual classes used in their work have consistent patterns and are extracted from fixed small-size regions of the medical images.

Here, we introduce a novel algorithm for breast lesion detection from DCE-MRI inspired by a previously proposed DQN [8,10]. Our main goal is the reduction of run-time complexity without a reduction in detection accuracy. The proposed approach comprises an artificial agent that automatically learns a policy describing how to iteratively modify the focus of attention (via translation and scale) from an initial large bounding box to a smaller bounding box containing a lesion, if one exists (see Fig. 2). To this end, the agent constructs a deep learning feature representation of the current bounding box, which is used by the DQN to decide on the next action, i.e., either to translate or scale the current bounding box or to trigger the end of the search process. Our methodology is


the first DQN [10] that can detect such visually challenging objects. In addition, unlike [11], which uses a fixed small-size bounding box, our DQN utilises a variable-size bounding box. We evaluate our methodology on a dataset of 117 patients (58 for training and 59 for testing). Results show that our methodology achieves detection accuracy similar to the state of the art [5,6], but with significantly reduced run times.

2 Literature Review

Automated approaches for breast lesion detection from DCE-MRI are typically based on exhaustive search methods and hand-designed features [4,5,12,13]. Vignati et al. [12] proposed a method that thresholds an intensity-normalised DCE-MRI to detect voxel candidates that are merged to form lesion candidates, from which hand-designed region and kinetic features are used in the classification process. As shown in Table 1, this method has low accuracy, which can be explained by the fact that it makes strong assumptions about the role of DCE-MRI intensity and does not utilise texture, shape, location and size features. Renz et al. [13] extended Vignati et al.'s work [12] with additional hand-designed morphological and dynamical features, showing more competitive results (see Table 1). Further improvements were obtained by Gubern-Mérida et al. [4], with the addition of hand-designed shape and appearance features, as shown in Table 1. The run-time complexity of the approaches above can be summarised by the mean running time (per volume) reported by Vignati et al. [12] in Table 1, which is likely the most efficient of these three approaches [4,12,13]. McClymont et al. [5] extended the methods above with unsupervised voxel clustering for the initial detection of lesion candidates, followed by a structured output learning approach that detects and segments lesions simultaneously. This approach significantly improves the detection accuracy, but at a substantial increase in computational cost (see Table 1). The multi-scale deep learning cascade approach [6] reduced the run-time complexity, allowed the extraction of optimal and efficient features, and had a competitive detection accuracy, as shown in Table 1.

Table 1. Summary of results from previous approaches.

Method                      Evaluation criteria                      Time
Vignati et al. [12]         0.89 TPR @ 12.00 FPI                     7.00 min
Renz et al. [13]            0.96 sensitivity @ 0.75 specificity      –
Gubern-Mérida et al. [4]    0.89 TPR @ 4.00 FPI                      –
McClymont et al. [5]        1.00 TPR @ 4.50 FPI                      O(60) min
Maicas et al. [6]           0.80 TPR @ 2.80 FPI                      2.74 min

There are two important issues regarding previously proposed approaches: the absence of a common dataset to evaluate different methodologies and the


lack of a consistent lesion detection criterion. Whereas detections in [12,13] were visually inspected by a radiologist, [4,5] considered a lesion detected if a (single) voxel in the ground truth was detected. In [6], a more precise criterion (a minimum Dice coefficient of 0.2 between the ground truth and the candidate bounding box) was used; in our experiments, we adopt this Dice ≥ 0.2 criterion and use the same dataset as previous studies [5,6].

3 Methodology

In this section, we first define the dataset and then describe the training and inference stages of our proposed methodology, shown in Fig. 2.

Fig. 2. Block diagram of the proposed detection system.

3.1 Dataset

The data is represented by a set of 3D breast scans $\mathcal{D} = \big\{ \big(\mathbf{x}, \mathbf{t}, \{\mathbf{s}^{(j)}\}_{j=1}^{M_i}\big)_i \big\}_{i=1}^{|\mathcal{D}|}$, where $\mathbf{x}, \mathbf{t}: \Omega \to \mathbb{R}$ denote the first DCE-MRI subtraction volume and the T1-weighted anatomical volume, respectively, with $\Omega \in \mathbb{R}^3$ representing the volume lattice of size $w \times h \times d$; $\mathbf{s}^{(j)}: \Omega \to \{0,1\}$ represents the annotation for the $j$-th lesion present, with $\mathbf{s}^{(j)}(\omega) = 1$ indicating the presence of a lesion at voxel $\omega \in \Omega$. The entire dataset is patient-wise split such that the mutually exclusive training and testing datasets are represented by $\mathcal{T}, \mathcal{U} \subset \mathcal{D}$, where $\mathcal{T} \cup \mathcal{U} = \mathcal{D}$.

3.2 Training

The proposed DQN [10] model is trained via interactions with the DCE-MRI dataset through a sequence of observations, actions and rewards. Each observation is represented by $o = f(\mathbf{x}(\mathbf{b}))$, where $\mathbf{b} = [b_x, b_y, b_z, b_w, b_h, b_d] \in \mathbb{R}^6$ (with $b_x, b_y, b_z$ representing the top-left-front corner and $b_w, b_h, b_d$ the lower-right-back corner of the bounding box) indexes the input DCE-MRI data $\mathbf{x}$, and $f(.)$ denotes a deep residual network (ResNet) [14,15], defined below. Each action is denoted by $a \in \mathcal{A} = \{ l_x^+, l_x^-, l_y^+, l_y^-, l_z^+, l_z^-, s^+, s^-, w \}$, where $l$, $s$, $w$ represent translation, scale and trigger actions, with the subscripts $x, y, z$ denoting the


horizontal, vertical or depth translation, and the superscripts $+,-$ meaning positive or negative translation and up or down scaling. The reward when the agent chooses the trigger action $a = w$ to move from $o$ to $o'$ is defined by:

$$r(o, a, o') := \begin{cases} +\eta, & \text{if } d(o', \mathbf{s}) \ge \tau_w \\ -\eta, & \text{otherwise} \end{cases} \qquad (1)$$

where $d(.)$ is the Dice coefficient between the map formed by the bounding box defining $o'$ and the segmentation map $\mathbf{s}$, $\eta = 10$ and $\tau_w = 0.2$ (these values have been defined empirically; for instance, we found that increasing $\eta$ from the 3.0 used in [8] to 10.0 helped triggering when a lesion is found). For the remaining actions in $\mathcal{A} \setminus \{w\}$, the rewards are defined by:

$$r(o, a, o') := \mathrm{sign}\big( d(o', \mathbf{s}) - d(o, \mathbf{s}) \big). \qquad (2)$$

The training process models a DQN that maximises cumulative future rewards with the approximation of the following action-value function: $Q^*(o, a) = \max_{\pi} \mathbb{E}[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots \mid o_t = o, a_t = a, \pi ]$, where $r_t$ denotes the reward at time step $t$, $\gamma$ represents a discount factor per time step, and $\pi$ is the behaviour policy. This action-value function is modelled by a DQN $Q(o, a; \theta)$, where $\theta$ denotes the network weights. The training of $Q(o, a; \theta)$ is based on an experience replay memory and the target network [10]. Experience replay uses a dataset $E_t = \{e_1, \ldots, e_t\}$ built from the agent's experiences $e_t = (o_t, a_t, r_t, o_{t+1})$, and the target network with parameters $\theta_i^-$ computes the target values for the DQN updates, where the values $\theta_i^-$ are held fixed and updated periodically. The loss function for modelling $Q(o, a; \theta)$ minimises the mean-squared error of the Bellman equation, as in:

$$L_i(\theta_i) = \mathbb{E}_{(o,a,r,o') \sim U(E)} \Big[ \big( r + \gamma \max_{a'} Q(o', a'; \theta_i^-) - Q(o, a; \theta_i) \big)^2 \Big]. \qquad (3)$$

In the training process, we follow an ε-greedy strategy to balance exploration and exploitation: with probability ε, the agent explores, and with probability 1 − ε, it follows the current policy π (exploitation) at training time step t. At the beginning of training, we set ε = 1 (i.e., pure exploration) and decrease ε as the training progresses (i.e., increase exploitation). Furthermore, we follow a modified guided exploration: with probability κ, the agent selects a random action and, with probability 1 − κ, it selects an action that produces a positive reward. This modifies the guided exploration in [8] by adding randomness to the process, aiming to improve generalisation. Finally, the ResNet [14,15], which produces the observation o = f(x(b)), is trained to decide whether a random bounding box b contains a lesion. A training sample is labelled as positive if d(o, s^(j)) ≥ τ_w, and negative otherwise. It is important to notice that this way of labelling random training samples can provide a large and balanced training set, extracted at several locations and scales, which is essential to train the large-capacity ResNet [14,15]. In addition, this way of representing the bounding box means that we are able to process varying-size input bounding boxes, which is an advantage compared to [11].
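The sketch below illustrates the reward of Eqs. (1)-(2) and the modified guided ε-greedy rule described above. It is a simplified Python sketch, not the authors' code: the Dice values are assumed to be computed elsewhere between the candidate box and the lesion annotation, `q_net_values` is a list of nine Q-values, and the constants follow the values quoted in the text (η = 10, τ_w = 0.2, κ = 0.5).

```python
import random
import numpy as np

ETA, TAU_W = 10.0, 0.2
ACTIONS = ['lx+', 'lx-', 'ly+', 'ly-', 'lz+', 'lz-', 's+', 's-', 'w']  # 6 translations, 2 scales, trigger

def reward(action, dice_prev, dice_next):
    """Eqs. (1)-(2): +/- eta for the trigger, otherwise the sign of the Dice change."""
    if action == 'w':
        return ETA if dice_next >= TAU_W else -ETA
    return float(np.sign(dice_next - dice_prev))

def pick_training_action(q_net_values, positive_actions, eps, kappa=0.5):
    """Modified guided epsilon-greedy:
    with prob. eps explore, otherwise exploit the current policy;
    while exploring, take a purely random action with prob. kappa and
    an action known to yield a positive reward with prob. 1 - kappa."""
    if random.random() < eps:                        # exploration
        if random.random() < kappa or not positive_actions:
            return random.choice(ACTIONS)
        return random.choice(positive_actions)       # guided exploration
    return ACTIONS[int(np.argmax(q_net_values))]     # exploitation
```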


3.3 Inference

The trained DQN model is parameterised by $\theta^*$ learned in (3) and is defined by a multi-layer perceptron [8] that outputs the action-value function for the observation $o$. The action to follow from the current observation is defined by:

$$a^* = \arg\max_{a} Q(o, a; \theta^*). \qquad (4)$$

Finally, given that the number and location of lesions are unknown in a test DCE-MRI, inference is initialised with different bounding boxes at several locations, and it runs until it either finds a lesion (with the selection of the trigger action) or reaches a maximum of 20 steps.
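A minimal sketch of this inference loop follows: from each initial bounding box the agent greedily applies the action in Eq. (4) until the trigger fires or 20 steps elapse. The helper names `observe`, `apply_action` and the trigger index are assumptions for illustration, not part of the original method description.

```python
TRIGGER = 8  # index of the trigger action in the 9-action space (assumed ordering)

def detect_lesions(volume, q_net, observe, apply_action, initial_boxes, max_steps=20):
    """Greedy roll-out of the learned policy (Eq. 4) from several initial boxes.
    observe(volume, box)      -> observation o = f(x(b)) as a tensor
    apply_action(box, action) -> updated bounding box
    Returns the list of boxes on which the trigger action fired."""
    detections = []
    for box in initial_boxes:
        for _ in range(max_steps):
            o = observe(volume, box)
            action = int(q_net(o).argmax())       # a* = argmax_a Q(o, a; theta*)
            if action == TRIGGER:                 # trigger: the box contains a lesion
                detections.append(box)
                break
            box = apply_action(box, action)       # translate or scale the box
    return detections
```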

4 Experiments

The database used to assess our proposed methodology contains DCE-MRI and T1-weighted anatomical datasets from 117 patients [5]. For the DCE-MRI, the first volume was acquired before contrast agent injection (pre-contrast), and the remaining volumes were acquired after contrast agent injection. Here we use only one volume, represented by the first subtraction from DCE-MRI: the first post-contrast volume minus the pre-contrast volume. The T1-weighted anatomical volume is used only to extract the breast region from the initial volume [5] as a pre-processing stage. The training set contains 58 patients annotated with 72 lesions, and the testing set has 59 patients and 69 lesions, to allow a fair comparison with [6].

The detection accuracy is assessed by the proportion of true positives (TPR) detected in the training and testing sets as a function of the number of false positives per image (FPI), where a candidate lesion is assumed to be a true positive if the Dice coefficient between the candidate lesion bounding box and the ground truth annotation bounding box is at least 0.2 [16]. We also measure the running time of the detection process using the following computer: an Intel Core i7 CPU with 12 GB of RAM and an Nvidia Titan X GPU with 12 GB of memory.

The pre-processing stage of our methodology consists of extracting each breast region (from the T1-weighted volume) [5] and separating each breast into a resized volume of 100 × 100 × 50 voxels. For training, we select breast region volumes that contain at least one lesion; if a breast volume has more than one lesion, one of them is randomly selected to train the agent. For testing, a breast may contain none, one or multiple lesions. The observation o used by the DQN is produced by a ResNet [14] containing five residual blocks. The input to the ResNet is fixed at 100 × 100 × 50 voxels. We extract 16 K patches (8 K positives and 8 K negatives) from the training set to train the ResNet to classify a bounding box as positive or negative for a lesion, where a bounding box is labelled as positive if the Dice coefficient between the lesion candidate and the ground truth annotation is at least 0.6. This ResNet provides a fixed-size representation of o of size 2304 (extracted before the last convolutional layer).


The DQN is represented by a multilayer perceptron with two layers, each containing 512 nodes, that outputs nine actions: six translations (by one third of the size of the corresponding dimension), two scales (by one sixth in all dimensions) and a trigger (see Sect. 3.2). For training this DQN, the agent starts an episode with a centred bounding box occupying 75% of the breast region volume. The experience replay memory E contains 10 K experiences, from which mini-batches of 100 samples are drawn to minimise the loss (3). The DQN is trained with Adam, using a learning rate of 1 × 10⁻⁶, and the target network is updated after running one episode per volume of the training set. For the ε-greedy strategy (Sect. 3.2), ε decreases linearly from 1 to 0.1 over 300 epochs, and during exploration, the balance between random exploration and modified guided exploration is given by κ = 0.5. During inference, the agent follows the policy in (4): for every breast region volume, it starts at a centred bounding box that covers 75% of the volume, then at each of the eight non-overlapping (50, 50, 25) bounding boxes corresponding to the corners, and finally at another four (50, 50, 25) bounding boxes centred at the intersections of the previous bounding boxes. The agent is allowed a maximum of 20 steps to trigger; otherwise, no lesion is detected.
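For concreteness, a PyTorch sketch of this Q-network head and its optimiser is given below, mirroring the sizes quoted above (2304-dim observation, two 512-unit layers, nine actions, Adam with learning rate 1 × 10⁻⁶); the choice of ReLU activations and the target-network copy via deepcopy are our own assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn

# Q-network head: two 512-unit layers on the 2304-dim ResNet observation,
# outputting Q-values for the nine actions (6 translations, 2 scales, trigger).
q_net = nn.Sequential(
    nn.Linear(2304, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 9),
)
target_net = copy.deepcopy(q_net)          # periodically synchronised target network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-6)

o = torch.randn(100, 2304)                 # a mini-batch of 100 observations
q_values = q_net(o)                        # (100, 9) action values
```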

4.1 Results

We compare the training and testing results of our proposed DQN with the multi-scale cascade [6] and the structured output approach [5] in the table in Fig. 3. In addition, we show the Free Response Operating Characteristic (FROC) curve in Fig. 3, comparing our approach (using a varying number of initialisations that lead to different TPR and FPI values) with the multi-scale cascade [6]. Finally, we show examples of the detections produced by our method in Fig. 4.

Fig. 3. FROC curve showing TPR vs. FPI and run times for the DQN (with 1, 9 and 13 initialisations) and the multi-scale cascade [6] (left); TPR, FPI and mean inference time per case (i.e., per patient) for each method (right). Note that the run time for Ms-C is constant over the FPI range.

Method        Test TPR   FPI    Mean inference time
DQN (Ours)    0.80       3.2    92 ± 21 s
Ms-C [6]      0.80       2.8    164 ± 137 s
SL [5]        1.00       4.50   O(60) min

We use a paired t-test to estimate the significance of the difference in inference times between our approach and the multi-scale cascade [6], giving p ≤ 9 × 10⁻⁵.


Fig. 4. Examples of detected breast lesions. Cyan boxes indicate the ground truth, red boxes indicate detections produced by our proposed method, and yellow boxes indicate false positive detections.

5 Discussion and Conclusion

We have presented a DQN method for lesion detection from DCE-MRI that shows accuracy similar to state-of-the-art approaches, but with significantly reduced detection times. Given that we did not attempt any code optimisation, we believe that the run times have the potential for further improvement. For example, inference uses several initialisations (up to 13), which could be run in parallel as they are independent, decreasing detection time by a factor of 10. The main bottleneck of our approach is the volume resizing stage that transforms the current bounding box to fit the ResNet input, currently representing 90% of the inference time. A limitation of this work is that we do not have an action to change the aspect ratio of the bounding box, which might improve the detection of small elongated lesions. Finally, during training, we noted that the most important parameter for achieving good generalisation is the balance between exploration and exploitation. We observed that the best generalisation was achieved when ε = 0.5 (i.e., half of the actions correspond to exploration and half to exploitation of the current policy).

Future research will improve run-time performance via learning smarter search strategies. For instance, we would like to avoid revisiting regions that have already been determined to be free from lesions with high probability. At present we rely on the training data to discourage such moves, but there may be more explicit constraints to explore. We would like to acknowledge NVIDIA for providing the GPU used in this work.

References
1. Smith, R.A., Andrews, K., Brooks, D., et al.: Cancer screening in the United States, 2016: a review of current American Cancer Society guidelines and current issues in cancer screening. CA Cancer J. Clin. 66, 96–114 (2016)
2. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics, 2016. CA Cancer J. Clin. 66(1), 7–30 (2016)
3. Siu, A.L.: Screening for breast cancer: US Preventive Services Task Force recommendation statement. Ann. Intern. Med. 164, 279–296 (2016)
4. Gubern-Mérida, A., Martí, R., Melendez, J., et al.: Automated localization of breast cancer in DCE-MRI. Med. Image Anal. 20(1), 265–274 (2015)
5. McClymont, D., Mehnert, A., Trakic, A., et al.: Fully automatic lesion segmentation in breast MRI using mean-shift and graph-cuts on a region adjacency graph. JMRI 39(4), 795–804 (2014)


6. Maicas, G., Carneiro, G., Bradley, A.P.: Globally optimal breast mass segmentation from DCE-MRI using deep semantic segmentation as shape prior. In: 14th International Symposium on Biomedical Imaging (ISBI), pp. 305–309. IEEE (2017)
7. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015)
8. Caicedo, J.C., Lazebnik, S.: Active object localization with deep reinforcement learning. In: CVPR, pp. 2488–2496 (2015)
9. Akselrod-Ballin, A., Karlinsky, L., Alpert, S., Hasoul, S., Ben-Ari, R., Barkan, E.: A region based convolutional network for tumor detection and classification in breast mammography. In: Carneiro, G., et al. (eds.) LABELS/DLMIA 2016. LNCS, vol. 10008, pp. 197–205. Springer, Cham (2016). doi:10.1007/978-3-319-46976-8_21
10. Mnih, V., Kavukcuoglu, K., Silver, D., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
11. Ghesu, F.C., Georgescu, B., Mansi, T., Neumann, D., Hornegger, J., Comaniciu, D.: An artificial agent for anatomical landmark detection in medical images. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9902, pp. 229–237. Springer, Cham (2016). doi:10.1007/978-3-319-46726-9_27
12. Vignati, A., Giannini, V., De Luca, M., et al.: Performance of a fully automatic lesion detection system for breast DCE-MRI. JMRI 34(6), 1341–1351 (2011)
13. Renz, D.M., Böttcher, J., Diekmann, F., et al.: Detection and classification of contrast-enhancing masses by a fully automatic computer-assisted diagnosis system for breast MRI. JMRI 35(5), 1077–1088 (2012)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
15. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). doi:10.1007/978-3-319-46493-0_39
16. Dhungel, N., Carneiro, G., Bradley, A.P.: Automated mass detection in mammograms using cascaded deep learning and random forests. In: DICTA. IEEE (2015)

Pancreas Segmentation in MRI Using Graph-Based Decision Fusion on Convolutional Neural Networks

Jinzheng Cai¹, Le Lu³, Yuanpu Xie¹, Fuyong Xing², and Lin Yang¹,²

¹ Department of Biomedical Engineering, University of Florida, Gainesville, FL 32611, USA
  [email protected]
² Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
³ Department of Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD 20892, USA

Abstract. Deep neural networks have demonstrated very promising performance on accurate segmentation of challenging organs (e.g., pancreas) in abdominal CT and MRI scans. The current deep learning approaches conduct pancreas segmentation by processing sequences of 2D image slices independently through deep, dense per-pixel masking for each image, without explicitly enforcing spatial consistency constraints on the segmentation of successive slices. We propose a new convolutional/recurrent neural network architecture to address the contextual learning and segmentation consistency problem. A deep convolutional sub-network is first designed and pre-trained from scratch. The output layer of this network module is then connected to recurrent layers and can be fine-tuned for contextual learning, in an end-to-end manner. Our recurrent sub-network is a type of long short-term memory (LSTM) network that performs segmentation on an image by integrating its neighboring slice segmentation predictions, in the form of dependent sequence processing. Additionally, a novel segmentation-direct loss function (named Jaccard Loss) is proposed and deep networks are trained to optimize the Jaccard Index (JI) directly. Extensive experiments are conducted to validate our proposed deep models on quantitative pancreas segmentation using both CT and MRI scans. Our method outperforms the state-of-the-art work on CT [11] and MRI pancreas segmentation [1], respectively.

1 Introduction

© Springer International Publishing AG 2017
M. Descoteaux et al. (Eds.): MICCAI 2017, Part III, LNCS 10435, pp. 674–682, 2017. DOI: 10.1007/978-3-319-66179-7_77

Detecting unusual volume changes and monitoring abnormal growths in the pancreas using medical images is a critical yet challenging diagnosis task. This requires dissecting the pancreas from its surrounding tissues in radiology images (e.g., CT and MRI scans). Manual pancreas segmentation is laborious, tedious, and sometimes prone to inter-observer variability. One major group of related work


on automatic pancreas segmentation in CT images is based on multi-atlas registration and label fusion (MALF) [8,15,16] under the leave-one-patient-out evaluation protocol. Due to the highly deformable shape and vague boundaries of the pancreas in CT, their reported segmentation accuracies (measured with the Dice similarity coefficient, DSC) range from 69.6 ± 16.7% [16] to 75.1 ± 15.4% [8]. On the other hand, deep convolutional neural network (CNN) based pancreas segmentation work [1,3,10–12,18] has revealed promising results and steady performance improvements, e.g., from 71.8 ± 10.7% [10] and 78.0 ± 8.2% [11] to 81.3 ± 6.3% [12], evaluated on the same NIH 82-patient CT dataset (https://doi.org/10.7937/K9/TCIA.2016.TNB1KQBU). In comparison, deep CNN approaches appear to demonstrate noticeably higher segmentation accuracy and numerically more stable results (significantly lower standard deviation, or std) than their MALF counterparts. [11,12] are built upon the fully convolutional network (FCN) architecture [5] and its variant [17]. However, [11,12] are not completely end-to-end trained due to their segmentation post-processing steps. Consequently, the trained models may be suboptimal. For pancreas segmentation on a 79-patient MRI dataset, [1] achieves 76.1 ± 8.7% in DSC.

In this paper, we propose a new deep neural network architecture with recurrent neural contextual learning for improved pancreas segmentation. All previous work [1,11,18] performs deep 2D CNN segmentation on each CT or MRI slice independently¹, with no spatial smoothness consistency constraints enforced among successive slices. We first follow this protocol by training 2D slice-based CNN models for pancreas segmentation. Once this step of CNN training converges, inspired by sequence modeling for precipitation nowcasting in [13], a convolutional long short-term memory (CLSTM) network is further added to the output layer of the deep CNN to explicitly capture and constrain the contextual segmentation smoothness across neighboring image slices. The whole integrated CLSTM network can then be fine-tuned end-to-end via stochastic gradient descent (SGD) until convergence. The CLSTM module modifies the segmentation results produced by the CNN alone, by taking the initial CNN segmentation results of successive axial slices (in either the superior or inferior direction) into account. Therefore, the final segmented pancreas shape is constrained to be consistent among adjacent slices, offering a good trade-off between 2D and 3D deep segmentation models.

Next, we present a novel segmentation-direct loss function to train our CNN models by maximizing the Jaccard index between the annotated pancreas mask and the corresponding output segmentation mask. The standard practice in FCN image segmentation deep models [1,5,11,17] uses a loss function that sums the cross-entropy loss at each voxel or pixel. A segmentation-direct loss function can

¹ Organ segmentation in 3D CT and MRI scans can also be performed by directly taking cropped 3D sub-volumes as input [4,6,7]. Even at the expense of being computationally expensive and prone to overfitting, very high segmentation accuracy has not been reported for complexly shaped organs [6]. [2,14] use hybrid CNN-RNN architectures to process/segment sliced CT or MRI images in sequence.


Fig. 1. Network architecture: Left, the CBR block (CBR-B), which contains a convolutional layer (Conv-L), a batch normalization layer (BN-L) and a ReLU layer (ReLU-L); each Scale block (Scale-B) has several CBR blocks and is followed by a pooling layer. Right, the CLSTM for contextual learning: the segmentation outcome at slice τ is regularized by the results of slices τ − 3, τ − 2 and τ − 1. For example, contextual learning is activated in regions with × markers, where a sudden loss of pancreas area occurs in slice τ compared with consecutive slices.

avoid the data balancing issue during CNN training between the positive pancreas and negative background regions, as the pancreas normally occupies only a very small fraction of each slice. Furthermore, there is no need to calibrate the optimal probability threshold to achieve the best possible binary pancreas segmentation results from the FCN's probabilistic outputs [1,5,11,17]. Similar segmentation-metric loss functions based on DSC are concurrently proposed and investigated in [7,18].

We extensively and quantitatively evaluate our proposed deep convolutional LSTM neural network pancreas segmentation model and its ablated variants using both a CT dataset (82 patients) and an MRI dataset (79 patients), under 4-fold cross-validation (CV). Our complete model outperforms the previous state of the art [1,11] by about 4% in DSC. Although our contextual learning model is only tested on pancreas segmentation, the approach is directly generalizable to other three-dimensional organ segmentation tasks.

2 Method

Simplifying Deep CNN Architecture: We propose to train a deep CNN network from scratch and empirically observe that, for the specific application of pancreas segmentation in CT/MRI images, ImageNet pre-trained CNN models do not noticeably improve the performance. More importantly, we design our CNN network architecture specifically for pancreas segmentation, where a much smaller CNN model than the conventional models [5,17] is found to be most effective. This model reduces the chance of over-fitting (against small-sized medical image datasets) and speeds up both training and inference. Our specialized deep network architecture is trained from scratch using pancreas segmentation datasets, without being first pre-trained on ImageNet [5,17] and then fine-tuned. It also outperforms the ImageNet fine-tuned conventional CNN models [5,17] in our empirical evaluation.


First, a convolutional layer is followed by ReLU and batch normalization layers to form the basic unit of our customized network, namely the CBR block. Second, following the deep supervision principle proposed in [17], we stack several CBR blocks together with an auxiliary loss branch per block and denote this combination as a Scale block. Figure 1 shows an exemplar CBR block (CBR-B) and Scale block (Scale-B). Third, we use the CBR block and the Scale block as building blocks to construct our tailored deep network, with each Scale block followed by a pooling layer. The hyper-parameters, i.e., the numbers of feature maps in the convolutional layers, the number of CBR blocks in a Scale block, and the number of Scale blocks in the network, can be determined via a model selection process on a subset of the training dataset (i.e., a split reserved for model validation).
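A minimal PyTorch sketch of these two building blocks is given below (Conv → BN → ReLU, and a stack of CBR blocks followed by pooling). The per-block auxiliary deep-supervision loss branches are omitted, and kernel sizes and channel counts are placeholders rather than the values chosen by the authors.

```python
import torch.nn as nn

def cbr_block(in_ch, out_ch):
    """CBR block: convolution, batch normalization, ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def scale_block(in_ch, out_ch, n_cbr):
    """Scale block: several CBR blocks followed by a pooling layer
    (the auxiliary loss branches used for deep supervision are omitted here)."""
    layers = [cbr_block(in_ch, out_ch)]
    layers += [cbr_block(out_ch, out_ch) for _ in range(n_cbr - 1)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# e.g. five Scale blocks with four CBR blocks each, as selected by model selection
# in the experiments (channel width 64 is a placeholder assumption).
backbone = nn.Sequential(*[scale_block(1 if i == 0 else 64, 64, n_cbr=4) for i in range(5)])
```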

2.1 Contextual Regularization

From the above, we have designed a compact CNN architecture that can perform pancreas segmentation on individual 2D image slices. However, as shown in the first row of Fig. 2, the transitions among the resulting CNN pancreas segmentation regions in consecutive slices may not be smooth, often implying that segmentation failure occurs. Adjacent CT/MRI slices are expected to be correlated to each other, so segmentation results from successive slices need to be constrained for shape consistency. To achieve this, we concatenate a long short-term memory (LSTM) network, a compelling architecture for sequential data processing, to the 2D CNN model for contextual learning. That is, we slice any 3D CT (or MRI) volume into a 2D image sequence and learn the segmentation contextual constraints among neighboring image slices with the LSTM. A standard LSTM network requires vectorized input, which would sacrifice the spatial information encoded in the output of the CNN. We therefore utilize the convolutional LSTM (CLSTM) model [13] to preserve the 2D segmentation layout produced by the CNN. The second row of Fig. 2 illustrates the improvement brought by enforcing CLSTM-based segmentation contextual learning.
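As an illustration of this design choice, the sketch below shows a generic convolutional LSTM cell in PyTorch, where the gates of [13] are computed with convolutions so that the 2D layout of the CNN output maps is preserved while successive slices are processed as a sequence. This is a generic CLSTM cell under our own assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: all gates are computed with a single
    convolution over the concatenated input and hidden state maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g            # cell state carries context across slices
        h = o * torch.tanh(c)        # hidden map keeps the 2D segmentation layout
        return h, c

# Toy usage: regularize CNN output maps slice by slice along the axial direction.
cell = ConvLSTMCell(in_ch=2, hid_ch=8)
h = c = torch.zeros(1, 8, 64, 64)
for cnn_map in torch.rand(5, 1, 2, 64, 64):   # 5 consecutive slices of CNN outputs
    h, c = cell(cnn_map, (h, c))
```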

Fig. 2. NIH Case51: segmentation results with and without contextual learning are displayed in rows 1 and 2, respectively. Gold standard annotations are displayed in white, and automatic outputs are rendered in red.


2.2 Jaccard Loss

We propose a new Jaccard loss (JACLoss) for training neural network image segmentation models. Optimizing JI (a main segmentation metric) directly during network training makes the learning and inference procedures consistent and generates threshold-free segmentations. JACLoss is defined as follows:

$$L_{jac} = 1 - \frac{|Y_+ \cap \hat{Y}_+|}{|Y_+ \cup \hat{Y}_+|} = 1 - \frac{\sum_{j \in Y} (y_j \wedge \hat{y}_j)}{\sum_{j \in Y} (y_j \vee \hat{y}_j)} = 1 - \frac{\sum_{f \in Y_+} (1 \wedge \hat{y}_f)}{|Y_+| + \sum_{b \in Y_-} (0 \vee \hat{y}_b)} \qquad (1)$$

where $Y$ and $\hat{Y}$ represent the ground truth and the network predictions, respectively. $Y_+$ and $Y_-$ are defined as the foreground and background pixel sets, and $|Y_+|$ is the cardinality of $Y_+$; similar definitions also apply to $\hat{Y}$. $y_j$ and $\hat{y}_j \in \{0, 1\}$ are indexed pixel values in $Y$ and $\hat{Y}$. In practice, $\hat{y}_j$ is relaxed to a probability in the range $[0, 1]$, so that JACLoss can be approximated by

$$\tilde{L}_{jac} = 1 - \frac{\sum_{f \in Y_+} \min(1, \hat{y}_f)}{|Y_+| + \sum_{b \in Y_-} \max(0, \hat{y}_b)} = 1 - \frac{\sum_{f \in Y_+} \hat{y}_f}{|Y_+| + \sum_{b \in Y_-} \hat{y}_b} \qquad (2)$$

Obviously, $L_{jac}$ and $\tilde{L}_{jac}$ share the same optimal solution for $\hat{Y}$; with a slight abuse of notation, we use $L_{jac}$ to denote both. The model is updated by:

$$\frac{\partial L_{jac}}{\partial \hat{y}_j} = \begin{cases} -\dfrac{1}{|Y_+| + \sum_{b \in Y_-} \hat{y}_b}, & \text{for } j \in Y_+ \\[2ex] -\dfrac{\sum_{f \in Y_+} \hat{y}_f}{\big(|Y_+| + \sum_{b \in Y_-} \hat{y}_b\big)^2}, & \text{for } j \in Y_- \end{cases} \qquad (3)$$

Since the inequality $\sum_{f \in Y_+} \hat{y}_f < |Y_+| + \sum_{b \in Y_-} \hat{y}_b$ holds by definition, JACLoss assigns larger gradients to foreground pixels, which intrinsically balances the foreground and background classes. It empirically works better than the cross-entropy loss or the class-balanced cross-entropy loss [17] when segmenting small objects, such as the pancreas in CT/MRI images. Similar loss functions are independently proposed and utilized in [7,18].
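A short PyTorch sketch of the relaxed JACLoss in Eq. (2) follows: ŷ is the per-pixel foreground probability, the denominator is |Y+| plus the sum of background predictions, and automatic differentiation supplies the gradient of Eq. (3). The small eps term is our own addition for numerical stability and is not part of the formulation above.

```python
import torch

def jaccard_loss(y_true, y_prob, eps=1e-6):
    """Relaxed JACLoss (Eq. 2).
    y_true : binary ground-truth mask, shape (B, H, W)
    y_prob : predicted foreground probabilities in [0, 1], same shape"""
    y_true = y_true.float()
    fg = (y_true * y_prob).sum(dim=(1, 2))            # sum of y_hat over foreground pixels
    bg = ((1.0 - y_true) * y_prob).sum(dim=(1, 2))    # sum of y_hat over background pixels
    card_fg = y_true.sum(dim=(1, 2))                  # |Y+|
    return (1.0 - fg / (card_fg + bg + eps)).mean()
```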

3 Experimental Results and Analysis

Datasets: Two annotated pancreas datasets are utilized for the experiments. The first, NIH-CT-82 [10,11], is publicly available and contains 82 abdominal contrast-enhanced 3D CT scans. We obtained the second dataset, UFL-MRI-79, from [1], with 79 abdominal T1-weighted MRI scans acquired under a multiple controlled-breath protocol. For the sake of comparison, 4-fold cross-validation is conducted, similar to [1,10,11]. Unlike [11], no sophisticated post-processing is employed. We measure the quantitative segmentation results using the Dice similarity coefficient (DSC), $DSC = 2|Y_+ \cap \hat{Y}_+| / (|Y_+| + |\hat{Y}_+|)$, and the Jaccard index (JI), $JI = |Y_+ \cap \hat{Y}_+| / |Y_+ \cup \hat{Y}_+|$.
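For reference, a small NumPy helper computing these two metrics from binary masks is sketched below; it is illustrative only and assumes both masks are non-empty.

```python
import numpy as np

def dsc_ji(y_true, y_pred):
    """Dice similarity coefficient and Jaccard index between two binary masks."""
    y_true, y_pred = y_true.astype(bool), y_pred.astype(bool)
    inter = np.logical_and(y_true, y_pred).sum()
    dsc = 2.0 * inter / (y_true.sum() + y_pred.sum())
    ji = inter / np.logical_or(y_true, y_pred).sum()
    return dsc, ji
```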


Network Implementation: Hyper-parameters are determined via model selection inside the training dataset. The network containing five Scale blocks, with four CBR blocks in each Scale block, produces the best empirical performance while remaining compact in model size (