AI and Big Data in Cardiology: A Practical Guide (ISBN 3031050703, 9783031050701)

This book provides a detailed technical overview of the use and applications of artificial intelligence (AI), machine learning…


English · 219 [220] pages · 2023


Table of contents:
Preface
Contents
1 Introduction
Andrew King and Nicolas Duchateau
1.1 Aims and Motivation
1.2 What are AI and Machine Learning?
1.3 A Brief History
1.4 AI in Medicine
1.5 The Role of Big Data
1.6 Outlook
References
2 AI and Machine Learning: The Basics
Nicolas Duchateau, Esther Puyol-Antón, Bram Ruijsink and Andrew King
2.1 Introduction
2.2 Defining the Problem
2.3 Types of Model
2.4 Model Design
2.5 Model Validation
2.6 Machine Learning Is Not a Panacea!
2.7 Sources of Data for Machine Learning in Cardiology
2.8 Imaging Sources
Echocardiography
2.9 Closing Remarks
2.10 Exercises
2.11 Tutorial—Introduction to Python and Jupyter Notebooks
References
3 From Machine Learning to Deep Learning
Pierre-Marc Jodoin, Nicolas Duchateau and Christian Desrosiers
3.1 Introduction
3.2 Machine Learning and Neural Networks
3.3 K-Class Prediction
3.4 Handling Non-linearly Separable Data
3.5 Convolutional Neural Networks
3.6 Closing Remarks
3.7 Exercises
3.8 Tutorial—Classification From Linear to Non-linear Models
References
4 Measurement and Quantification
Olivier Bernard, Bram Ruijsink, Thomas Grenier and Mathieu De Craene
4.1 Clinical Introduction
4.2 Overview
4.3 AI Models for Cardiac Quantification
4.4 Quantification of Cardiac Function From CMR and Echocardiography
4.5 Quantification of Calcium Scoring From CT Imaging
4.6 Quantification of Coronary Occlusion From SPECT
4.7 Leveraging Clinical Reports as a Base of Annotations
4.8 Closing Remarks
4.9 Exercises
4.10 Tutorial—Cardiac MR Image Segmentation With Deep Learning
4.11 Opinion
References
5 Diagnosis
Daniel Rueckert, Moritz Knolle, Nicolas Duchateau, Reza Razavi and Georgios Kaissis
5.1 Clinical Introduction
5.2 Overview
5.3 Classical Machine Learning Pipeline for Diagnosis
5.4 Deep Learning Approaches for Diagnosis
5.5 Machine Learning Applications for Diagnosis
5.6 Machine Learning Approaches Based on Radiomics
5.7 Machine Learning Approaches for Large-Scale Population Studies
5.8 Challenges
5.9 Closing Remarks
5.10 Exercises
5.11 Tutorial—Two-Class and Multi-class Diagnosis
5.12 Opinion
References
6 Outcome Prediction
Buntheng Ly, Mihaela Pop, Hubert Cochet, Nicolas Duchateau, Declan O'Regan and Maxime Sermesant
6.1 Clinical Introduction
6.2 Overview
6.3 Current Clinical Methods to Predict Outcome
6.4 AI-Based Methods to Predict Outcome
6.5 Application: Prediction of Response Following Cardiac Resynchronization Therapy (CRT)
6.6 Application: AI Methods to Predict Atrial Fibrillation Outcome
6.7 Application: Risk Stratification in Ventricular Arrhythmia
6.8 Closing Remarks
6.9 Exercises
6.10 Tutorial—Outcome Prediction
6.11 Opinion
References
7 Quality Control
Ilkay Oksuz, Alain Lalande and Esther Puyol-Antón
7.1 Clinical Introduction
7.2 Overview
7.3 Motion Artefact Detection
7.4 Poor Planning Detection and Automatic View Planning
7.5 Missing Slice Detection
7.6 Segmentation Failure Detection
7.7 Closing Remarks
7.8 Exercises
7.9 Tutorial—Quality Control
7.10 Opinion
References
8 AI and Decision Support
Mariana Nogueira and Bart Bijnens
8.1 Introduction
8.2 What Does AI Bring to the Table to Support the Clinician?
8.3 Current Challenges and the Importance of Interpretability
8.4 Addressing Challenges With Interpretable AI—The Potential of Representation Learning
8.5 Closing Remarks
References
9 AI in the Real World
Alistair A. Young, Steffen E. Petersen and Pablo Lamata
9.1 Introduction
9.2 Asking the Right Question
9.3 Provenance of Data
9.4 Structural Risk
9.5 Shallow Learning
9.6 Does My Model Look Good in This?
9.7 Mechanistic Models for AI Interpretability
9.8 Utility of Community-Led Challenges
9.9 Closing Remarks
References
10 Analysis of Non-imaging Data
Nicolas Duchateau, Oscar Camara, Rafael Sebastian and Andrew King
10.1 Introduction
10.2 Electrophysiology
10.3 ECG Analysis
10.4 Electronic Health Records
10.5 Closing Remarks
References
11 Conclusions
Andrew King and Nicolas Duchateau
Supplementary Information
Solution of the Exercises
Chapter 2
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Exercise 6
Chapter 3
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Chapter 4
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Exercise 6
Chapter 5
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Exercise 6
Exercise 7
Chapter 6
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Chapter 7
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Index


AI and Big Data in Cardiology: A Practical Guide

Nicolas Duchateau · Andrew P. King (Editors)

Editors:
Nicolas Duchateau, CREATIS, Lyon 1 University, Villeurbanne Cedex, France
Andrew P. King, School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK

ISBN 978-3-031-05070-1
ISBN 978-3-031-05071-8 (eBook)
https://doi.org/10.1007/978-3-031-05071-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.


Preface

How to Use This Book

Our aim in compiling this book was to bring together people from different specialities working in the area of artificial intelligence (AI) for cardiology. The structure of the book reflects the diversity of the backgrounds of its contributors. Many chapters begin with an introduction by a practising clinician or clinical scientist to give the reader an idea of current clinical workflows and challenges. This sets the scene for the technical material to follow, in which AI specialists provide an overview of their field of specialism in an accessible style. To reinforce this technical review, hands-on tutorials are provided to allow readers to gain practical experience with the concepts and techniques just discussed. Finally, clinical opinion pieces are included, which attempt to 'look into the future' to predict how AI might affect the field in years to come.

The way in which a reader might want to use this book will depend upon their needs. To fully engage with the clinical, technical, and practical parts, we advise readers to work through the sections in order, attempting the exercises that are included in the technical reviews and then following up with the practical tutorials. The later chapters, which cover specific issues regarding the use and application of AI in cardiology, are recommended for readers who have already learnt the basics of AI methods and algorithms in the earlier chapters. For those interested only in getting a good understanding of the state of the art, without also developing practical skills, the tutorial sections can be skipped. But we would like to encourage readers to devote some time to work through these tutorials. We believe that many of the most exciting scientific discoveries come from this kind of inter-disciplinarity. Sometimes it can be good to work outside of your comfort zone!

Contents and Organisation

The book consists of eleven chapters. The first three chapters contain important introductory material to set the scene for the more clinically focused chapters to follow.
• In Chap. 1, we introduce the global context that motivated us to produce this book, describe our target audience, and set the scene for the chapters to come. We also introduce key terms, such as the distinction between 'machine learning', 'artificial intelligence', and 'big data', and review some of the fundamental concepts of the field.
• In Chap. 2, we provide an introduction to the field of machine learning and its use in cardiology. The main sources of data exploited by AI techniques in cardiology are discussed. The chapter finishes with an introduction to the tools used in the practical tutorials in later chapters.


• Chapter 3 complements Chap. 2 by focusing on the specific subfield of machine learning known as 'deep learning'. Many of the most exciting recent developments in AI are based on deep learning, which uses the concept of neural networks, so we introduce the key concepts of neural networks from first principles. The chapter closes with a practical tutorial on building a simple neural network model.

The next four chapters focus on specific clinical applications of AI in cardiology.
• Chapter 4 deals with the measurement and quantification of cardiac anatomy and function. Deriving cardiac biomarkers is a key part of many clinical pipelines in cardiology, but the processes involved have traditionally been manual and time-consuming. In recent years, several AI techniques have been proposed to (partially) automate these processes. The practical tutorial gives the reader some hands-on experience of such techniques.
• In Chap. 5, we cover the clinical task of diagnosis. Determining the specific diagnosis of a heart patient is essential for optimising their treatment, but many pipelines for diagnosis and stratification of patients for treatment remain suboptimal. AI has the potential to play an important role in addressing this problem. The practical tutorial for this chapter showcases this potential by helping the reader to build a simple classification model based on image features.
• Chapter 6 focuses on outcome prediction in cardiology. This encompasses such tasks as predicting the effectiveness of treatments and predicting individual risks of cardiovascular disease. The practical tutorial illustrates how to build a deep learning model for a simple prognosis prediction task.
• In Chap. 7, we deal with the important issue of quality control. Increasingly, large databases of imaging and non-imaging data are becoming available, from which important insights can be gained about cardiovascular health and disease. In order to be able to effectively process and analyse these databases, automated AI-based pipelines are essential. Quality control techniques play an important role in these pipelines. They can identify failure cases due to poor-quality data or algorithmic errors. Such cases can be excluded or flagged for clinician review. A practical tutorial shows how to build a quality control model for detecting image artefacts.

The remaining chapters deal with more specific issues linked with the application of AI in cardiology. These chapters do not feature practical tutorials, but rather act to highlight important considerations that practitioners and users of AI should take note of before developing and deploying AI tools in cardiology.
• Chapter 8 deals with the use of AI-based decision-making tools by clinicians. It is often thought that maximising the performance of such tools in terms of classification accuracy is the prime, or indeed the only, consideration. But in the real world, AI models for decision-making will likely not act as stand-alone tools, but rather will be used by a clinician or clinicians for decision support. This raises other important requirements that should be considered during model design and development, and this chapter reviews and discusses these considerations.


• Chapter 9 deals with a range of other important issues that arise when deploying AI tools in the real world. Questions such as how to ensure that the right clinical question is being addressed and how to ensure the provenance of data are discussed in depth.
• Chapter 10 discusses AI-based analysis of non-imaging data in cardiology. Imaging data obviously plays a crucial role in cardiology, and for this reason, much of the book's contents focus on this area. But there are invaluable non-imaging data sources, such as electrocardiograms (ECGs), that still have a central role to play, and these are reviewed in this chapter.
• Finally, Chap. 11 provides some concluding remarks and reflections on the book and speculates as to the future direction of the exciting role of AI research in cardiology.

Electronic Supplementary Materials

The practical tutorials play a central role in this book. These tutorials are provided as Electronic Supplementary Material, in the form of Jupyter notebooks, and are briefly summarised at the end of each chapter. We encourage readers to experiment and engage with these useful materials.

Acknowledgements

We would like to thank the publishers, Springer, for their assistance in developing and publishing this book. Specifically, we thank Grant Weston and Leo Johnson for their support in the early stages of shaping the book's contents and Anand Chozhan and Antony Raj Joseph for supporting us during the preparation of the book. Finally, and most importantly, we give our immense thanks to all of the authors who have contributed writing and practical tutorials to this book. We believe that we have assembled a cast of world leaders in the field of AI for cardiology and that their expertise and valuable insights into the field will help to shape the future direction of research in this area. We know that all authors have busy schedules and we thank them all for their hard work and expertise in putting together their contributions—all credit for their work is due to them. As editors, we take full responsibility for any oversights or mistakes and we are always keen to get feedback from readers.

Contents

1 Introduction – 1
Andrew King and Nicolas Duchateau
2 AI and Machine Learning: The Basics – 11
Nicolas Duchateau, Esther Puyol-Antón, Bram Ruijsink and Andrew King
3 From Machine Learning to Deep Learning – 35
Pierre-Marc Jodoin, Nicolas Duchateau and Christian Desrosiers
4 Measurement and Quantification – 57
Olivier Bernard, Bram Ruijsink, Thomas Grenier and Mathieu De Craene
5 Diagnosis – 85
Daniel Rueckert, Moritz Knolle, Nicolas Duchateau, Reza Razavi and Georgios Kaissis
6 Outcome Prediction – 105
Buntheng Ly, Mihaela Pop, Hubert Cochet, Nicolas Duchateau, Declan O'Regan and Maxime Sermesant
7 Quality Control – 135
Ilkay Oksuz, Alain Lalande and Esther Puyol-Antón
8 AI and Decision Support – 157
Mariana Nogueira and Bart Bijnens
9 AI in the Real World – 171
Alistair A. Young, Steffen E. Petersen and Pablo Lamata
10 Analysis of Non-imaging Data – 183
Nicolas Duchateau, Oscar Camara, Rafael Sebastian and Andrew King
11 Conclusions – 201
Andrew King and Nicolas Duchateau
Supplementary Information – 205
Solution of the Exercises – 206
Index – 213

1 Introduction

Andrew King and Nicolas Duchateau

Contents
1.1 Aims and Motivation – 2
1.2 What are AI and Machine Learning? – 3
1.3 A Brief History – 5
1.4 AI in Medicine – 5
1.5 The Role of Big Data – 7
1.6 Outlook – 9
References – 9

Authors' contribution: Main chapter: AK, ND.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_1


Learning Objectives
At the end of this chapter you should be able to:
O1.A Explain the relationship between AI, machine learning and deep learning
O1.B Give some prominent examples of the application of AI in medicine
O1.C Explain the meaning of 'big data' and describe some of its potential benefits and risks.

1.1 Aims and Motivation

These days, it seems as if barely a week passes by without a new story in the media about the ways in which artificial intelligence (AI) has changed or will change our lives. From robotics and game playing to self-driving cars and machine translation, AI is projected to have a huge impact on our lives in the coming years and decades. But without doubt, one of the most promising areas in which AI has the potential to bring benefit to humanity is healthcare. Projects are underway all over the world that aim to exploit the power of the AI revolution in areas such as drug development, automated diagnosis, treatment optimization and computer-assisted surgery. In a recent Accenture report,¹ it was estimated that investment in the use of AI in healthcare would be $6.6 billion in 2021, with potential benefits to the US economy alone of $150 billion by 2026.

Because of this upsurge in interest and investment, a new breed of medical practitioner has begun to emerge in recent years. Whereas once there was little overlap between the clinical and technological specialisms, increasingly researchers and practitioners are occupying a space at the interface between these two exciting fields. The availability of easy-to-use but powerful software tools is enabling doctors to experiment with AI-based technological solutions and to develop their knowledge in this area. Similarly, AI researchers are engaging more and more with clinicians and learning to speak the language of medicine. This process is blurring the boundaries between the fields of medicine and AI technology and represents a huge opportunity to transform patient care. We believe that this engagement from both sides is an essential step to ensure translation of AI into clinical practice.

This book is aimed at people who wish to begin or continue their journey into this cross-over field. We emphasize that this is a journey that will never end. For people beginning their journey from either side (medicine or AI) there is always more to discover and learn, and both fields are developing fast. We hope that this book and the many valuable and insightful contributions of the authors can help readers in this lifelong learning journey.

We have tried to assemble a team of authors who are global leaders in this emerging cross-over field. There is a mix of practising cardiologists, radiologists and AI specialists, but all have a strong interest in seeing their work translated into clinical practice. We have made a deliberate effort to include perspectives from a variety of backgrounds and specialisms to reflect the diversity of the field.

¹ https://www.accenture.com/sg-en/insight-artificial-intelligence-healthcare.


None of us are experts in all areas covered in this book, and we believe that we all can and should learn from each other. This book represents a forum for such learning.

1.2 What are AI and Machine Learning?

We seem to hear the terms AI and machine learning a lot these days, whether through reading scientific papers or the popular media. More recently, the term deep learning has started to take centre stage. These terms are often used imprecisely and interchangeably, so we start by introducing their precise meanings. AI, machine learning and deep learning are strongly related, and are in fact progressively more specific concepts, as illustrated in Fig. 1.1.

In general, AI refers to the ability of machines (i.e. computers) to exhibit 'intelligent' behaviour. Although seemingly a simple definition, in reality what we term 'intelligent' is difficult to pin down, and it also shifts over time. In what has become known as the 'AI effect', many tasks, once they can be effectively performed by computers, are no longer considered to be intelligent. Whatever the definition of intelligence, a wide range of different types of approach can be employed to try to achieve intelligent behaviour. Originally, researchers tried to program intelligence directly, by specifying a series of rules and/or algorithmic steps that, when executed, would lead to the desired behaviour. An example of such an approach is the expert system, which is a rule-based inference system that aims to draw conclusions from data given a set of rules that represent expert domain knowledge. Some early attempts were made at using expert systems as clinical decision support tools [1]. For example, the MYCIN expert system [2], developed in the 1970s, was able to recommend antibiotics for patients suffering from severe infections. However, although it worked well, it was never actually used in practice due to technological limitations with deploying it at scale. (A toy sketch of rule-based inference is given at the end of this section.)

The term machine learning was first introduced by Arthur Samuel in the 1950s, who defined it as a "field of study that gives computers the ability to learn without being explicitly programmed." It came to prominence after the traditional AI approach of attempting to directly program intelligence proved hard to achieve. In reality, most tasks that require intelligence are difficult to fully specify in this way. Machine learning tries to overcome this problem by giving the computer the ability to learn new behaviours as it is exposed to new data. Potentially, these behaviours can go beyond what human beings can specify. For example, IBM's 'Watson' machine learning system has been applied to a range of applications, including the management of lung cancer.² Similarly, Babylon Health³ have developed a machine learning based online consulting system based on knowledge graphs which can learn from user-reported symptoms [3].

Deep learning is closely linked to the idea of neural networks, sometimes also called artificial neural networks. Neural networks are a specific way of implementing machine learning.

² https://www.forbes.com/sites/bruceupbin/2013/02/08/ibms-watson-gets-its-first-piece-of-business-in-healthcare/#40cf2fd05402.
³ https://www.babylonhealth.com/ai.
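To make the idea of rule-based inference concrete, here is a minimal sketch in Python. The findings and rules are invented purely for illustration and are far simpler than those of a real system such as MYCIN.

    # Minimal rule-based inference sketch (toy rules, invented for illustration).
    def recommend(findings):
        # Each rule pairs a set of required findings with a conclusion.
        rules = [
            ({"fever", "gram_negative_rod"}, "consider a broad-spectrum antibiotic"),
            ({"fever", "gram_positive_coccus"}, "consider a penicillin-class antibiotic"),
        ]
        fired = [conclusion for conditions, conclusion in rules if conditions <= findings]
        return fired or ["no rule fired; refer to a clinician"]

    print(recommend({"fever", "gram_negative_rod"}))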


Fig. 1.1 The relationship between Artificial Intelligence, Machine Learning and Deep Learning

They work by defining artificial 'neurons', which in some sense mimic the operation of neurons in biological systems such as the human brain. These artificial neurons are grouped together into layers, which are connected to form a network. The network as a whole takes a stimulus as input and produces a response as output. For example, consider a simple case in which an AI tool aims to diagnose a disease from a medical image: here the image intensities would be the stimulus and the diagnosis would be the response. (A minimal single-neuron computation is sketched below.) Deep learning and neural networks are strongly related terms and are sometimes used synonymously. However, deep learning is typically used to refer to neural networks with many layers. Recent hardware and algorithmic developments have made such deep networks possible, and they have exhibited great power in many applications, as we will see in Chap. 3. Recent examples of the use of deep learning in medicine include the tool developed by Google Research for the automated detection of diabetic retinopathy and diabetic macular edema in retinal fundus photographs [4] and the tools for automatic contouring of cardiac chambers introduced by, for example, Circle Cardiovascular Imaging.⁴

In this book we use the term AI to refer to any computer-based technique that aims to achieve intelligent behaviour, and we take a fairly broad definition of intelligence. But as we shall see, many of the most promising techniques these days are based on machine learning, and more specifically deep learning.

⁴ https://www.circlecvi.com/cvi42/cardiac-mri/.
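As a minimal illustration, the following sketch computes the response of a single artificial neuron: a weighted sum of its inputs followed by a sigmoid activation. The weights and inputs are arbitrary toy values, not a trained model.

    import numpy as np

    def neuron(x, w, b):
        # Weighted sum of inputs plus bias, passed through a sigmoid activation.
        return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

    x = np.array([0.2, 0.7, 0.1])   # stimulus, e.g. three image intensities
    w = np.array([0.5, -1.2, 0.8])  # weights (learnt during training in practice)
    print(neuron(x, w, b=0.1))      # response between 0 and 1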


1.3 A Brief History

AI has been around a lot longer than most people realize. Figure 1.2 shows a brief timeline of the major eras and milestones in the history of AI. The term 'AI' was first coined at a conference in 1956, and some early successes led to much optimism as to what could be achieved using these new techniques. For example, in 1962 an AI program developed by Arthur Samuel for playing checkers⁵ was able to defeat a master checkers player. This early optimism led to some bold, and probably unwise, claims. In 1970, Marvin Minsky famously told Life Magazine that "in from three to eight years we will have a machine with the general intelligence of an average human being." When these claims were not realized, and further investigation revealed some limitations of the techniques being used, interest in AI waned. This led to what has become known as the "AI Winter", in which research funding dried up and little progress was made.

It was not until the 1990s that interest in AI grew significantly again. As computers became more powerful it became more feasible to address many complex problems through 'brute force' (i.e. by using lots of processing power), as well as algorithmic sophistication. A famous milestone was reached in 1997 when IBM's 'Deep Blue' AI system beat the world champion Garry Kasparov at chess. This was significant because chess is a massively more complex game than checkers: the 'state space' (the number of possible board positions) of chess is approximately 10⁴⁷, compared to 10¹⁸ for checkers.

This success continued to grow into the new millennium. In 2011, IBM's question-answering AI program 'Watson' won the TV show Jeopardy. Around 2015 there was an explosion of interest in deep learning, prompted by technical developments and also the availability of fast hardware to make the training of very deep networks possible. This led to an acceleration in progress and much interest from large technology companies such as Google, Facebook and Twitter. In 2016 Google's AlphaGo program beat the 18-time world champion Lee Sedol at Go.⁶ This success brought with it public interest and research funding, much of it from industry. This represented a significant shift: for the first time, AI moved beyond a primarily university-based research topic to a tool that many believed could be exploited to solve real-world problems, and hence make money.

Considering this brief history, we can characterize the evolution of AI as starting off with a broad range of approaches, but gradually focusing on more promising specific approaches (i.e. machine learning and now deep learning). This specialization is reflected in this book.

⁵ Known as draughts in UK English.
⁶ Go has a state space of approximately 10¹⁷⁰.

1.4 AI in Medicine

The recent success of AI and deep learning has prompted researchers to address many problems in medicine with these techniques, including in cardiology.


Fig. 1.2 A timeline of AI

Figure 1.3 shows the number of papers listed on the medical search engine PubMed that mention the terms 'artificial intelligence' and 'deep learning'. We can see a gradual rise in interest in AI that has accelerated since the start of the new millennium. But interest in deep learning has grown exponentially in a much shorter time frame. It can sometimes be hard to "sort the wheat from the chaff" when faced with such a large number of recent publications, so below we have tried to pick out some of the more significant recent advances in the use of AI in medicine.⁷

Fig. 1.3 The number of papers between 1960 and 2019 on the PubMed search engine on the subjects of (a) Artificial Intelligence and (b) Deep Learning

A prerequisite to the development of many of the most powerful AI models is the formation of large-scale databases. A significant milestone here was the commencement of the UK Biobank project⁸ in 2006. This is a 30-year programme to collect genetics, imaging and other data from half a million participants, and to make the data available to researchers all over the world. This project has so far led to more than 1,000 research publications, as well as inspiring other similar initiatives. Such databases have enabled machine learning models to be trained using much larger amounts of data than was previously feasible.

In part because of this, it has recently become possible for AI to approach and even match the level of human expert performance in some applications. For example, in 2016 expert performance was achieved in the detection of diabetic retinopathy [4] and in 2017 in the identification of cancerous skin lesions [5]. These and other achievements have prompted an increase in interest in the commercialization of AI in medicine. A milestone here was achieved in 2018 when HeartFlow's 'FFRCT' analysis software was approved for reimbursement in the US healthcare system.⁹ This tool allows clinicians to noninvasively determine the impact of artery narrowing on blood flow to the heart, and hence better select patients for intervention. It is likely that coming years will see further commercial exploitation of AI. Other promising recent research includes the prediction of the onset of disease from electronic health records [6], survival prediction based on motion extracted from cardiac magnetic resonance (CMR) images [7], automation of cardiac functional analysis from CMR [8] and breast cancer screening from mammography data [9, 10].

⁷ Any such summary will necessarily be selective and subjective to a degree, and we apologize and take responsibility for any omissions. There are many good papers that we have not been able to include here.
⁸ https://www.ukbiobank.ac.uk/.
⁹ https://www.heartflow.com/reimbursement-resources.

1.5 The Role of Big Data

As noted in the previous section, many of the most powerful machine learning and deep learning techniques rely on large amounts of data to train them and evaluate their performance. One of the reasons for the recent upsurge in interest in such techniques is the growing availability of large-scale public databases. The term big data is often used to refer to such databases. There is no hard-and-fast definition of what constitutes big data, but typically these days several thousand datasets are available for many medical applications. As well as a large number of datasets (i.e. individuals), big data refers to the gathering of many different variables from each individual. For example, the UK Biobank project, mentioned above, plans to acquire a range of lifestyle, nutritional and medical data from half a million volunteers, as well as MR imaging data (brain, heart, abdomen, bones and carotid artery) from 100,000 volunteers. For cardiology, and in particular cardiac imaging, such initiatives pave the way for a new era of very large databases, compared to the previous generation of imaging studies, which typically comprised hundreds to thousands of subjects [11].

As an aside, it is important to be precise when discussing the sizes of databases. For example, the data from a 100 × 100 × 100 voxel MR scan acquired from a patient could be described as consisting of one million voxel intensity values, or one hundred slices, or just data from a single patient. How we count the sizes of databases depends upon what we want to do with them. For example, if our task is to diagnose a disease, then the units of interest are patients, and so the size of the database should be based upon the number of patients. Alternatively, if our task is to segment¹⁰ a two-dimensional image, then the units of interest will be image slices.

Much of the value of big data comes from annotations. Annotation refers to adding information to, or 'labelling', the data. For example, to return to our example of diagnosing a disease from a medical image, annotation in this case would involve adding information about disease diagnosis to the images used for training the AI tool. Alternatively, for a segmentation task annotations would consist of 'ground truth' segmentations of certain structures. What we choose to annotate depends upon the task we want to address. But most tasks involve some form of data annotation to create a useful database for training our AI model. The process of annotation is normally at least partly manual, and so can be quite time-consuming. Gathering, organising and annotating data has become an important activity in AI. Collectively, this process is known as data curation.

Of course, the benefits of big data come with associated risks, as the new data may not always come with new knowledge [12]. Smaller and well-controlled databases often consist of a limited number of variables that are more specific to the application of interest, meaning that they already focus on the question that we wish to address. They also come with better 'fidelity' of the data. By data 'fidelity' we refer to the presence of all data fields for all individuals and the absence of corruption or errors in their values. For example, high-fidelity data for our disease diagnosis example would mean having high-quality images with no serious artefacts, and diagnosis annotations without errors, for all images. With big data, ensuring data fidelity becomes much harder and requires automated techniques for controlling the quality of data and annotations, as we will see in Chap. 7 (a toy fidelity check is sketched at the end of this section). Collecting more samples and variables also means adopting a broader view on the studied diseases/applications, with potential redundancy between variables, lower or unknown relevance of some variables to the disease/application, and higher demands on storage and computational power. A global roadmap for balancing these risks with the potential for new discoveries would be of high value, as promoted by the neuroscience community [12]. In the field of cardiology, some of these practical issues will be specifically discussed in Chap. 9.

¹⁰ Segmentation refers to the process of outlining, or delineating, structures of interest in images, such as organs or tumours.
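As a toy illustration of a basic fidelity check, the snippet below screens a small, entirely invented patient table for missing fields and physiologically implausible values using the pandas library. Real quality control pipelines, as discussed in Chap. 7, go far beyond such checks.

    import numpy as np
    import pandas as pd

    # Invented toy table: one missing value and one impossible ejection fraction.
    df = pd.DataFrame({
        "patient_id": [1, 2, 3],
        "ejection_fraction": [55.0, np.nan, 210.0],
        "diagnosis": ["disease", "no disease", "disease"],
    })
    print(df.isna().mean())                     # fraction of missing values per field
    print(df.query("ejection_fraction > 100"))  # flag out-of-range values for review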

1.6 Outlook

This chapter has provided an overview of AI and machine learning in medicine, deliberately avoiding any technical details. But it is clear that we are living in interesting and exciting times, and the opportunities for turning technological advances into patient benefit are huge. The use of AI in medicine is a fast-moving field: a book like this written in 10 or 20 years' time would almost certainly look very different. But realising the potential benefits of the AI revolution in medicine will require hard work and commitment, a commitment to learning and expanding our knowledge into unfamiliar areas. The next chapter is intended to help readers in this learning journey by reviewing some more technical aspects of AI and machine learning in an accessible way.

Acknowledgements

AK was supported by the EPSRC (EP/P001009/1), the Wellcome/EPSRC Centre for Medical Engineering at the School of Biomedical Engineering and Imaging Sciences, King's College London (WT 203148/Z/16/Z) and the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare. ND was supported by the French ANR (LABEX PRIMES of Univ. Lyon [ANR-11-LABX-0063] within the program "Investissements d'Avenir" [ANR-11-IDEX-0007], and the JCJC project "MIC-MAC" [ANR-19-CE45-0005]).

References

1. Abu-Nasser B. Medical expert systems survey. Int J Eng Inf Syst (IJEAIS). 2017;1(7):218–24.
2. Shortliffe E. Computer-based medical consultations: MYCIN. Elsevier; 1976.
3. Richens J, Lee C, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun. 2020;11:3923.
4. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, Venugopalan S, Widner K, Madams T, Cuadros J, Kim R, Raman R, Nelson PC, Mega JL, Webster DR. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
5. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115–8.
6. Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6:26094.
7. Bello GA, Dawes TJW, Duan J, Biffi C, de Marvao A, Howard LSGE, Gibbs JSR, Wilkins MR, Cook SA, Rueckert D, O'Regan DP. Deep-learning cardiac motion analysis for human survival prediction. Nat Mach Intell. 2019;1:95–104.
8. Ruijsink B, Puyol-Antón E, Oksuz I, Sinclair M, Bai W, Schnabel JA, Razavi R, King AP. Fully automated, quality-controlled cardiac analysis from CMR: Validation and large-scale application to characterize cardiac function. JACC Cardiovasc Imaging. 2020;13(3):684–95.
9. McKinney SM, Sieniek M, Shetty S, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89–94.
10. Wu N, Phang J, Park J, Shen Y, Huang Z, Zorin M, Jastrzębski S, Févry T, Katsnelson J, Kim E, Wolfson S, Parikh U, Gaddam S, Lin LLY, Ho K, Weinstein JD, Reig B, Gao Y, Toth H, Pysarenko K, Lewin A, Lee J, Airola K, Mema E, Chung S, Hwang E, Samreen N, Kim SG, Heacock L, Moy L, Cho K, Geras KJ. Deep neural networks improve radiologists' performance in breast cancer screening. IEEE Trans Med Imaging. 2020;39(4):1184–94.
11. Suinesiaputra A, Medrano-Gracia P, Cowan BR, Young AA. Big heart data: Advancing health informatics through data sharing in cardiovascular imaging. IEEE J Biomed Health Inform. 2015;19(4):1283–90.
12. Frégnac Y. Big data and the industrialization of neuroscience: A safe roadmap for understanding the brain? Science. 2017;358(6362):470–7.

2 AI and Machine Learning: The Basics

Nicolas Duchateau, Esther Puyol-Antón, Bram Ruijsink and Andrew King

Contents
2.1 Introduction – 12
2.2 Defining the Problem – 12
2.3 Types of Model – 13
2.4 Model Design – 15
2.5 Model Validation – 19
2.6 Machine Learning Is Not a Panacea! – 22
2.7 Sources of Data for Machine Learning in Cardiology – 22
2.8 Imaging Sources – 23
2.9 Closing Remarks – 28
2.10 Exercises – 28
2.11 Tutorial—Introduction to Python and Jupyter Notebooks – 30
References – 31

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-05071-8_2.
Authors' contribution: Main chapter: ND, BR, AK. Tutorial: EPA, ND.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_2


Learning Objectives
At the end of this chapter you should be able to:
O2.A Clearly define the right problem and justify why machine learning is needed to solve it
O2.B Describe the different classes of machine learning model and in what types of situation they can be applied
O2.C Outline a design for a machine learning model to address a given problem in a given medical scenario
O2.D Describe how machine learning models can be fairly and quantitatively validated
O2.E Describe the main sources of data for machine learning models in cardiology


2.1 Introduction

In this chapter we will delve into the world of AI and machine learning in a bit more detail. We will look at what issues we need to consider and what decisions we should make when looking to develop a machine learning model to address a specific problem. We focus on machine learning in general, but everything that we write is also applicable to the specific field of deep learning.¹ The chapter closes with some exercises intended to reinforce what has been learnt, as well as the first of our practical tutorials, which is a chance for you to "get your hands dirty" by starting to do some simple programming using Python and Jupyter. This tutorial acts as groundwork for the more specific tutorials on different topics that will be presented in future chapters.

¹ We introduce the technical aspects of deep learning in Chap. 3.

2.2 Defining the Problem

As well as curating a database for training our AI model, it is important to think about and clearly define which problem we want to address. For example, our problem could be the diagnosis of a disease, the characterization of the function of an organ, or simply the anatomical alignment of two or more medical images. Identifying and clearly defining the problem is an essential step: as we discussed in the previous chapter, the details of which annotations (if any) we add to our data depend upon our problem. The way in which we define our problem also impacts upon which AI model(s) can be used to address it, as we will see in the next section.

Key considerations in defining the problem are the role of the AI model in the clinical workflow, as well as the potential risks involved in incorporating it. For example, if we want our model to diagnose a disease that is normally diagnosed by a radiologist, do we want to replace the radiologist, or assist the radiologist by automating 'obvious' diagnoses whilst flagging up 'difficult' ones for manual review? If we aim to identify potential disease at an earlier stage, what would happen to patients who are identified in this way? Do effective treatments exist? How invasive are they, and does their benefit outweigh their risk? Such considerations are often overlooked when proposing AI models in medicine, and we revisit this important topic in Chap. 9.

2.3 Types of Model

Once the problem has been clearly defined and we are sure that there is a beneficial role for AI to play, we can start to think about which model to employ. Focusing now specifically on machine learning techniques, it is normal to break down types of model into two main classes:
• Supervised models: The aim of a supervised model is to predict an output given an input. To train a supervised model it must be provided with a database of input/output pairs, and typically the outputs are produced by annotating the database. For example, to revisit our disease diagnosis problem, in this case the inputs would be medical images such as MR or CT scans, and the outputs would be binary labels (i.e. disease/no disease).
• Unsupervised models: With unsupervised models no output label is used. The aim of the machine learning model is to analyse the input data (e.g. images) and try to uncover patterns that might be useful for subsequent processing. These patterns can be as simple as identifying 'clusters' of similar inputs, or they can be more sophisticated representations of relations between inputs, as we will see below. The reason for not using labels could be that they are not available, that they are insufficiently trusted (e.g. distinguishing normal and reduced ejection fraction may be too reductive against the spectrum of heart failure [1]) or that supervised formulations have shown their limits [2].

Because annotation can be a time-consuming process, we are often in the situation where we only have annotations for a subset of the training database. In such cases, rather than using supervised learning on the smaller subset, a class of techniques known as semi-supervised learning [3] can be employed. These techniques are able to exploit both the annotated and unannotated data to produce a model with better performance.

A third class of machine learning models, which has been less widely used in medicine so far, is reinforcement learning. Reinforcement learning techniques are neither supervised nor unsupervised. To understand the way in which reinforcement learning works, consider a toy problem of a mouse trying to navigate a maze to find a piece of cheese (see Fig. 2.1). An AI agent is defined, which always has a current state in the environment. For example, the mouse agent will always have a location in the maze. To train the mouse, it will choose an action (a direction in the maze) which will result in a new state (location) as well as a reward. Good actions (i.e. those which eventually result in getting the cheese) are rewarded and bad actions are punished. The idea is that, by trying to solve the problem enough times and being rewarded/punished for its actions, the agent will learn to choose good actions. Although seemingly an abstract concept, applications in medicine have been proposed, for example in learning sampling strategies in MR [4], operator guidance in ultrasound imaging [5] and personalising computational models [6]. A minimal worked example of this idea is sketched after Fig. 2.1 below.


Fig. 2.1 Reinforcement learning. An agent continually chooses an action which results in a reward as well as a new state in the environment
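The following sketch trains such an agent with tabular Q-learning on a one-dimensional 'maze' of five states, with the cheese at the right-hand end. All sizes, rewards and hyperparameters are invented for illustration.

    import numpy as np

    n_states, n_actions = 5, 2      # states 0..4; actions: 0 = left, 1 = right
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.5, 0.9, 0.1
    rng = np.random.default_rng(0)

    for episode in range(200):
        s = 0                                   # the mouse starts at the left end
        while s != n_states - 1:                # until it reaches the cheese
            if rng.random() < epsilon:          # occasionally explore
                a = int(rng.integers(n_actions))
            else:                               # otherwise act greedily
                a = int(np.argmax(Q[s]))
            s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s_next == n_states - 1 else -0.01  # cheese vs small step cost
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next

    print(np.argmax(Q, axis=1))  # learnt policy: expect 'move right' (1) in states 0-3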

The choice of which class of machine learning model to employ depends upon what type of problem we have. If our problem can be clearly defined in terms of inputs and known and trusted output labels, then supervised learning can be employed. If a larger amount of extra unannotated inputs is available, semi-supervised learning can be considered. If no output labels are available, or they are not sufficiently trusted, and the aim is simply to learn about the structure and patterns in the input data, then unsupervised learning should be used. Finally, if the problem can be formulated in terms of actions, states and rewards, then reinforcement learning can be investigated.

After the problem has been analysed and an appropriate class of technique has been identified, a specific machine learning model must be chosen. A wide range of models has been proposed over the years for supervised and unsupervised learning. In Fig. 2.2 we summarize some of the more commonly used ones. For a more detailed review and specific references we recommend [7].

We can see that supervised learning models can be broken down further into classification and regression methods. The distinction here lies simply in what type of output we want to estimate. If the output type is categorical or ranked (see Fig. 2.3), then a classification model must be used. If the output type is discrete or continuous, then a regression model must be used. For example, in our disease diagnosis example the output label (disease/no disease) is binary and categorical, so a classification model would be appropriate. Similarly, the segmentation of anatomical structures involves assigning a category to each pixel of an image, and can be seen as a (pixel-wise) classification problem. On the other hand, estimating a numerical biomarker directly from an image or set of images, such as left ventricular ejection fraction (EF) in cardiac imaging, would require a regression model.

Unsupervised models can be broken down into clustering and dimensionality reduction methods. With clustering, the aim is to identify a limited number of groups of inputs that are similar in some way, i.e. they represent clusters in the distribution of inputs.


Fig. 2.2 Examples of machine learning models broken down by class:
• Supervised learning, classification: linear discriminant analysis (LDA), support vector machines (SVM), logistic regression, decision trees/forests, genetic algorithms, neural networks
• Supervised learning, regression: linear regression, ridge/kernel regression, decision trees/forests, genetic algorithms, neural networks
• Unsupervised learning, clustering: K-means, mean shift, expectation maximization, hierarchical clustering
• Unsupervised learning, dimensionality reduction: principal component analysis (PCA), independent component analysis (ICA), manifold learning

Fig. 2.3 A summary of statistical types of data:
• Categorical (a.k.a. nominal): non-quantitative, no ordering (e.g. blood type)
• Ranked (a.k.a. ordinal): non-quantitative, meaningful ordering of values (e.g. tumour grade: grade I, grade II, etc.)
• Discrete: quantitative, can take a limited number of values (e.g. number of tumours)
• Continuous: quantitative, can take any value within a range (e.g. blood pressure)

In dimensionality reduction, the input data are mapped, or transformed, to a new coordinate system, in which further analysis can take place. Standard techniques for this include the use of linear (principal component analysis, PCA) or nonlinear (manifold learning) transformations. A small example contrasting classification and regression models on synthetic data follows below.
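To make the classification/regression distinction concrete, here is a small sketch using scikit-learn on synthetic data. The feature values, labels and the 45% threshold are invented purely for illustration, not clinical guidance.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)

    # One scalar image-derived feature per 'patient': ejection fraction (%).
    ef = rng.normal(55, 10, size=200).reshape(-1, 1)
    disease = (ef.ravel() < 45).astype(int)                    # categorical output
    bnp = 500 - 5 * ef.ravel() + rng.normal(0, 20, size=200)   # continuous output

    clf = LogisticRegression().fit(ef, disease)   # classification model
    reg = LinearRegression().fit(ef, bnp)         # regression model
    print(clf.predict([[38.0]]), reg.predict([[38.0]]))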

2.4 Model Design

Having considered the type of machine learning model we can employ, we now move on to a range of other design considerations, mostly related to the data used to train and evaluate the model.


Data Descriptors


As for clinical observations and standard statistical analyses, choosing adequate inputs is key to effectively training a machine learning model. A data descriptor, also referred to as a feature, summarizes the information available in each of the studied samples. The traditional machine learning paradigm generally dissociates the feature selection and problem solving tasks. This means that the machine learning developer relies on prior knowledge of the application area to select one or several features, which are then used as inputs to the model during training and evaluation. This process is known as hand-crafting of features or descriptors. In contrast, in deep learning a feature representation (based on 'raw' inputs provided by the user) that is optimized for the problem being addressed is learnt from the data and used to solve the problem. It therefore stands as a powerful tool for new discoveries from the data, although this often comes at the cost of the interpretability² and control that a hand-crafted feature set could provide.

The simplest type of feature consists of single values, also called scalar measurements. These can be previously extracted from images, such as cardiac chamber dimensions or EF, or correspond to more advanced image characteristics at the pixel level, such as radiomics features [8]. They can also be measured by other means (such as pressures or brain natriuretic peptide (BNP) levels) or even correspond to patient characteristics or external factors. In the case of scalar input features, machine learning stands as a way to model more complex associations between the input features (and output labels, if any) than standard statistical approaches. However, inputs can also consist of more complex data structures such as signals or images, or even descriptors extracted at each location of these signals or images. The complexity of such descriptors is quantified by their dimensionality.³ Nonetheless, the intrinsic dimensionality of these descriptors is generally much lower than the dimensionality of the data: the intrinsic dimensionality is the actual number of degrees of freedom that govern the observed data. Dimensionality reduction techniques from the field of representation learning [9, 10] provide an approximation of this intrinsic dimensionality, and a simplified representation of the data that can be used as a new input for the machine learning model (a toy simulation of this idea is sketched below).

Figure 2.4 illustrates these considerations for the study of myocardial deformation from cardiac imaging data, using a single scalar value at each American Heart Association (AHA) segment or more complex descriptors at each point of the left ventricular myocardium. Using several input descriptors is rather straightforward for scalar measurements, which can be considered as elements of a higher-dimensional vector that concatenates them (after they have been normalized). In contrast, combining several high-dimensional descriptors of potentially heterogeneous types is an ongoing field of research, addressed with both machine learning and deep learning algorithms [11], and will be further discussed in Chap. 8.

² Interpretability refers to the ability of humans (e.g. end-users or model developers) to understand the process by which a machine learning model arrived at its output based on the input data. We deal with model interpretability in more detail in Chap. 8 (Sect. 8.3) and Chap. 9 (Sect. 9.7).
³ Dimensionality of data refers to the number of degrees of freedom they have, for example 10⁴ for a two-dimensional (2-D) image made of 100 × 100 pixels.

3

Interpretability refers to the ability of humans (e.g. end-users or model developers) to understand the process by which a machine learning model arrived at its output based on the input data. We deal with model interpretability in more detail in . Chap. 8 (. Sect. 8.3) and . Chap. 9 (. Sect. 9.7). Dimensionality of data refers to the number of degrees of freedom they have, for example 104 for a two-dimensional (2-D) image made of 100 × 100 pixels.


Fig. 2.4 Different choices of data descriptor for myocardial deformation (strain) on a three-dimensional mesh of the heart's left ventricle and their associated dimensionality

Data Constraints

Working with medical data requires specific care to preserve the properties of the data descriptors and to guarantee the soundness of observations. A first example from cardiac imaging may help understanding: let us consider again the analysis of myocardial deformation using strain data. In one dimension, strain is a scalar that represents the relative change in length of an object with respect to a reference state, typically between end-diastole and end-systole (Lagrangian strain). In three dimensions, strain quantifies the deformation of a three-dimensional (3-D) object (e.g. a cube representing a small portion of the myocardium at a given location), and is represented by a 3 × 3 tensor (a symmetric matrix that belongs to a specific family of matrices). This means that strain is no longer represented by a single scalar value but by 6 matrix coefficients (because the matrix is symmetric). It also means that standard operations such as addition, multiplication and averaging across a population may not preserve the tensor properties of the strain descriptor, and may result in physiologically implausible results.

A second example, also from cardiac imaging, complements this view on the allowed operations on such descriptors. Consider a dataset of segmented acute myocardial infarcts from the same coronary territory. Estimating a representative infarct pattern across a subgroup of subjects may be highly informative. Nonetheless, computing the linear average of several binary infarct patterns (previously aligned to a common reference, see the next section) results in a non-binary pattern with intermediate values that no longer resembles a plausible infarct. In this case, the machine learning model needs to consider the nonlinear structure of the space of infarct patterns so that the analysis always corresponds to plausible infarct patterns (Fig. 2.5).


Fig. 2.5 Linear and nonlinear average of two synthetic binary infarct patterns. As the space of infarct patterns is nonlinear, the average of two cases lies outside this space and does not correspond to a plausible infarct pattern (if the pixel labels for myocardium and infarct respectively correspond to 0 and 1, intermediate values of 0.5 are observed around the infarct zone shared between the two cases, as pointed out by the blue arrows). Machine learning models that handle this type of data should also consider potential nonlinearities in the data space to prevent bias in the analysis

In general, when choosing data descriptors for use by machine learning models, one should be aware of these limitations, decide the level of approximation that can be tolerated in the computations and results, and adapt the learning algorithms accordingly. To return to our myocardial strain example, this means that one can decide to work with (see Fig. 2.4):
• A single scalar value that summarizes myocardial deformation, such as strain in a given direction, at a given instant and averaged over the myocardium (e.g. peak global longitudinal strain). Here, standard comparisons between values are allowed.
• A high-dimensional object that encodes strain in a given direction, but for several instants in the cardiac cycle and/or several locations across the myocardium. Here, the model may consider each temporal instant or spatial location independently from the others, or use metrics or data representations that take into account the spatiotemporal consistency of these patterns, such as dimensionality reduction techniques.
• A strain tensor at several instants in the cardiac cycle and/or several locations across the myocardium. Here, the model should also preserve the properties of such tensors, often addressed with specific metrics and nonlinear operations [12].
Naturally, these choices are conditioned by the complexity of the question to be addressed, the number of samples available (as more complex descriptors/questions/models require larger populations) and the risk associated with the approximations made.
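The infarct-averaging pitfall of Fig. 2.5 is easy to reproduce numerically. Below is a minimal numpy sketch with two hypothetical 4 × 4 binary masks standing in for aligned, segmented infarct patterns:

```python
import numpy as np

# Two synthetic binary infarct patterns (1 = infarct, 0 = myocardium),
# already aligned to a common reference but only partially overlapping.
infarct_a = np.array([[0, 1, 1, 0],
                      [0, 1, 1, 0],
                      [0, 0, 0, 0],
                      [0, 0, 0, 0]])
infarct_b = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 0, 0, 0],
                      [0, 0, 0, 0]])

# The linear (pixel-wise) average is no longer binary: values of 0.5
# appear wherever the two patterns disagree, so the "mean infarct" is
# not itself a plausible infarct pattern.
linear_average = (infarct_a + infarct_b) / 2
print(linear_average)
```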


Data Standardization

If descriptors of heterogeneous types are used as inputs for learning, standardization of their values may be required to prevent imbalanced contributions due to incompatible units or scaling. As noted earlier, several scalar descriptors can be concatenated to form a new one of higher dimensionality, but they should first be preprocessed so that their minimum/maximum values or their average/variance values match. Specific algorithms may require binarizing or categorizing the descriptors, or more advanced schemes such as one-hot encoding.⁴ A detailed list of normalization operations can be found in many standard machine learning libraries.⁵

For high-dimensional descriptors of heterogeneous types, preprocessing may consist of finding a new representation of the data in which more standard average/variance normalization can be achieved, using dimensionality reduction techniques such as linear PCA or nonlinear manifold learning. Among nonlinear techniques, one interesting standardization approach consists of replacing the input descriptors by affinity matrices that encode the similarities between pairs of samples, generally achieved with (Gaussian) kernel functions [13].

Finally, prior standardization of the descriptors may be required to lower the effect of anatomical and timing differences between subjects. These techniques belong to the fields of computational anatomy [14, 15] and statistical atlases [16], which are under active research. Nonetheless, well-established techniques already provide acceptable reference systems of coordinates to which each subject's data can be transported. Popular methods consist of Procrustes alignment [17], registration, or parameterization techniques to estimate inter-subject correspondences and a reference anatomy, followed by interpolation or parallel transport of the subject-specific data to this reference. Temporal alignment may consist of interpolation based on physiological events [18], dynamic time warping [19] or temporal registration.

⁴ One-hot encoding refers to the binarization of categorical data, resulting in a sequence of binary values, one for each category, in which only one value is equal to 1.
⁵ See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing.
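As an illustration of the scalar case, here is a minimal scikit-learn sketch that z-scores two continuous descriptors and one-hot encodes a categorical one; the column names and values are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical patient descriptors: two continuous values with
# incompatible units/scales, and one categorical value.
data = pd.DataFrame({
    "ef_percent": [62.0, 48.0, 35.0],      # ejection fraction (%)
    "bnp_pg_ml": [80.0, 450.0, 1200.0],    # BNP level (pg/mL)
    "nyha_class": ["I", "II", "III"],      # NYHA functional class
})

# Z-score the continuous columns (zero mean, unit variance) and
# one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["ef_percent", "bnp_pg_ml"]),
    ("onehot", OneHotEncoder(), ["nyha_class"]),
])
features = preprocess.fit_transform(data)
print(features)
```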

2.5 Model Validation

Training a machine learning model means that the model parameters are optimized to solve the targeted problem on a given dataset (the training set). However, the actual challenge of machine learning is to guarantee sufficient model performance on new samples not used for training, also referred to as the generalization ability of the model. Otherwise, the model would be overfitted to the training data and therefore be of less practical use.

In the example introduced above of diagnosing a disease from a medical image, which can be seen as a supervised classification problem, the training set samples consist of pairs of images and diagnosis labels that serve as ground truth to guide the optimization process.


During training, the optimization process determines the model parameters (e.g. the logistic regression coefficients, or the neuron weights) that lead to the best classification on the training set (potentially balanced by some regularization that we will discuss later on). Then, the optimized model is applied to new images, not necessarily from the same study, from the same institution and/or acquired with the same device, etc. This new set of samples is referred to as the testing set, and the model is expected to show a comparable classification performance on this new dataset, which would validate its relevance. In general, the machine learning developer should prepare three different datasets (an illustrative code sketch of such a split is shown below):
• The training set, which is used for optimizing the model parameters.
• The validation set, which is used to evaluate the performance of the trained model on new samples not used for optimizing the model parameters.⁶
• The testing set, which consists of the actual data to analyze with a previously validated model.
The validation set is different from the training set, and therefore is not used to optimize the model parameters. However, it may be used for selecting optimal hyperparameters of the model (external values that control the model behavior, which are fixed during training). This can be seen as a complementary training of the model. The validation set may consist of samples from another study, in which case the validation procedure is referred to as external validation. However, in practice, internal validation is generally performed: the validation set consists of a subset of the training set. A more robust evaluation is obtained by repeating this procedure several times and averaging the performance results. A typical scheme consists in partitioning the training set into blocks, each time performing the validation on a different block. This procedure is known as k-fold cross validation when validation is repeated on k different blocks, or leave-one-out cross validation when a single sample is left out for validation (the remaining samples being used for training), the process being repeated to cover all samples.

The validation of supervised learning models provides two types of measures: the model performance on the training set, also called bias, and its performance on the validation set, also called variance. A non-optimized or wrong model would result in a large error on both datasets and would underfit the data, resulting in a high bias. Conversely, a model may overfit the training data and therefore perform poorly on the validation data, resulting in a low bias but a high variance. A validated model should therefore propose a trade-off between bias and variance, so that it generalizes well to new samples from the testing set (see Fig. 2.6).

⁶ But in some cases they can be used during training, e.g. for deciding when to terminate an iterative optimization process such as that used in training artificial neural networks; see Chap. 3.
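As an illustrative sketch of the three-way split described above, using scikit-learn and synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for real descriptors and diagnosis labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # 200 patients, 4 scalar descriptors
y = rng.integers(0, 2, size=200)     # binary diagnosis labels

# First hold out a testing set, then split the remainder into
# training and validation sets (a 60/20/20 split overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```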


Fig. 2.6 Determining the optimal model through bias and variance curves. Example of a nonlinear regression model with a hyperparameter σ that controls the smoothness of the regression, namely the simplicity of the fit to the training data

Standard metrics for assessing the performance of supervised models consist of error measurements that are problem-specific:
• For segmentation, the Dice coefficient (overlap between segmentation and ground truth) or the Hausdorff distance (maximal distance between segmentation and ground truth boundaries), etc.
• For classification, measures derived from the number of well-predicted (true positive and true negative) and mispredicted (false positive and false negative) samples, such as sensitivity and specificity, or precision and recall, the area under the ROC curve,⁷ etc.
• For regression, errors on the predicted values such as the sum-of-squared differences or the root mean square error, etc.
The validation of unsupervised learning models is more challenging as labels are not available or used, and the user should find alternative ways of justifying the generalization of the model:
• For clustering, the separability of the estimated clusters, their consistency across different datasets or different parameters, etc.
• For dimensionality reduction, the proportion of dimensions that explain most of the data (the model compactness), the realism/relevance of samples generated from the low-dimensional representation (the model specificity), etc.
For the sake of fairness, these metrics should differ from the measures that are minimized during the model optimization.

One can easily appreciate that better generalizability of the model can be achieved from the data perspective by increasing diversity in the training set, and from the model perspective by improving the model while balancing its adherence to the training samples, so that the validation samples are also well modelled. This last process is achieved by adding regularization constraints to the model, for example ensuring that the regression trend or the classification border is smooth, or that the segmented structures have smooth boundaries.

⁷ The receiver operating characteristic (ROC) curve is used when assessing performance in situations where we have predicted and ground truth binary labels (e.g. disease classification). The ROC curve plots sensitivity against one-minus-specificity for different predictor threshold values. The area under the ROC curve (AUC) summarizes it as a single performance measure; AUC values range between 0 and 1, with 1 indicating perfect performance. See https://en.wikipedia.org/wiki/Receiver_operating_characteristic for further details.


We encourage the reader to carefully consider these aspects, which are key to deploying a model on new cohorts and obtaining a fair estimation of its relevance. Testing state-of-the-art algorithms on different datasets or applications is a good start: the more variety in the data, the higher the clinical trust in the model's generalizability. Starting with simple models is highly recommended: they may have lower performance compared to more sophisticated models, but they can often generalize better to new samples.
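A minimal sketch of the internal k-fold cross validation described above, using a logistic regression classifier and the area under the ROC curve as the performance metric (the data are synthetic placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins for real descriptors and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

# 5-fold cross validation: train on 4 blocks, validate on the 5th,
# and rotate so every block is used once for validation.
model = LogisticRegression()
scores = cross_val_score(model, X, y, scoring="roc_auc",
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean(), scores.std())
```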

2.6 Machine Learning Is Not a Panacea!

We would like to remind both novice and experienced developers that machine learning tools are, in the end, models applied to data. By the principle of Occam's razor, a model should be as simple as possible whilst still enabling the problem to be addressed satisfactorily. Model complexity is related to the number of parameters in the model (e.g. for deep learning methods: the neurons' weights and the hyperparameters that govern the global behavior of the model). One should therefore start by carefully looking at the available data, using simple descriptors and simple models (including standard statistical methods), carefully test state-of-the-art methods on one's own data, and then decide whether the complexity of the question and the amount/diversity of samples warrant investigation of more advanced models or data descriptors. In short: start simple!

But naturally, the model performance is only as good as the data. One should not expect stunning results on testing data that differ significantly from the training data, or that have different data quality and/or confidence in the labels. Data are produced and curated by humans, so machine learning models are subject to the same biases and prejudices as humans. Furthermore, some machine learning models (e.g. deep learning) work best when trained with a lot of data, and such datasets can often be difficult to curate. In this context, recent research on model interpretability [20] (see Chap. 8, Sect. 8.3) and uncertainty [21] (see Chap. 5, Sect. 5.4) offers promising directions to complement model validation in the near future.

2.7 Sources of Data for Machine Learning in Cardiology

We have referred several times already to the importance of data, both in terms of data (and annotation) quality and the amount of data available. In this section, we review the data sources that are commonly available for training and validating machine learning models in cardiology.

Care of patients in cardiology relies heavily on data. During initial assessment as well as follow-up, detailed descriptions of data related to the patient's disease are recorded in health records. This includes a description of the patient's history, the symptoms he/she experienced, findings during physical examination, biophysical measures (heart rate, blood pressure, etc.), additional testing results (electrocardiograms, biochemistry, imaging, etc.) and finally the treatments that have been administered.


In principle, all of these data can be exploited by machine learning models, although the extent to which these possibilities have been explored varies. In the last decade, most hospitals have implemented electronic health record (EHR) systems, allowing patient data to be stored in a systematic way. Raw image data from imaging exams are usually not stored in the EHR itself. Because of their size, imaging exams are usually stored in separate, dedicated image storage systems (PACS: picture archiving and communication systems). The medical data stored in EHR and PACS systems form a valuable resource for large data-driven studies in cardiology and other fields of medicine. Below, we review the most widely used imaging and non-imaging data sources in cardiology, and briefly discuss their technical background, practical use and place within clinical care. Brief summaries of machine learning models based upon these data sources are also provided.

2.8 Imaging Sources

Echocardiography

Echocardiography is the cornerstone of imaging in cardiology. It is fast, relatively cheap and can be performed at the bedside, although most scans are prospectively planned and performed in dedicated echocardiography departments. Echocardiography was first developed in 1955. Using a time-motion display of the ultrasound wave along a single line of the ultrasound beam, called M-mode imaging, it allowed a simple visualization of the contractile motion of the myocardium. Echocardiography has since developed significantly. Nowadays, a typical exam includes 2-D and even 3-D cine imaging of the heart chambers, interrogation of blood flow using pulsed or continuous wave Doppler signals, and measurement of myocardial wall motion velocities and strain using tissue Doppler and speckle tracking technology. Figure 2.7a shows a typical 2-D echocardiography scan with blood flow velocity measured using Doppler imaging shown as a colour overlay. The duration of a typical echo exam is approximately 15−20 min.

Due to its speed, mobility and low cost, echocardiography is often the first imaging technique used to investigate cardiac function in patients with (suspected) heart disease. It allows screening of ventricular size, assessment of contractile (systolic) and relaxation (diastolic) function, and interrogation of the anatomy and function of the heart valves. Except for 3-D imaging, most images are reconstructed from the sound wave reflections in real time, resulting in a sharp contrast between blood (black) and tissue (grey-white) at frame rates ranging from 40−120 frames per second depending, for example, on the width, depth and sample line density of the ultrasound beam. This fast imaging with sharp blood-tissue contrast and the presence of consistent speckle patterns makes echocardiography suitable for assessing fast cardiac events, in particular the motion of the myocardium and heart valves, as well as flow acceleration and regurgitation through diseased heart valves or stenotic regions of blood vessels.


Fig. 2.7 Examples of cardiac imaging modalities: a echocardiography (with blood flow velocity measured using Doppler imaging shown as a colour overlay), b a frame from a cine CMR acquisition, c coronary CT angiogram, and d myocardial perfusion SPECT. Cine CMR image adapted by permission from Springer Nature from [22]. Coronary CT angiography image adapted by permission from Springer Nature from [23]. Myocardial perfusion SPECT image adapted by permission from Springer Nature from [24]

Several limitations of echocardiography may strongly impact the use of machine learning techniques on these image sequences. Firstly, the ultrasound beam is hindered by the bony structures of the chest wall and air in the lungs. As a result, imaging planes are limited and the quality of the images can vary significantly between patients. This impedes accurate, reproducible measurements of cavity volumes to calculate EF, as well as assessment of certain structures, notably the atria and the right ventricle (RV), from trans-thoracic echocardiography. Trans-oesophageal echocardiography reduces some of these disadvantages but is invasive. The second disadvantage is that echocardiography does not allow characterization of myocardial tissue structure, and also provides no information about myocardial perfusion, which is an important factor in coronary artery disease, the most common disease in cardiology.

Despite these limitations, machine learning has started to be applied to the analysis of echocardiography images [25]. For example, machine learning models have been developed for automatically classifying standard view planes [26], quantification of cardiac function [27] and disease detection [28].


Cardiac MR

Cardiac magnetic resonance (CMR) is a more recently developed technique for imaging of the heart. In CMR, the spin speed and direction of hydrogen nuclei are manipulated using magnetic gradients. Echoes of the resulting changes in the electromagnetic signal are received by the scanner and used to construct images of the anatomical structures. In comparison with echocardiography, CMR allows imaging of the heart and all other structures in the chest, without being restricted by imaging windows or the depth of the imaging beam. As a result, it allows more reliable quantification of cardiac volumes and function. Moreover, as the signals are based on the quantity of hydrogen nuclei in tissues, it also allows characterization of the composition of the myocardium. This way, it can be used to detect fibrotic tissue (scars of previous ischemic events) or the presence of inflammation or molecular deposits in the tissue.

A typical CMR exam currently takes about 30−40 min. Multiple different image sequences are acquired to obtain all relevant information: cine imaging is used to acquire dynamic cardiac images and myocardial motion information, late gadolinium enhancement (LGE) imaging is used for scar detection, and T1 and T2 maps are used for characterization of deposits and inflammation. CMR images are typically reconstructed using information obtained over multiple heartbeats. Therefore, breath-holds or breathing navigators are needed to ensure a similar position of the heart during acquisition. A sample frame from a cine CMR acquisition is shown in Fig. 2.7b.

The main disadvantage of CMR is that MR machines are bulky and expensive. Moreover, metal implants, such as internal defibrillators or pacemakers, cause distortions to the images, and the narrow bore of the machine is challenging for patients experiencing claustrophobia. In clinical practice, CMR exams are not currently used in the initial screening for heart disease. They are typically requested for patients with established heart disease in whom investigation of the underlying cause (using tissue characterization and scar detection) or reliable quantification of right and left ventricular volumes is needed to inform further treatment decisions. The role of CMR in cardiology is still growing. The increasing speed and higher quality of CMR images, as well as their increased presence in clinical guidelines, is resulting in more patients being referred for CMR to investigate causes of heart failure or to monitor treatments.

Because of the generally better image quality of CMR compared to echocardiography and the availability of large annotated databases, machine learning models for CMR analysis are more mature [29]. For example, robust models have been proposed for image reconstruction [30], segmentation [31] and automatic biomarker estimation with quality control [32]. Of relevance to such models is the fact that CMR images are typically acquired "slice-by-slice", and image resolution is normally good within-plane but less good through-plane. This has consequences for subsequent image analysis algorithms, introducing extra uncertainty into measurements made in the through-plane direction.


Furthermore, the 3-D nature of many CMR images introduces extra computational cost if fully 3-D processing is attempted, and so many models instead limit themselves to 2-D slice-by-slice analysis. Processing 2-D slices also offers more images to train machine learning algorithms, but may raise spatial consistency issues that are currently under active research.

Cardiac CT

Computed tomography (CT) imaging utilizes X-ray radiation to create an image of the internal organs of the body. In cardiology, it is often used for static imaging of the structural anatomy of the heart and the structures related to it, such as the coronary arteries and great vessels. This use reflects the main benefit of CT: its high spatial resolution and good contrast between myocardium, blood and more calcified structures. The main application of cardiac CT is the qualitative and quantitative assessment of atherosclerotic deposits and stenoses in the coronary arteries. For example, a sample coronary CT angiogram is shown in Fig. 2.7c. CT is also frequently used to investigate atrial anatomy prior to procedures that involve ablation or isolation of electrical foci of atrial fibrillation, or to assess structural abnormalities of the cardiac and vascular anatomy in patients with congenital heart disease. Dynamic imaging of the heart during contraction is possible using CT, but the significant radiation doses involved currently make it less attractive than CMR. However, in patients with metal implants or claustrophobia, 4-D cardiac CT can be an option. Machine learning models have been proposed for applications including reconstructing CT images from incomplete X-ray projection data [33], segmentation [31] and assessment of coronary artery disease [34].

Other Imaging Modalities

Single photon emission computed tomography (SPECT) and positron emission tomography (PET) are nuclear imaging techniques that can be used to quantify myocardial perfusion (see Fig. 2.7d). Myocardial perfusion defects, originating from occlusive coronary artery or microvascular disease, can be detected and quantified using these scans, similar to CMR perfusion imaging. Radioisotopes that emit gamma rays are injected, and the rays are detected by gamma cameras. By obtaining recordings at rest and during physical or pharmacological stress (which increases blood flow to the myocardium), myocardial perfusion defects can be detected. SPECT and PET are currently the standard option for myocardial perfusion assessment in many hospitals, although the newer CMR perfusion exams have started to replace these techniques in some centres. PET scans can also be used to investigate metabolically active tissues other than the myocardium, such as cardiac tumours or infections of endocardial structures (endocarditis). Examples of the use of machine learning in these modalities include PET reconstruction [35] and prediction of coronary artery disease from SPECT [36].


Non-imaging Sources

Electrocardiogram

The electrocardiogram (ECG) is one of the earliest technologies developed to investigate the function of the heart. In 1903, Willem Einthoven published his invention of the electrocardiograph and introduced the standard leads that allow investigation of the heart's electrical activity. Myocytes are negatively charged with respect to their outside surroundings at rest. Contraction of the myocytes is activated by a rapid shift of ions across the cell membrane that results in depolarization of the cells. Subsequent relaxation of the heart is the result of myocyte repolarization, due to a rapid reverse shift of ions. The sum of changes in myocyte polarization in the heart can be detected using the ECG. Moreover, the sequential activation of the cardiac structures (sinus node—atria—atrioventricular node—ventricles) results in a change in size and direction of the electrical field, which can be identified from the ECG traces. ECG signals are affected by the size of the cardiac structures, myocardial muscle mass, muscle oxygenation and the speed of the activation wave front through the ventricles.

An ECG is obtained using a small, mobile and cheap device and can be performed within a minute. Due to its sensitivity to changes in cardiac structure or function and its ease of use, the ECG is one of the most widely used tests in cardiology. It is the main diagnostic tool to identify arrhythmias (disturbances in the sequence of activation in the heart) and diagnose acute coronary artery disease (acute hypoxia and necrosis of myocytes). For patients with chronic cardiac disease, ECG recordings are used to monitor changes in electrical activation that suggest disease progression.

Recently, papers on machine learning based analysis of ECGs have started to emerge. One notable example is [37], which demonstrated how a convolutional neural network could predict 1-year all-cause mortality from 12-lead ECG signals. Other notable examples come from the PhysioNet and Computing in Cardiology communities, who organize public data challenges on ECG processing and diagnosis on a yearly basis. The 2020 challenge was extremely popular and involved more than 200 teams using machine learning algorithms to diagnose 12-lead ECG signals from several large databases totalling 66,000+ recordings [38]. Machine learning also offers relevant solutions for the modeling and analysis of electrophysiological data, including 3-D mappings acquired from catheter recordings [39] and personalized computational cardiac simulations [40]. These applications are discussed further in Chap. 10.

Electronic Health Records

In EHRs, doctors record all patient-related information in a systematic fashion. A typical daily report for a patient includes a medical history, details verbally given by the patient about his or her complaints, findings from physical examination, a brief description of test results (e.g. important biochemical abnormalities or imaging findings), a conclusion and a treatment plan. Cardiologists use these detailed reports, made during every outpatient clinic visit or daily during in-hospital stays, to evidence their care, hand over between different professionals in the medical team, and evaluate and register treatment effects.


Apart from the reports written by doctors, the EHR also contains separate modules that display biochemistry lab results, ECG recordings, imaging exam reports (such as the report of an analysed echo exam or CMR scan) and structured lists of contact moments (such as outpatient visits or admissions), previous diagnoses, and current and previously prescribed medication. PACS systems are similar to the EHR, except that they are dedicated solutions for archiving the acquired medical images and do not contain other information apart from that relevant to the images, such as patient identifiers.

The use of machine learning with EHRs has focused on two different applications: (i) automated generation of EHR reports from imaging data [41], and (ii) machine learning based analysis of EHR data [42]. We review each of these fields in more detail in Chap. 10.

2.9 Closing Remarks

We hope that this chapter has provided the reader with a grounding in the fundamental concepts of traditional machine learning models, as well as an awareness of some of the potential pitfalls and difficulties developers might face and of the data sources that such models typically exploit in cardiology. Next, we provide several exercises to let you self-test and reinforce your knowledge, followed by our first hands-on tutorial that we hope will help you to get started in your explorations of machine learning model development.

However, as we saw in Chap. 1, much of the recent success of, and interest in, machine learning comes not from the types of traditional model that we have discussed in this chapter, but rather from models based upon artificial neural networks, or deep learning. In the next chapter, we introduce the fundamental theory behind such models, and also provide a tutorial to help you to develop your own neural network model.

2.10 Exercises

Exercise 1 In what situations might an unsupervised machine learning model be an appropriate choice?

Exercise 2 What imaging and non-imaging data are typically popular for the development of machine learning algorithms in cardiology? Are some more challenging than others and why?


Exercise 3 A machine learning model has been developed for automated diagnosis of some types of cardiovascular disease based on CT images. To train the model, the developers have used a training set of 100 CT images and associated diagnoses. They have implemented a number of different supervised machine learning models, each with different hyperparameter settings. The best-performing model on a test set of 50 CT images and diagnoses has been chosen for deployment. What concerns do you have about the validation strategy adopted by the developers? Would you expect the chosen model to perform as well when deployed on real clinical data?

Exercise 4 A company is developing an automated tool to segment the aorta from CMR images, with a view to using the segmentations to derive functional biomarkers. The company plans to train a supervised segmentation model using annotations produced by manual contouring. However, the manual contouring process is very laborious and time-consuming. What alternative approach would you recommend?

Exercise 5 An implantable cardioverter-defibrillator (ICD) is a small battery-powered device that is implanted in the chest to monitor heart rhythm and detect irregular heartbeats. An ICD can deliver electric shocks via one or more wires connected to the heart to fix abnormal heart rhythms. A research team is investigating more targeted use of ICDs to avoid unnecessary interventions. They would like to use machine learning to exploit routine clinical data in order to more accurately predict which patients are likely to suffer life-threatening arrhythmias in the future. Explain how you would go about designing a machine learning solution for this problem.

Exercise 6 A clinical study is investigating whether automated measurements of global longitudinal left ventricular strain made from echocardiography can be a useful predictor of major adverse cardiac events (MACE—a composite endpoint that combines nonfatal stroke, nonfatal myocardial infarction and cardiovascular death). The team would like to use machine learning techniques in their study.


Suggest some ways in which machine learning could help in the study. What type(s) of model would be appropriate and how could they be validated?

2.11 Tutorial—Introduction to Python and Jupyter Notebooks

Tutorial 1
As for the other notebooks, the contents of this notebook are accessible as Electronic Supplementary Material.

Overview
In this first hands-on tutorial, you will go through the basics of the Python language and its objects. We will use a Jupyter Notebook, a very convenient didactic and interactive tool that can mix written explanations and sections of code. Our notebooks are each tailored to a specific problem related to the chapter that precedes them. You will be asked to run existing sections of code, examine the outputs, and fill in missing code or adapt it to test different behaviours of an algorithm. An example of the kind of interactive Jupyter Notebook cell you will run is sketched below.
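As a minimal illustration of such a cell (the values are invented), running it displays the printed result directly underneath:

```python
# A typical notebook cell: run it, then inspect the printed output.
import numpy as np

heart_rates = np.array([72, 85, 60, 95])  # beats/minute
print("Mean heart rate:", heart_rates.mean())
```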

Objectives
• Become familiar with the basics of Python and Jupyter Notebooks.
• Understand the main objects (variables, functions, operators, etc.) that will be handled in the subsequent hands-on tutorials.
• Gain practice on simple illustrative exercises.


Computing Requirements
Each notebook starts with a brief "System setting" section, which imports the necessary packages, installs any that are missing, and imports our own modules. You will need Python installed on your computer and a software tool to run the notebooks (we recommend, for example, the free software JupyterLab, https://jupyter.org/). We assume that you have already installed very common packages such as NumPy, Matplotlib and scikit-learn. In case you are missing one of these packages, or another one, we recommend running the following command (here illustrated for one of them):

    pip install scikit-learn

We hope you’ll enjoy these contents!

Acknowledgements ND was supported by the French ANR (LABEX PRIMES of Univ. Lyon [ANR-11-LABX-0063] within the program "Investissements d'Avenir" [ANR-11-IDEX-0007], and the JCJC project "MIC-MAC" [ANR-19-CE45-0005]). EPA was supported by the EPSRC (EP/R005516/1) and by core funding from the Wellcome/EPSRC Centre for Medical Engineering (WT 203148/Z/16/Z). BR was supported by the NIHR Cardiovascular MedTech Co-operative award to the Guy's and St Thomas' NHS Foundation Trust and the Wellcome/EPSRC Centre for Medical Engineering at King's College London (WT 203148/Z/16/Z). AK was supported by the EPSRC (EP/P001009/1), the Wellcome/EPSRC Centre for Medical Engineering at the School of Biomedical Engineering and Imaging Sciences, King's College London (WT 203148/Z/16/Z) and the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare.

References
1. Keulenaer GWD, Brutsaert DL. Systolic and diastolic heart failure are overlapping phenotypes within the heart failure spectrum. Circulation. 2011;123(18):1996−2005.
2. Daubert C, Behar N, Martins RP, Mabo P, Leclercq C. Avoiding non-responders to cardiac resynchronization therapy: A practical guide. Eur Heart J. 2016;38(19):1463−72.
3. Cheplygina V, de Bruijne M, Pluim JP. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med Image Anal. 2019;54:280−96.
4. Pineda L, Basu S, Romero A, Calandra R, Drozdzal M. Active MR k-space sampling with reinforcement learning. Proc MICCAI. Springer LNCS. 2020;12262:23−33.
5. Milletari F, Birodkar V, Sofka M. Straight to the point: Reinforcement learning for user guidance in ultrasound. In: Wang Q, Gomez A, Hutter J, McLeod K, Zimmer V, Zettinig O, Licandro R, Robinson E, Christiaens D, Turk EA, Melbourne A, editors. Smart ultrasound imaging and perinatal, preterm and paediatric image analysis. Cham: Springer; 2019. p. 3−10.
6. Neumann D, Mansi T, Itu L, Georgescu B, Kayvanpour E, Sedaghat-Hamedani F, Amr A, Haas J, Katus H, Meder B, Steidl S, Hornegger J, Comaniciu D. A self-taught artificial agent for multi-physics computational model personalization. Med Image Anal. 2016;34:52−64.
7. Qiu J, Wu Q, Ding G, Xu Y, Feng S. A survey of machine learning for big data processing. EURASIP J Adv Signal Process. 2016;67.


8. Gillies RJ, Kinahan PE, Hricak H. Radiomics: Images are more than pictures, they are data. Radiology. 2015;278(2). 9. Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35(8):1798−828. 10. Yan S, Xu D, Zhang B, Zhang H, Yang Q, Lin S. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans Pattern Anal Mach Intell. 2007;29: 40−51. 11. Li Y, Yang M, Zhang Z. A survey of multi-view representation learning. IEEE Trans Knowl Data Eng. 2018;PP:1−1. 12. Pennec X, Fillard P, Ayache N. A Riemannian framework for tensor computing. Int J Comput Vis. 2006;66(1):41−66. 13. Coifman RR, Lafon S. Diffusion maps. Appl Comput Harmon Anal. 2006;21(1):5−30. 14. Miller MI. Computational anatomy: Shape, growth, and atrophy comparison via diffeomorphisms. NeuroImage. 2004;23:S19−33. 15. Miller MI, Qiu A. The emerging discipline of computational functional anatomy. NeuroImage. 2009;45(1), Suppl 1:S16−39. 16. Young AA, Frangi AF. Computational cardiac atlases: From patient to population and back. Exp Physiol. 2009;94(5):578−96. 17. Gower JC. Generalized procrustes analysis. Psychometrika. 1975;40:33−51. 18. Perperidis D, Mohiaddin RH, Rueckert D. Spatio-temporal free-form registration of cardiac MR image sequences. Med Image Anal. 2005;9(5):441−56. 19. Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust Speech Signal Process. 1978;26(1):43−9. 20. Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci. 2019;116(44):22 071−80. 21. Ghahramani Z. Probabilistic machine learning and artificial intelligence. Nature. 2015;521:452−9. 22. Oksuz I, Ruijsink B, Puyol-Antón E, Bustin A, Cruz G, Prieto C, Rueckert D, Schnabel JA, King AP. Deep learning using k-space based data augmentation for automated cardiac MR motion artefact detection. In: International conference on medical image computing and computer-assisted intervention. Springer; 2018. p. 250−8. 23. Cesare ED, Patriarca L, Panebianco L, Bruno F, Palumbo P, Cannizzaro E, Splendiani A, Barile A, Masciocchi C. Coronary computed tomography angiography in the evaluation of intermediate risk asymptomatic individuals. Radiol Med. 2018;123:686−94. 24. Dorbala S, Ananthasubramaniam K, Armstrong IS, Chareonthaitawee P, DePuey EG, Einstein AJ, Gropler RJ, Holly TA, Mahmarian JJ, Park M-A, Polk DM, Russell R III, Slomka PJ, Thompson RC, Wells RG. Single photon emission computed tomography (SPECT) myocardial perfusion imaging guidelines: Instrumentation, acquisition, processing, and interpretation. J Nucl Cardiol. 2018;25:1784−846. 25. Alsharqi M, Woodward WJ, Mumith JA, Markham DC, Upton R, Leeson P. Artificial intelligence and echocardiography. Echo Res Pract. 2018;5(4):R115−25. 26. Madani A, Arnaout R, Mofrad M, Arnaout R. Fast and accurate view classification of echocardiograms using deep learning. npj Digit Med. 2018;1:6. 27. Ghorbani A, Ouyang D, Abid A, He B, Chen JH, Harrington RA, Liang DH, Ashley EA, Zou JY. Deep learning interpretation of echocardiograms. npj Digit Med. 2020;3(10). 28. Zhang J, Gajjala S, Agrawal P, Tison GH, Hallock LA, Beussink-Nelson L, Lassen MH, Fan E, Aras MA, Jordan C, Fleischmann KE, Melisko M, Qasim A, Shah SJ, Bajcsy R, Deo RC. Fully automated echocardiogram interpretation in clinical practice. Circulation. 2018;138(16):1623−35. 29. 
Leiner T, Rueckert D, Suinesiaputra A, Baeßler B, Nezafat R, Išgum I, Young A. Machine learning in cardiovascular magnetic resonance: Basic concepts and applications. J Cardiovasc Magn Reson. 2019;21:12.
30. Hammernik K, Klatzer T, Kobler E, Recht M, Sodickson D, Pock T, Knoll F. Learning a variational network for reconstruction of accelerated MRI data. Magn Reson Med. 2017;79:04.
31. Chen C, Qin C, Qiu H, Tarroni G, Duan J, Bai W, Rueckert D. Deep learning for cardiac image segmentation: A review. Front Cardiovasc Med. 2020;7:25.


32. Ruijsink B, Puyol-Antón E, Oksuz I, Sinclair M, Bai W, Schnabel JA, Razavi R, King AP. Fully automated, quality-controlled cardiac analysis from CMR: Validation and large-scale application to characterize cardiac function. JACC Cardiovasc Imaging. 2020;13(3):684−95.
33. Dong J, Fu J, He Z. A deep learning reconstruction framework for X-ray computed tomography with incomplete data. PLOS ONE. 2019;14(11):1−17.
34. Hampe N, Wolterink J, Velzen S, Leiner T, Išgum I. Machine learning for assessment of coronary artery disease in cardiac CT: A survey. Front Cardiovasc Med. 2019;6:11.
35. Reader AJ, Corda G, Mehranian A, da Costa-Luis C, Ellis S, Schnabel JA. Deep learning for PET image reconstruction. IEEE Trans Radiat Plasma Med Sci. 2020;1−1.
36. Betancur J, Commandeur F, Motlagh M, Sharir T, Einstein AJ, Bokhari S, Fish MB, Ruddy TD, Kaufmann P, Sinusas AJ, Miller EJ, Bateman TM, Dorbala S, Di Carli M, Germano G, Otaki Y, Tamarappoo BK, Dey D, Berman DS, Slomka PJ. Deep learning for prediction of obstructive disease from fast myocardial perfusion SPECT: A multicenter study. JACC Cardiovasc Imaging. 2018;11(11):1654−63.
37. Raghunath S, Ulloa Cerna AE, Jing L, VanMaanen DP, Stough J, Hartzel DN, Leader JB, Kirchner HL, Stumpe MC, Hafez A, Nemani A, Carbonati T, Johnson KW, Young K, Good CW, Pfeifer JM, Patel AA, Delisle BP, Alsaid A, Beer D, Haggerty CM, Fornwalt BK. Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network. Nat Med. 2020;26(6):886−91.
38. Perez Alday EA, Gu A, Shah AJ, Robichaux C, Wong AKI, Liu C, Liu F, Bahrami Rad A, Elola A, Seyedi S, Li Q, Sharma A, Clifford GD, Reyna MA. Classification of 12-lead ECGs: The PhysioNet/Computing in Cardiology Challenge 2020. Physiol Meas. 2020;41(12):124003.
39. Cantwell CD, Mohamied Y, Tzortzis KN, Garasto S, Houston C, Chowdhury RA, Ng FS, Bharath AA, Peters NS. Rethinking multiscale cardiac electrophysiology with machine learning and predictive modelling. Comput Biol Med. 2019;104:339−51.
40. Lopez-Perez A, Sebastian R, Izquierdo M, Ruiz R, Bishop M, Ferrero JM. Personalized cardiac computational models: From clinical data to simulation of infarct-related ventricular tachycardia. Front Physiol. 2019;10:580.
41. Messina P, Pino P, Parra D, Soto A, Besa C, Uribe S, Andía M, Tejos C, Prieto C, Capurro D. A survey on deep learning and explainability for automatic image-based medical report generation. arXiv. 2020.
42. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6:26094.


3 From Machine Learning to Deep Learning

Pierre-Marc Jodoin, Nicolas Duchateau and Christian Desrosiers

Contents
3.1 Introduction
3.2 Machine Learning and Neural Networks
3.3 K-Class Prediction
3.4 Handling Non-linearly Separable Data
3.5 Convolutional Neural Networks
3.6 Closing Remarks
3.7 Exercises
3.8 Tutorial—Classification From Linear to Non-linear Models
References

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-05071-8_3.

Authors' contribution:
• Main chapter: PJ, CD.
• Tutorial: ND, CD, PJ.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_3


Learning Objectives
At the end of this chapter you should be able to:
O3.A Explain how the equation for a straight line leads naturally to the formulation of a simple binary classifier
O3.B Describe the differences between the main gradient descent optimization algorithms
O3.C Explain the operation of simple artificial neural networks such as the perceptron and logistic regression
O3.D Decide which approach to use to extend a machine learning model to classify data which are not linearly separable
O3.E Explain the basic principles of convolutional neural networks (CNNs), and summarize their relevance for analysing medical images and signals


3.1 Introduction

As mentioned in the previous chapter, a very active area of AI is machine learning, which encompasses mathematical models whose behaviour adapts to the data they are faced with. This adaptation process is called learning, an obscure term that we ought first to disambiguate. In this chapter, we lay the mathematical foundations of supervised machine learning through a simple medical example. This example will soon lead to the notions of classification function, training, neuron, neural network and deep neural network, as well as fundamental concepts associated with these notions.

The example goes as follows: for a few days, patients have been showing up at a clinic to adjust their medication. Based on their symptoms, some patients need to extend their treatment, while others, who had a successful reaction to the medication, are now healthy and may stop their treatment. This example includes two (and only two) classes of patients: those who are sick and those who are healthy. We assume that the status of a patient can be determined by the inspection of two characteristics: the body temperature in degrees Celsius and the heart rate in beats/minute. In a retrospective analysis, N patients were analysed and their information stored in a dataset that we shall call D. These characteristics can be visualized for the whole population using a scatter plot as shown in Fig. 3.1a, where each patient is represented by a point in a 2-D feature space. In mathematical terms, this translates into a dataset

$D = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), \ldots, (\mathbf{x}_N, t_N)\}$

where $\mathbf{x}_i \in \mathbb{R}^2$ is a vector containing the body temperature and the heart rate of patient i, while $t_i \in \{\text{healthy}, \text{sick}\}$ stands for the patient's status.

Technical Note
In the equations of this chapter, standard letters like i or t designate scalar variables. A bold letter in lower case, such as x or w, stands for a vector, while a bold variable in upper case, such as W, stands for a matrix. Also, a variable with a subscript indicates a single element of a set (e.g. xᵢ and tᵢ are the variables of the i-th patient).


Fig. 3.1 a 2-D scatter plot of linearly separable healthy and sick patients. b Implicit equation of a line with its normal vector N, and examples of the line equation for three points


As one can see from Fig. 3.1a, the sick patients are those with a fever and/or a high heart rate, so their corresponding points lie away from those of the healthy ones in the plot. The goal of supervised learning is to learn a function $f: \mathbb{R}^2 \rightarrow \{\text{healthy}, \text{sick}\}$ which can correctly identify the status of a patient according to the data contained in D, i.e.

$f(\mathbf{x}_i) = t_i, \quad \forall i \in \{1, \ldots, N\}. \qquad (3.1)$

Put another way, for any given patient, the machine learning function f must be capable of converting a vector of characteristics x (in our example, temperature and heart rate) into a class label (in our example, sick or healthy). As such, f(x) is called a classification function.

Technical Note
The notation $f: \mathbb{R}^2 \rightarrow \{\text{healthy}, \text{sick}\}$ means that we define a function f that maps from (→) a vector of 2 real values ($\mathbb{R}^2$) to a single label which can take either of the values {healthy, sick}. We call $\mathbb{R}^2$ the domain of the function and {healthy, sick} the range. The $\forall i \in \{1, \ldots, N\}$ in Eq. (3.1) means that this mapping should be correct for all (∀) values of i between 1 and N.
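As a concrete illustration of such a dataset D, one might generate and plot synthetic measurements as follows; all numbers below are invented purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic healthy patients: normal temperature and heart rate.
healthy = np.column_stack([rng.normal(36.8, 0.3, 50),   # deg Celsius
                           rng.normal(70, 8, 50)])      # beats/minute
# Synthetic sick patients: fever and/or elevated heart rate.
sick = np.column_stack([rng.normal(38.5, 0.5, 50),
                        rng.normal(100, 10, 50)])

X = np.vstack([healthy, sick])           # feature vectors x_i
t = np.array([+1] * 50 + [-1] * 50)      # labels t_i (healthy/sick)

plt.scatter(X[:, 0], X[:, 1], c=t)
plt.xlabel("body temperature (deg C)")
plt.ylabel("heart rate (beats/min)")
plt.show()
```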


3.2 Machine Learning and Neural Networks

If the distribution of sick and healthy patients was known a priori,¹ it would be easy for a programmer to write a deterministic algorithm that satisfies Eq. (3.1). All one would have to do is determine on which side of the line a patient lies in the feature space of Fig. 3.1a to know the status of that patient. However, for the sake of our example, we will assume that we do not know a priori that sick patients are those with a high body temperature and a high heart rate, and that the healthy patients are the other ones. Instead, we will implement a system that discovers automatically how to separate these patients by adjusting its parameters based on the content of D via a training procedure. Since D is at the core of this procedure, it is usual to call it the training set (see Chap. 2, Model Validation). But before we dive into the specifics of the training procedure, let us first consider what f(x) looks like mathematically.

Two-Class Prediction

Different machine learning approaches may lead to different types of classification function f(x). However, one of the simplest and most widely used is the linear classifier. In Fig. 3.1a, we see that the sick and healthy patients can be separated by a straight line. That line is a symbolic representation of a linear classifier. To understand the mathematics behind a linear classifier, we shall go back to its very roots, i.e. the definition of a line. One may recall from previous school years that the equation of a line is given by

$y = mx + b \qquad (3.2)$

where x and y are the horizontal and vertical coordinates of a 2-D point, m is the slope and b the intercept (i.e. the distance to the origin). This equation is known as the explicit formulation of a line (explicit because one variable is expressed in relation to the other). One may also remember that the slope is given by a ratio: $m = \Delta y / \Delta x$. If we replace m by this ratio in Eq. (3.2) and then rearrange the terms, we get the following implicit formulation of a line,

$0 = \Delta y \cdot x - \Delta x \cdot y + \Delta x \cdot b. \qquad (3.3)$

This formulation stipulates that each point (i.e. each pair of patient characteristics) located on the line satisfies this equation. By renaming some variables, namely $\Delta y \rightarrow w_1$, $-\Delta x \rightarrow w_2$, and $\Delta x \cdot b \rightarrow w_0$, we get a less convoluted implicit formulation,

$0 = w_1 x + w_2 y + w_0. \qquad (3.4)$

¹ I.e. before looking at the data.


While x and y is a convenient naming convention for a two-dimensional space, it is far less convenient in higher-dimensional spaces. To illustrate this, another experiment could require more than two characteristics, such as age, body mass index, blood sugar level, cholesterol level, etc. In that case, having more characteristics would lead to more dimensions and more variables to name. As such, it is usual to rename x, y to x₁, x₂, where xⱼ stands for the j-th characteristic (in our case, x₁ is the body temperature and x₂ is the heart rate). The resulting implicit line equation is as follows:

$0 = w_1 x_1 + w_2 x_2 + w_0. \qquad (3.5)$

While different from y = mx + b, this equation is still the equation of a line. To convince ourselves, let us consider Fig. 3.1b, where the line parameters are (w₀, w₁, w₂)ᵀ = (4, 1, −2)ᵀ. If we take a point that falls on the line, say (2, 3), and plug it into Eq. (3.5), we get 1 × 2 − 2 × 3 + 4 which, indeed, equals zero.

Technical Note
The implicit equation of a line can be represented by the dot product between two vectors: the parameter vector w = (w₀, w₁, w₂)ᵀ and the augmented characteristic vector x′ = (1, x₁, x₂)ᵀ, where a 1 is added to x to account for the bias w₀. Assuming the column vector notation, the implicit line equation can be represented as

$0 = \mathbf{w}^T \mathbf{x}' \qquad (3.6)$

where $(\cdot)^T$ is the transpose operator. Please note that for the rest of this chapter, we will drop the prime and refer to x as an augmented vector. Furthermore, one can prove that the vector N = (w₁, w₂)ᵀ is the normal of the line. Any point lying in the direction of the normal is said to be in front of the line, while the other ones are said to be behind the line. As for w₀, it is the so-called bias, which is zero when the line crosses the origin.

Linear separation of a feature space: Things get interesting when we feed Eq. (3.5) with points that do not lie on the line. For example, if we take the point (1, 6), located above the line in Fig. 3.1b, we get 1 × 1 − 2 × 6 + 4 = −7, i.e. a negative value. For the point (4, 1), located below the line, we get 1 × 4 − 2 × 1 + 4 = 6, a positive value. This little experiment underlines an important fact: by its very nature, the implicit equation of a line separates the space into two regions, the region for which the line equation produces a positive value and the one for which it is negative. Since our goal is to classify patients as being healthy or sick, we need a binary classifier. As such, we can use the following sign function,

$\mathrm{sign}(t) = \begin{cases} +1 & \text{if } t > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (3.7)$


to convert the negative values into the −1 label and the positive scores into the +1 label (i.e. healthy = +1 and sick = −1 in our case). This leads to the following binary classification function:

$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x}). \qquad (3.8)$
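The little experiment above is easy to reproduce in code; a minimal sketch with the line parameters w = (4, 1, −2)ᵀ from Fig. 3.1b:

```python
import numpy as np

w = np.array([4.0, 1.0, -2.0])   # (w0, w1, w2), bias first

def f(x1, x2):
    """Binary linear classifier of Eq. (3.8): sign of w^T x."""
    x_aug = np.array([1.0, x1, x2])      # augmented vector (1, x1, x2)
    score = w @ x_aug                    # w0 + w1*x1 + w2*x2
    return +1 if score > 0 else -1       # sign function of Eq. (3.7)

print(f(2, 3))   # score 0  -> -1 (the point lies exactly on the line)
print(f(1, 6))   # score -7 -> -1 (above the line)
print(f(4, 1))   # score +6 -> +1 (below the line)
```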

Training an AI model: At this point, the reader may contemplate the fact that a binary linear classifier is no more than a dot product between a vector of parameters w and an augmented vector of characteristics x. If the vector w has the right values (as in Fig. 3.1a), the system can successfully separate the sick and the healthy patients by predicting the right positive and negative values. In the context of our simple medical example, the process by which the parameter vector w is adjusted to the content of the dataset D (and thus fulfills the requirement of Eq. (3.1)) is called the training operation. The dataset D contains a series of pairs (xᵢ, tᵢ) and the goal of the classification function is to predict the right label tᵢ for a given feature vector xᵢ. In this section, we describe how w can be estimated from training data and how this process generalizes to more complex models such as the perceptron and logistic regression.

Training a model can only be done with the support of a function that measures how good a set of parameters is at discriminating, for example, between healthy and sick patients. This function is called the loss function, and its value is typically proportional to the error rate of the model. As such, the goal of the training procedure is to estimate the parameter vector w which produces the lowest possible loss L(w). One way of illustrating this is through the plot of a 'loss landscape' as shown in Fig. 3.2. This plot shows the loss value (the vertical axis) for different values of the parameters w₁ and w₂. The best pair of parameters is at the bottom of the trough in the blue area. The training objective can thus be formulated as follows:

$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} L(\mathbf{w}) \qquad (3.9)$

which can be translated as: "out of every possible parameter vector w, find one that minimizes the loss function L(w) and thus best classifies the data."

Technical Note
The goal of any machine learning algorithm is to find the parameter vector w that minimizes the loss L(w). The scientific field oriented towards the development of such algorithms is called mathematical optimization. While mathematical optimization is a broad field, we will focus on a specific optimization algorithm that is widely used to train neural networks: gradient descent.

Fig. 3.2 Illustration of a loss landscape over a space defined by two parameters w₁ and w₂. The black dots are the parameter values at 4 iterations of a gradient descent optimization. The arrows illustrate the opposite direction to the gradient. Please note that this figure does not include the bias w₀ for illustrative purposes

A vector w minimizes the loss L(w) when its derivative with respect to its dimensions is zero:


$\nabla_{\mathbf{w}} L(\mathbf{w}) = \mathbf{0} \qquad (3.10)$

The left-hand side of this equation is the gradient of the loss with respect to the parameter vector w. Since we have two characteristics in our example (again, body temperature and heart rate) plus the bias, $\nabla_{\mathbf{w}} L(\mathbf{w})$ is a vector containing one partial derivative per parameter, and the previous equation can be reformulated as

$\begin{pmatrix} \partial L/\partial w_0 \\ \partial L/\partial w_1 \\ \partial L/\partial w_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. \qquad (3.11)$

Gradient descent is an iterative algorithm that minimizes the loss function by successively updating w in the opposite direction to the gradient. This is illustrated by the black dots in Fig. 3.2, which iteratively go from a high loss down to a lower loss. Since computing the exact gradient would require an infinite amount of pairs (x, t), one approach is to approximate it using the (entire) training dataset D, i.e.

$\nabla_{\mathbf{w}} L(\mathbf{w}) \approx \nabla_{\mathbf{w}} L(\mathbf{w}; D) = \frac{1}{N} \sum_{(\mathbf{x}_i, t_i) \in D} \nabla_{\mathbf{w}} L\big(\mathbf{w}, (\mathbf{x}_i, t_i)\big) \qquad (3.12)$

where N is the total number of patients in D and the sum is taken over every training data pair (xi, ti) in D. Note that the notation (w; D) is used to indicate that this version of the loss function is solely defined by the data contained in the training set, D. Using this gradient approximation based on the entire training set leads to the batch gradient descent algorithm (see Algorithm 1). In this case, we say that the parameter vector w is updated once every epoch.


Technical Note In machine learning, the term epoch refers to the optimization algorithm seeing all data in the training set once. An iteration is one update of the learning parameters, and an iteration considers a mini-batch of the training data (which has a batch size). Therefore, the size of the training set is equal to the batch size multiplied by the number of iterations in an epoch. In batch gradient descent, the batch size is the entire training set, so there is one iteration in an epoch.

Unfortunately, for various technical reasons, batch gradient descent is prohibitively slow and memory intensive. An alternative strategy, known as stochastic gradient descent (see Algorithm 2), is to approximate the gradient and update the parameters w for each training sample i. Thus, there are N iterations per epoch for stochastic gradient descent (where N is the training set size). Another slight variant uses bundles of data pairs (mini-batches) to approximate the gradient. This gives rise to the mini-batch stochastic gradient descent algorithm. In all cases, the change to the model parameters is weighted by a learning rate η, which is a predefined constant typically between 0 and 1. η is an important hyperparameter that one must determine, for example through a cross validation procedure (recall our discussion of the terms ‘hyperparameter’ and ‘cross validation’ in . Chap. 2, Model Validation).

Algorithm 1: Batch gradient descent algorithm
  input : Training set D, learning rate η
  output: Trained weights w
  Init w with [small] random values;
  for epoch = 1 to epochMAX do
      w ← w − η ∇w L(w, D);
  end
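To make Algorithm 1 concrete, here is a minimal NumPy sketch of batch gradient descent; the function grad_loss, the learning rate and the number of epochs are illustrative assumptions, and any of the batch gradients derived in this chapter (e.g. Eq. (3.14) or Eq. (3.18)) could be plugged in as grad_loss:

import numpy as np

def batch_gradient_descent(grad_loss, X, t, lr=0.01, epochs=100):
    """Minimal sketch of Algorithm 1 (batch gradient descent).

    grad_loss(w, X, t) must return the gradient of the loss averaged
    over the whole training set, e.g. Eq. (3.14) or Eq. (3.18).
    X holds one augmented feature vector (1, x1, x2, ...) per row.
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])  # init w with small random values
    for _ in range(epochs):
        w -= lr * grad_loss(w, X, t)             # one update per epoch
    return w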

The first machine learning model—the perceptron: The perceptron is often cited as the oldest neural network model [1]. It is a binary classifier that implements the classification function of Eq. (3.8). The perceptron is typically illustrated by a graph such as the one shown in . Fig. 3.3a. In this illustration, each grey circle on the left embodies an input variable. Since our example has two characteristics (body


Algorithm 2: Stochastic gradient descent algorithm
  input : Training set D, learning rate η
  output: Trained weights w
  Init w with [small] random values;
  for epoch = 1 to epochMAX do
      for (xi, ti) ∈ D do
          w ← w − η ∇w L(w, (xi, ti));
      end
  end

temperature and heart rate) the perceptron has two input variables and a 1 for the bias. Each input variable has a connecting arrow showing the direction of the flow of information, and each arrow has an associated weight value wj ∈ R. These arrows connect to a red circle, which embodies two crucial operations:
1. The dot product between the input vector x and the weights w. As mentioned before, this operation is the implicit formulation of a linear function (e.g. a line in 2-D or a plane in 3-D) which, by its very nature, linearly separates the feature space into two regions.
2. The sign function, which converts the result of the dot product to one of the two values {−1, 1}, as mentioned before. This non-linear function is referred to as an activation function.
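As a toy illustration of these two operations, the following sketch computes a perceptron prediction for one patient; the weights and measurements are made-up values, not a trained model:

import numpy as np

x = np.array([1.0, 38.5, 120.0])  # augmented input: (1 for the bias, temperature, heart rate)
w = np.array([-50.0, 1.2, 0.05])  # hypothetical weight vector (w0 is the bias)

score = np.dot(w, x)              # operation 1: the dot product wT x
prediction = np.sign(score)       # operation 2: the sign activation, in {-1, +1}
print(prediction)                 # +1 = sick, -1 = healthy in our convention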

Technical Note The red circle in . Fig. 3.3a is a so-called artificial neuron. Such an artificial neuron is nothing more than a dot product followed by a non-linear activation function. This gives rise to the concept of an artificial neural network, which is a network of such artificial neurons.

Like any machine learning model, the perceptron has a loss function, which is based on two properties of the implicit line equation (i.e. the dot product wT x). First, considering that the target t can be either +1 or −1 (in our example: sick = +1 and healthy = −1), the dot product of a misclassified point has the opposite sign to its associated target value. In other words, a sick patient with target t = +1 is misclassified when wT x < 0. Second, the more misclassified a point is, the larger the magnitude of its dot product will be. This is explained by the fact that the distance between a point and the line is related to the magnitude of the dot product. With these two properties in mind, the perceptron loss is as follows:


. Fig. 3.3 Graphical representation of a the perceptron, and b logistic regression. These are the simplest types of neural network and both are constructed using a single artificial neuron. On the left are the input variables (and 1 for the bias). The red circles represent artificial neurons that compute a dot product followed by an activation function (sign for the perceptron and σ for logistic regression)

L(w, D) = (1/M) Σ_{(xi,ti)∈M} −ti wT xi    (3.13)

where M is the set of misclassified samples, of size M. Note that the perceptron loss increases with the number of misclassified samples. On the other hand, the perceptron loss reaches zero when every point (xi, ti) ∈ D is correctly classified and hence M is empty.

Technical Note The perceptron loss being linear with respect to w, its batch gradient is given by

∇w L(w, D) = (1/M) Σ_{(xi,ti)∈M} −ti xi    (3.14)

and its gradient for one misclassified sample (xi, ti) is given by

∇w L(w, (xi, ti)) = −ti xi.    (3.15)

One can therefore plug these two equations into Algorithms 1 and 2 to train the network.
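For instance, plugging Eq. (3.15) into Algorithm 2 yields the classical perceptron learning rule; the following is a minimal NumPy sketch under the assumption that X contains augmented inputs and t contains targets in {−1, +1}:

import numpy as np

def train_perceptron(X, t, lr=0.1, epochs=50):
    """Stochastic gradient descent with the perceptron loss.

    For a misclassified sample the gradient is -ti * xi (Eq. (3.15)),
    so the update w <- w - lr * (-ti * xi) adds lr * ti * xi.
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random init
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            if t_i * np.dot(w, x_i) <= 0:         # sample is misclassified
                w += lr * t_i * x_i               # perceptron update
    return w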

Logistic regression: Like the perceptron, logistic regression is a linear classifier. As shown in . Fig. 3.3b, it can be viewed as a simple artificial neural network with a dot product between the parameters w and the input vector x. However, the logistic regression network has a different activation function called a sigmoid. The sigmoid σ : R → [0, 1] is a mathematical function defined as follows:

σ(t) = 1 / (1 + e−t)    (3.16)


. Fig. 3.4 Plot of a sigmoid activation function. This function returns a value between 0 and 1 and a value of 0.5 at the origin t = 0

where σ(0) = 0.5, σ(t) → 1 when t ≫ 0 and σ(t) → 0 when t ≪ 0 (here the arrow → means that the output of the sigmoid approaches the given value when t is very positive or very negative; note that this is not the same use of → that we saw earlier when defining the domain and range of functions, see Technical Note, Sect. 3.2). A plot of the sigmoid function is shown in . Fig. 3.4. The sigmoid is an appealing activation function when considered in conjunction with the properties of the implicit line equation, i.e. the dot product of Eq. (3.6). As mentioned before, while the dot product of a point lying on the line is zero, the same dot product of a point located in front of the line is positive, and it is negative for a point behind the line. Thus, for a logistic regression, a point lying on the line will have a score of 0.5 whereas a point in front of the line will have a score larger than 0.5 and a point behind the line a score lower than 0.5. Moreover, a point located far in front of the line will have a score ≈ 1 while a point far behind the line will have a score ≈ 0.

Technical Note While the use of a sigmoid activation function may not seem to bring much, it has nonetheless tremendous consequences. In fact, it turns the neural network into a machine capable of predicting the conditional probability of class C1: P(C1|x). This conditional probability can be translated into English as: “the probability of being in class C1 (the class sick in our example) given the input vector x”. Put another way, when properly trained, a logistic regression neural network predicts the probability of being in class C1 (and by extension class C0, since P(C0|x) + P(C1|x) = 1) given the input vector x it receives as input.

In this way, a point lying in front of the line will have a large probability of being in class C1, a point behind the line will have a low probability of being in class C1


(and thus a high probability of being in class C0 ) and a point on the line will have a 50% chance of being in class C1 . This is why the sigmoid activation function (as well as the softmax function that we will soon introduce) is widely used at the end of classification and segmentation neural networks. The loss function of the logistic regression network is the well-known cross entropy loss:


L(w, D) = −(1/N) Σ_{(xi,ti)∈D} [ ti ln(yw(xi)) + (1 − ti) ln(1 − yw(xi)) ]    (3.17)

where N is the total number of patients in the training dataset D and ti ∈ {0, 1} (instead of {−1, +1} for the perceptron). According to this function, the loss is minimal when the output of the network yw(xi) = ti ∀(xi, ti) ∈ D. In other words, the cross entropy loss is close to zero when the network correctly classifies the samples, meaning a conditional probability close to 1 when ti = 1 and close to 0 when ti = 0.

Technical Note Considering that yw(x) = σ(wT x), one can prove that the batch gradient of the loss with respect to w is

∇w L(w, D) = (1/N) Σ_{(xi,ti)∈D} (yw(xi) − ti) xi    (3.18)

and the gradient for one data pair (xi, ti) is

∇w L(w, (xi, ti)) = (yw(xi) − ti) xi.    (3.19)

Again, Algorithms 1 and 2 can be used with these gradient equations to train the logistic regression neural network.
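As a minimal sketch, the stochastic version (Algorithm 2 combined with Eq. (3.19)) can be written as follows, again assuming augmented inputs X and targets t in {0, 1}:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, t, lr=0.1, epochs=100):
    """Stochastic gradient descent with the cross entropy loss."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            y_i = sigmoid(np.dot(w, x_i))    # predicted P(C1 | xi)
            w -= lr * (y_i - t_i) * x_i      # per-sample gradient, Eq. (3.19)
    return w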

3.3 K-Class Prediction

So far, we have studied a two-class example whose goal was to separate the sick patients from the healthy ones. Obviously, one can imagine classification problems with more than two classes, for example: influenza, cold, and healthy. Fortunately, neural networks naturally scale to the number of classes. When the number of classes is larger than 2, one can simply use K output neurons (the red neurons in . Fig. 3.5a), where K is the number of classes. This gives rise to the multi-class perceptron and multi-class logistic regression networks.


. Fig. 3.5 a Three-class linear neural network. b Scatter plots of patients associated with three classes: Healthy, Cold, Influenza. The dotted lines are the linear functions of each class. Note that the activation function h will vary depending on the nature of the loss. The point (38.0, 195.7) contains the body temperature and heart rate of a patient suffering from Influenza. Note that the values reported in this plot are for illustrative purposes only

Like the neurons we have seen before, these output neurons perform a dot product on the input vector x. Like any dot product, these neurons linearly separate the feature space.

Technical Note The output of a three-class neural network is thus a vector of three dot products that can be expressed as a matrix-vector product:

⎛ w¹T x ⎞   ⎛ w¹T ⎞       ⎛ w0¹ w1¹ w2¹ ⎞
⎜ w²T x ⎟ = ⎜ w²T ⎟ x  =  ⎜ w0² w1² w2² ⎟ x = Wx    (3.20)
⎝ w³T x ⎠   ⎝ w³T ⎠       ⎝ w0³ w1³ w2³ ⎠

As can be seen, the ith row of matrix W contains the parameters of the ith classifier.

For the neural network shown in . Fig. 3.5a, the linear functions of the three output neurons are illustrated by the dotted lines in . Fig. 3.5b. For the multi-class perceptron, the output neurons have no activation function (thus h in . Fig. 3.5a is an identity function). Instead, the class predicted by the model is the one with the largest score. To illustrate this, let us consider the three-class example of . Fig. 3.5b. Here, we have a feature point x = (38, 195.7)T which corresponds to a patient whose body temperature is 38 degrees Celsius and heart rate is 195.7 beats per minute. This point lies in the green section of the space, i.e. the area associated with


class 3: Influenza. If we use the nine parameters of the system to form matrix W and multiply this by the augmented vector x we get

⎛ 1057.5  −31   0.5 ⎞ ⎛ 1     ⎞   ⎛ −22.7 ⎞
⎜ −213     21   −3  ⎟ ⎜ 38    ⎟ = ⎜ −2.1  ⎟    (3.21)
⎝ −831.0    9   2.5 ⎠ ⎝ 195.7 ⎠   ⎝ 0.25  ⎠

i.e. a negative value for classes 1 and 2 because x is located behind the blue and red dotted lines and a positive value for class 3 because it is located in front of the 3rd separation line. The multi-class perceptron loss is given by

L(W, D) = (1/M) Σ_{(xi,ti)∈M} (wjT xi − wtiT xi)    (3.22)

where j is the wrongly predicted class index and ti the target class index. Here again, the loss reaches zero when every training sample is well classified. Technical Note The stochastic gradient of the multi-class perceptron loss for a pair of misclassified samples (xi , ti ) ∈ M is given by

∇w L(w, (xi, ti)) :   ∂L/∂wj = xi ,   ∂L/∂wti = −xi    (3.23)

and the batch gradient is obtained by averaging these partial derivatives across M. Here again, the two gradient functions can be used in Algorithms 1 and 2.

The multi-class logistic network is very similar to the multi-class perceptron in the sense that the output neurons embed a dot product. However, in place of the sigmoid activation function, the output layer is followed by a softmax operation, which is a normalized exponential function. If we call fi the output of the ith neuron (in Eq. (3.21), f1 = −22.7, f2 = −2.1, f3 = 0.25), the output of the softmax function for that neuron is

Si = e^fi / (Σ_{k=1}^{K} e^fk)    (3.24)

If we apply the softmax operation to the output values of Eq. (3.21), we get S = (0.0, 0.087, 0.913)T .
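The numbers of Eqs. (3.21) and (3.24) can be checked in a few lines of NumPy:

import numpy as np

W = np.array([[1057.5, -31.0, 0.5],
              [-213.0,  21.0, -3.0],
              [-831.0,   9.0,  2.5]])   # one row of parameters per class
x = np.array([1.0, 38.0, 195.7])        # augmented input (bias, temperature, heart rate)

f = W @ x                               # Eq. (3.21), up to rounding: (-22.7, -2.1, 0.25)
S = np.exp(f) / np.exp(f).sum()         # Eq. (3.24): the softmax operation
print(S.round(3))                       # approximately (0.0, 0.087, 0.913)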


As for the two-class logistic network, this output can be seen as the conditional probability P(Ci|x). Put another way, according to the output S, data x has a 91.3% chance of belonging to class 3 and an 8.7% chance of belonging to class 2. The loss of the multi-class logistic network is also a cross entropy,

L(W, D) = −(1/N) Σ_{(xi,ti)∈D} ln Sti    (3.25)

where Sti is the probability (or the softmax output) of the correct class.

Technical Note The gradient of the cross entropy loss with respect to the weights W for a given pair (xi, ti) is

∇W L(W, (xi, ti)) = (Si − ti) xiT    (3.26)

where Si is the vector of softmax outputs for xi and ti is the one-hot encoding of the target class (a vector with 1 at the target class and 0 elsewhere). The batch gradient ∇W L(W, D) is obtained by averaging these gradients across the training set D.

3.4 Handling Non-linearly Separable Data

Linear decision functions such as those we have seen so far work well for well separated subgroups. However, it often happens that subgroups cannot be separated by a linear function, as illustrated in . Fig. 3.6a. These problems require more sophisticated and complex solutions. To tackle the problem of non-linearly separable data, three approaches are available:

. Fig. 3.6 a Example of non-linearly separable 2-D data, and b its augmented version with a third dimension (age) with a 3-D plane separating the two groups of patients. c A 3-D plane can be mathematically represented by a neural network with four input variables


1. Using a non-linear decision function.
2. Gathering more information.
3. Transforming the data.
While the first solution goes beyond the scope of this chapter, we will focus on the latter two solutions and underline how they fit within the scope of neural networks.


Gather more information: In the context of our example, non-linearly separable data means that body temperature and heart rate measurements are not discriminative enough to separate the two classes with a linear classifier. As a solution, one might acquire a third measurement, such as, for example, the age of the patient. By doing so, x ∈ R3 becomes a point in a 3-D space (see . Fig. 3.6b) and the classification function becomes a plane defined in this 3-D space. Interestingly, the implicit equation of a plane is a generalization of the equation of a line with a third dimension x3:

0 = w1 x1 + w2 x2 + w3 x3 + w0    (3.27)

where w0 is still the bias. As for the implicit equation of a line, this equation can be represented by a dot product 0 = wT x where w, x ∈ R4. Unsurprisingly, we can further generalize the formulation to q measurements. In that case, a patient becomes a point in a q-dimensional space where the sick and the healthy patients can be separated by a hyperplane. This hyperplane is again represented by a dot product 0 = wT x where w, x ∈ Rq+1. As for the implicit line equation, the implicit hyperplane equation splits the feature space into a positive region located in front of the hyperplane and a negative region behind it. As shown in . Fig. 3.6c, the use of 3 input variables (and 1 for the bias) does not change the nature of the perceptron nor of the logistic regression neural network, as it only increases the size of the input layer.

Transform the data: Another well-known solution to address the non-linearity issue is to project x into a new feature space where the data are linearly separable. The functions that perform this mapping are called basis functions, φ(x) : Rq → Rp. Once the data are projected into the new feature space, a linear classifier can be used to separate the classes.

Technical Note The use of basis functions also gave rise to kernel methods (and the iconic kernel SVM [2]), which arguably were the most widely-used machine learning methods before the deep learning wave struck the scientific community [3].

One important limitation of basis functions is that not every function φ(.) (or its associated kernel k(.) [2]) can successfully disambiguate two classes. As such, one often has to manually adjust φ(.) (or k(.)) to make it fit the training data distribution.
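As a minimal illustration, consider a synthetic 2-D dataset where one class lies inside a circle and the other outside it (an assumption made purely for the sake of the example). Appending the squared radius as a third feature is a basis function φ : R² → R³ that makes the classes linearly separable, so a plain linear classifier can then be used:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # synthetic 2-D points
t = (np.sum(X**2, axis=1) > 1.0).astype(int)   # class 1 outside the unit circle

def phi(X):
    """Basis function: append the squared radius as a third coordinate."""
    return np.column_stack([X, np.sum(X**2, axis=1)])

clf = LogisticRegression().fit(phi(X), t)      # linear classifier in the new space
print(clf.score(phi(X), t))                    # close to 1.0: now linearly separable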


. Fig. 3.7 Multi-layer neural network made of an input layer (the four grey circles on the left) followed by two hidden layers (the yellow neurons) and an output layer (the red neurons). Mathematically, the purpose of the hidden layers is to act as a basis function φ(x) that projects the input data (here points in a 3-D space) into a linearly separable space

One great advantage of neural networks is their ability to simultaneously learn the basis function φ(.) as well as its associated classification function. This can be done by increasing the number of neurons. . Figure 3.7 shows one such neural network organized into 4 layers, namely the input layer (which corresponds to the grey circles on the left), two hidden layers (the yellow neurons in the middle) and the output layer (the red neurons). As before, the neurons encode a dot product between the output of the previous layer and an associated weight vector. This architecture is called a multi-layer perceptron (MLP). The more hidden layers an MLP has, the more complex the overall neural network will be, i.e. the better it will be at estimating complex relationships between input samples x and target values ti. As before, the output layer of the MLP can be made of neurons without an activation function, in which case the loss would be the multi-class perceptron loss of Eq. (3.22). One could also add a softmax operation at the end of the network and get the cross entropy loss of Eq. (3.25). Note that the computation of the gradient of the loss with respect to the parameters of a multi-layer perceptron is done through an operation called back propagation. For more details on back propagation, please refer to [4]. The use of multiple layers of neurons leads to so-called deep neural networks and deep learning. There you have it! The more layers a neural network has, the deeper it gets.

Technical Note As shown in . Fig. 3.7, a multi-layer neural network can be seen as a three-part machine:


1. An input vector x.
2. A series of hidden layers, which act as a basis function φ(x) whose goal is to project x into a space where the data are linearly separable.
3. An output layer which is a linear classifier.


As the weights W of the network are learned all together, we say that deep neural networks are end-to-end trainable since φ(x) and the classification function are learned at the same time.
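In practice one rarely codes an MLP by hand; the hands-on tutorial for this chapter relies on scikit-learn, whose MLPClassifier learns φ(x) and the final linear classifier end-to-end. Below is a minimal sketch on a classic non-linearly separable toy dataset (the hyperparameter values are illustrative choices, not recommendations):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Two interleaving half-moons: not separable by a single line in 2-D.
X, t = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, t_train, t_test = train_test_split(X, t, random_state=0)

# The two hidden layers act as the learned basis function phi(x);
# the output layer is the final linear classifier.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
mlp.fit(X_train, t_train)
print(mlp.score(X_test, t_test))   # typically well above what a linear model achieves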

3.5 Convolutional Neural Networks

Multi-layer neural networks are not without their limitations. One of the most important limitations comes from the substantial increase in parameters when the size of the input vector increases. For example, if the input signal is a greyscale image containing 28 × 28 pixels (as for images from the iconic MNIST dataset [5]), each neuron of the first layer will be connected to a total of 28 × 28 + 1 inputs (the 1 is for the bias). Therefore, if the first hidden layer has 100 neurons, the network will have 78,500 parameters just in the first layer ((28 × 28 + 1) × 100). Even worse, if the input is, for example, a 3-D 256 × 256 × 256 MR brain volume, each neuron of the first layer will require more than 16 million parameters. Without much surprise, very large neural networks pose important memory and computing challenges. Furthermore, it is empirically known that very large multi-layer networks are difficult to train, and often converge towards sub-optimal solutions. The answer to this problem is to reduce the number of connections between two consecutive layers. While this is fundamentally difficult for an arbitrary input signal, there is an appealing solution when the input signal is temporally and/or spatially structured, such as an audio signal (1-D), a greyscale or color image (2-D) or a 3-D or 4-D medical image volume. In these cases, one can connect a neuron to a subset of neighboring neurons in the previous layer. This is illustrated in . Fig. 3.8 where each neuron in the first layer is connected to a 3 × 3 grid of input nodes (here representing pixels). In this way, each neuron has a total of 9 weights instead of the very large number we would have with a fully-connected layer. The set of 3 × 3 weights that connects a neuron to the previous layer is called a filter. Furthermore, the “images” in the middle and on the right illustrate the output of each neuron of the first and second hidden layers. These “images” are called feature maps. As usual, these artificial neurons perform a dot product followed by an activation function. One may reduce the number of parameters even further by forcing every filter of a layer to share the same set of weights. By doing so, the two hidden layers in . Fig. 3.8 would have a total of just 9 weights each. More interestingly, the number of weights then becomes constant with respect to the size of the input signal.
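The parameter counts quoted above are easy to reproduce; the following arithmetic contrasts a fully-connected first layer with a convolutional one that uses a shared 3 × 3 filter:

# Fully-connected: every neuron is connected to every input plus a bias.
mnist_inputs = 28 * 28 + 1            # 785 inputs per neuron
print(mnist_inputs * 100)             # 78,500 parameters for 100 neurons

print(256 ** 3 + 1)                   # > 16 million parameters per neuron for a 3-D volume

# Convolutional: each neuron sees a 3 x 3 neighbourhood, and all neurons
# of the layer share the same filter, so the whole layer needs only:
print(3 * 3)                          # 9 weights, whatever the input size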


. Fig. 3.8 Illustration of two convolutional layers. On the left is the input layer containing a CT scan of a knee. In the middle and on the right are the ‘feature maps’ of the first and second hidden layers. Each element of these feature maps is an artificial neuron which embodies a dot product and a nonlinear activation function. In this illustration, the feature maps show the neuron outputs. Each neuron is connected to a 3 × 3 grid of neurons in the previous layer and hence has 9 weights

Technical Note By connecting the neurons as in . Fig. 3.8 and sharing the weights across a layer, the dot product computed for each neuron of a hidden layer is mathematically identical to that of a convolution, which is why we call these network layers convolution layers. These types of neural networks are called convolutional neural networks, or CNNs for short.

Like the multi-layer neural networks that we have seen before, a CNN may have an arbitrary number of layers. Furthermore, K > 1 filters can be used at each layer which would produce K feature maps (. Fig. 3.8 shows one filter and one feature map per layer). Furthermore, like any other neural network, the number of neurons in the last layer corresponds to the number of classes (or variables in case of a regression) it should predict. CNNs are also trained with the same gradient descent and loss functions as any other neural network. CNNs are the cornerstone of the machine learning revolution in medical imaging. Applications such as medical image reconstruction, denoising, disease recognition, tumor localization, and tissue segmentation to name a few are all intimately tied to CNNs. Further details on specific types of CNN are provided in . Chap. 4.
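To make the link between dot products and convolutions explicit, here is a minimal NumPy sketch of one convolutional feature map (no padding, stride 1, bias omitted, and a simple max(·, 0) non-linearity are illustrative assumptions):

import numpy as np

def conv_feature_map(image, filt):
    """Each output value is the dot product between the filter and the
    neighbourhood it covers (the deep learning convention, i.e. without
    flipping the kernel), followed by a non-linear activation."""
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return np.maximum(out, 0.0)   # non-linear activation

feature_map = conv_feature_map(np.random.rand(8, 8), np.ones((3, 3)) / 9.0)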

3.6 Closing Remarks

In this chapter, we have introduced the fundamental concepts of artificial neural networks, and seen how the formulation of such networks is based upon a simple linear algebra operation, i.e. the dot product. We have presented how neural networks can vary from the very simple (perceptron and logistic regression) to the more complex deep neural networks that are so widespread in cardiology and other fields of medicine today. Next, we include a set of self-assessment exercises to help you reinforce your knowledge of these fundamentals. Having built up our knowledge of the key concepts of machine and deep learning, the subsequent chapters provide more focused reviews of specific topics in cardiology and the ways in which AI has impacted, and will impact, these fields.

3.7 Exercises

Exercise 1 Explain the meanings of the terms ‘iteration’ and ‘epoch’ in the context of machine learning optimization. How does the choice of batch size affect the relationship between epochs and iterations?

Exercise 2 What are the main differences between the batch gradient descent, stochastic gradient descent and mini-batch stochastic gradient descent optimization algorithms? What is the main disadvantage of batch gradient descent?

Exercise 3 What are the similarities and most fundamental difference between the perceptron and logistic regression artificial neural networks?

Exercise 4 Describe three ways in which machine learning models can be extended from classifying linearly separable data to non-linearly separable data.


Exercise 5 You have been asked to design a machine learning solution for analysing 3-D medical images and producing automated diagnoses. It is likely that the mapping from images to diagnoses is highly complex, but a large amount of training data are available to learn this mapping. Suggest which type of machine learning model might be appropriate for this application and justify your answer.

3.8 Tutorial—Classification From Linear to Non-linear Models

Tutorial 2 As with the other notebooks, the contents of this notebook are accessible as Electronic Supplementary Material. Overview In this hands-on tutorial, you will test some of the concepts introduced in . Chap. 3, in particular, simple classifiers such as the perceptron, logistic regression and multi-layer perceptron. You will examine the performance of different classifiers and the effects of hyperparameters on two synthetic datasets, which are either linearly or non-linearly separable. The figure below shows the output of three simple classifiers on non-linearly separable data, to be tested in this notebook:

Objectives
• Become more familiar with Python and the essential tools for machine learning such as scikit-learn.
• Conduct a simple classification problem by progressively testing the contents described in . Chap. 3.
Computing Requirements As in the other hands-on tutorials, this notebook starts with a brief “System setting” section, which imports the necessary packages, installs the potentially missing ones, and imports our own modules.


Acknowledgements ND was supported by the French ANR (LABEX PRIMES of Univ. Lyon [ANR-11-LABX-0063] within the program “Investissements d’Avenir” [ANR-11-IDEX-0007], and the JCJC project “MIC-MAC” [ANR-19-CE45-0005]).


References
1. Rosenblatt F. The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Report: Cornell Aeronautical Laboratory; 1957.
2. Lampert CH. Kernel methods in computer vision. Hanover, MA, USA: Now Publishers Inc.; 2009.
3. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems 25; 2012. p. 1097–1105.
4. Bishop CM. Pattern recognition and machine learning (information science and statistics). Berlin, Heidelberg: Springer; 2006.
5. LeCun Y, Cortes C. MNIST handwritten digit database; 2010. [Online]. http://yann.lecun.com/exdb/mnist/


4 Measurement and Quantification
Olivier Bernard, Bram Ruijsink, Thomas Grenier and Mathieu De Craene
Contents
4.1 Clinical Introduction
4.2 Overview
4.3 AI Models for Cardiac Quantification
4.4 Quantification of Cardiac Function From CMR and Echocardiography
4.5 Quantification of Calcium Scoring From CT Imaging
4.6 Quantification of Coronary Occlusion From SPECT
4.7 Leveraging Clinical Reports as a Base of Annotations
4.8 Closing Remarks
4.9 Exercises
4.10 Tutorial—Cardiac MR Image Segmentation With Deep Learning
4.11 Opinion
References

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-05071-8_4.
Authors’ contribution:
• Introduction, Opinion: BR.
• Main chapter: OB, MD.
• Tutorial: OB, TG.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_4


Learning Objectives
At the end of this chapter you should be able to:
O4.A Explain how standard deep learning problems in cardiac quantification are defined in the literature
O4.B Define problem formulations in terms of data inputs and outputs
O4.C Explain how to construct relevant databases for different problems and identify public datasets when they exist
O4.D Describe deep learning architectures that exist in the literature for cardiac measurement and quantification problems.

4.1 Clinical Introduction

Images and other data obtained in cardiology clinics are used to diagnose heart diseases and to inform treatment decisions. In . Chap. 2, the most common data sources in cardiology, their technical background and use cases have been discussed. In this chapter, we will discuss the quantification of biomarkers from cardiac imaging sources and the role AI can play in obtaining them. First, we will describe typical measurements for the most important imaging modalities used in cardiology. In echocardiography, a wide range of cardiac measurements are routinely obtained. Together, these measurements give a good overview of global systolic (contractile phase) and diastolic (filling phase) function of the heart, as well as the condition and function of the heart valves. From 2-D cine echocardiography images, fractional area change (FAC) of the left ventricle (LV) can be quantified by segmenting the LV blood volume at end diastole and end systole in 2- and 4-chamber orientation images. FAC provides a surrogate measure of EF of the heart. Together with peak myocardial contractile velocity, obtained from tissue Doppler, and myocardial strain calculations, obtained from speckle tracking, this measure allows investigation of the systolic function of the heart. Flow patterns over the cardiac valves are another important set of echocardiography biomarkers. For example, from a mitral valve inflow Doppler image, peak early inflow rate (peak E-wave), E-wave deceleration time and peak inflow rate during atrial contraction are obtained to measure diastolic function of the heart. Stenosis of the heart valves is typically investigated by estimating the mean pressure gradient and valve orifice area. After image acquisition, analysis to obtain these biomarkers is typically performed manually by the operator, for example by segmenting the LV blood pool (in the case of FAC) or annotating the Doppler signals. To measure strain, operators draw a region of interest (ROI) in the myocardium. An image-tracking algorithm subsequently modifies the ROI to track the myocardium over the cardiac cycle. Strain analysis has only recently migrated from research to clinical practice, as sensitivity to tracking failures meant a labour-intensive process of scrutiny of myocardial tracking was necessary to ensure accurate biomarker estimation. To obtain quantitative biomarkers from CMR data, clinicians typically manually segment the areas of interest in the selected images. These manual operations


are often aided by semi-automatic thresholding operations, and more recently neural network-based algorithms that provide initial rough segmentations that subsequently are refined by the operators. From the segmentations the target information is extracted. To quantify cardiac volumes and function, cine CMR images are segmented at end diastole and end systole. From these segmentations, volumes (blood and myocardial) are estimated and EF can be calculated. For T1/T2 mapping and late gadolinium scar imaging, the pixel intensities are used to characterize tissue health (i.e. the presence of scar or fibrosis), while for perfusion imaging they are used to quantify blood flow through the myocardium. Phase contrast CMR allows quantification of blood flow and maximal velocity in the heart and great vessels, allowing interrogation of valvular stenosis, regurgitation or shunting in the case of a heart defect. Lastly, perfusion imaging allows quantification of myocardial blood flow, which can help to diagnose and assess the severity of ischemic heart disease. The most common use of cardiac CT is to screen for coronary atherosclerotic disease in patients with angina chest pain. Using a triggered CT scan, the extent of coronary artery sclerosis is scored using a calcium score. For coronary calcium scoring, the lesions are segmented semi-automatically (often using contrast thresholding). CT coronary angiography is an extension of this technique, in which the coronaries are visualized using high resolution CT imaging in combination with a vascular contrast agent. CT coronary angiography can be used to segment the extent of atherosclerotic lesions and non-calcified plaques in the coronary arteries in detail. Several other imaging modalities can be used for myocardial perfusion imaging. Single photon emission computed tomography (SPECT) and positron emission tomography (PET) are most frequently used to assess and quantify myocardial muscle perfusion. In these techniques, radioactive isotopes are injected into the body and their uptake in the myocardium is assessed. The myocardial signal is subsequently segmented to obtain a map of regional perfusion differences. Recently, PET has been combined with CT or MRI to provide information about coronary anatomy and myocardial aspect. ECG is used in the acute setting of patients presenting with chest pain, syncope or palpitations, as well as during follow-up of patients with known heart diseases. In the acute setting, changes in the sequence of activation events in the atria and ventricles allow diagnosis of arrhythmias. Changes in depolarization and repolarization, such as S-T segment elevation, can be used to diagnose acute hypoxia of the myocardium (i.e. myocardial infarction). In long-term follow-up, ECG is used to screen for progressive heart failure; in particular, widening of the QRS duration (the activation of the ventricles) is used to determine suitability of patients for treatment with cardiac resynchronization devices. Typical measures obtained from ECG are: the amplitude and width of the p-wave (atrial activation), the QRS complex (biventricular activation) and the T-wave (repolarization), the delay between these waves and the degree of S-T segment elevation. In general, deriving the wide range of biomarkers used in cardiology currently requires a significant investment of time by skilled cardiologists to perform routine tasks such as manual contouring.
This leads to overwork and stress as well as reducing the amount of time such specialists have to devote to patient care and management. Additionally, a degree of subjectivity and the existence of

60

O. Bernard et al.

inter-observer and intra-observer variability between the measurements limit the sensitivity of detecting subtle changes in biomarkers over time. These small changes can be highly relevant for early detection of disease or when evaluating initiated treatments. Therefore, there is a need for more automated tools to assist cardiologists and reduce these burdens. The following technical review summarizes the key principles and the current state-of-the-art in such techniques.

4

4.2

Overview

Because of the large number of cardiac images that are routinely acquired using a range of modalities (as reviewed in the Clinical Introduction), there is much interest in applying AI methods to automate a number of tasks, including the quantification of clinical indices, the detection of diseases, risk prediction or the generation of medical reports [1]. In this chapter, we will first outline the variety of AI methods that have been successfully applied to cardiac image analysis tasks. We will then focus on three applications with high potential for the quantification of clinical indices, as illustrated in . Fig. 4.1.

4.3

AI Models for Cardiac Quantification

Different types of neural networks, belonging to the deep learning branch of AI algorithms, have proven their utility depending on the task under consideration (e.g. volumetric measurements, disease quantification, image enhancement). Each network is defined by its own architecture in terms of the number of layers, the number

. Fig. 4.1 Three examples of successful applications of AI for the quantification of clinical indices from cardiovascular imaging data. All can be formulated as supervised problems and recently led to expert-level performance

Measurement and Quantification

61

4

. Fig. 4.2 Illustration of the most popular neural networks used in cardiac imaging. The input data correspond to either signals or images. The decision layers output a scalar value for classification/regression and an image for segmentation/image enhancement

of neurons per layer, the type of layers and how they are connected. . Figures 4.2 and 4.3 provide an overview of the most popular neural network architectures that have been successfully applied in cardiovascular applications and these are reviewed below. Fully-connected neural networks (FCN): In . Chap. 3 we introduced the basic types of neural network, starting with the perceptron. These networks are also known as fully-connected neural networks (FCNs). FCNs are the basic class of neural networks and their architecture builds upon a set of layers composed of artificial neurons. The neurons of each layer are fully connected to the neurons of the next layer. The input data have a one-dimensional shape (that is, they are considered as a column vector, with e.g. for an image as many neurons as pixels in the image). The intermediate layers allow the creation of features that become more and more abstract and relevant to the targeted application as the layers get deeper. The final layer exploits the features thus created to make a decision. FCNs have not been commonly used on their own in cardiac imaging, but are often employed as subnetworks for more complex architectures, as described below.

62

O. Bernard et al.

4

. Fig. 4.3 Examples of quantification applications on cardiac data where AI has been successfully applied. Segmentation tasks can be seen as classification tasks applied to each pixel

Recurrent neural networks (RNN): Recurrent neural networks (RNNs) [2] are often used in situations where the data being analysed have a sequential nature, e.g. a series of images acquired over time. This type of network takes the first instance of a sequence, makes a prediction, and then takes its own output in combination with the next instance of the sequence for subsequent predictions. RNNs have been successfully applied in several applications, such as the analysis of electrocardiograms (ECGs) [3], the prediction of incident heart failure [4] and the extraction of vessel centerlines [5]. Convolutional neural networks (CNN): We introduced the concept of a convolutional neural network (CNN) in . Chap. 3. Compared to FCNs, CNNs considerably reduce the number of parameters to be learned by using convolutions to leverage the information from neighboring pixels, and to share weights across the network. CNNs also exploit downsampling layers to reduce the amount of spatial information and enable the extraction of higher level features. At the end of the

Measurement and Quantification

63

4

network, the output of the last convolutional layer can be reshaped into a vector and fed to a fully-connected neural sub-network. CNNs have been successfully applied in many applications, such as view recognition in echocardiography [6], quality assessment of echocardiograms [7], calcium scoring in low-dose chest CT imaging [8] and anatomical structure localization in CT [9]. Encoder-decoder networks (EDN): Although encoder-decoder networks (EDNs) can be fully-connected, their use in medical applications has been mostly in combination with CNNs. CNN-based EDNs, when they feature only convolution and pooling operations, are also known as fully convolutional networks.1 CNN-based EDNs are based on a two-stage convolutional network architecture. The first part, known as the encoder, is similar to conventional CNNs and extracts high level information. The second part is the decoder, which uses information from the encoder and applies a set of convolutions and upsampling operations to gradually transform feature maps with the purpose of reconstructing a higher dimensional representation. In an unsupervised setting, the output can be the same as the input and the task of the EDN is simply to reconstruct the input via the encoded representation. In this sense, EDNs are a technique for dimensionality reduction (see . Chap. 2, Types of Model). EDNs can also be used in a supervised setting, for example to produce segmentations from images. Examples of encoder-decoder architectures include (variational) autoencoders [10] and the well-known ‘ U-Net’ segmentation model proposed by Ronneberger et al. in 2015 [11]. The U-net integrates skip connections between the encoder and decoder parts with the goal of retrieving details that were potentially lost during the downsampling while also stabilizing the learning procedure. EDNs have been widely used for automatic cardiac structure delineation in echocardiography [12], CMR [13, 14] and CT imaging [15]. Generative adversarial networks (GAN): Generative adversarial networks (GANs) involve two networks that have complementary goals. Typically, one network acts as a generator to produce the desired output (e.g. an image or a segmentation) while the other network acts as a discriminator, evaluating whether the generated output is realistic. The successive optimization of both networks allows the generation of more and more realistic outputs without the need to have exact labelling or annotation. GANs are not used directly for quantification or measurement. However, they have been successfully applied as a pre-processing step in several quantification pipelines, such as the reduction of noise in low dose CT imaging for the quantification of low-density calcified inserts [16] or the generation of realistic images from another modality (e.g. CMR from CT scans or the opposite) for the quantification of the volumes of cardiac chambers when access to labelled images for one modality is limited [17]. Graph convolutional neural networks (GCN): Graph convolutional neural networks (GCNs) operate on graphs rather than imaging data. For example, a 3-D surface or volumetric mesh of the LV, or a network representing a patient population can both 1

Confusingly, fully convolutional networks are sometimes abbreviated as FCNs, conflicting with the use of the same term for fully-connected networks. In this book we limit the use of the abbreviation FCN to fully-connected networks.

64

4

O. Bernard et al.

be seen as graphs. Such graphs consist of a set of nodes (each node could stand for a point of the LV mesh, or one subject in a population) and edges that link some pairs of points based on specific relationships (for example, the closest neighbours in the LV mesh). Just as CNNs exploit the spatial nature of image pixels, GCNs take into account the connectivity of the nodes in a graph by encoding the neighbourhood relationships between the individual data points. Specific layers are used to compute convolution and downsampling operations directly on the graph structure. As in CNNs, the output of the last convolutional layer can be reshaped into a vector and serve as input to a fully-connected neural sub-network. For the moment, the use of GCNs in cardiac imaging remains marginal. However, this method has been recently successfully applied to coronary artery disease detection using polar maps derived from cardiac perfusion imaging [18] and for the quantification of cardiac motion [19].

4.4

Quantification of Cardiac Function From CMR and Echocardiography

The quantification of cardiac function through the delineation (segmentation) of anatomical structures has been widely studied in deep learning since 2018. Among the clinical targets, we can mention the volumes of the cardiac chambers [20, 21], the LV and RV being the most studied [12, 13, 22], EF (both LV and RV) [12, 23], myocardial mass [13] and left ventricular longitudinal strain [24]. These clinical indices have been investigated through different imaging modalities, CMR imaging in the lead [13, 25−27], followed by ultrasound imaging [12, 23, 28] and then CT imaging [29]. A complete description of all the deep learning approaches that have been used for cardiac segmentation is beyond the scope of this chapter. However, we invite interested readers to go through the following articles for a more technical review of the underlying methods [1, 30]. We propose in this section to focus on the automatic assessment of ventricular volumes and the corresponding EF by deep learning in both CMR and echocardiography imaging. Given that there are currently several dozens or even more deep learning methods for cardiac segmentation, one may wonder which methods are the most efficient at the moment. One of the current best ways to answer this challenging question is to look for studies that have utilized open access databases. Indeed, the interest of such databases (see . Chap. 1, The Role of Big Data) is to provide the community with all the tools (dataset, expert annotations and associated evaluation platform) to objectively analyse, reproduce and compare the different deep learning approaches. Complementary to these open access datasets, the UK Biobank [30] represents the largest existing CMR dataset which could be used to train and test deep learning methods. However, one limitation of this dataset is that it is not free to access, which inevitably limits its use by research teams, making comparison of different algorithms and architectures more difficult. . Tables 4.1 and 4.2 list the different open access datasets that have been proposed so far for the automatic estimation of ventricular volumes and EF in CMR and echocardiography, respectively. These tables illustrate the need to build the most complete dataset

100 32 200 50 200

100

16

500

100

150

2011

2012

2015

2017

2020

MICCAI RV [33]

Kaggle [34]

ACDC [35]

M&Ms [36]



STACOM [32]

45

Test

2009

Train

Sunnybrook [31]

Nb Subjects

Year

Name













LV













RV













Myo

Ground truth

CMR datasets

. Table 4.1 Representative publicly available CMR datasets tailored for cardiac segmentation













Pathology













× Centre













× Vendor

Diversity













Online evaluation

Measurement and Quantification 65

4

Year

2014

2019

2019

CETUS [37]

CAMUS [12]

EchoNet [38]

10036

450

15

Train



50

30

Test

Nb Subjects







LVendo







LVepi







LA

Ground truth







Pathology







A4C

A2C







View













× Centre × Vendor

Diversity

4

Name

Echocardiography datasets

. Table 4.2 Representative publicly available echocardiography datasets tailored for cardiac segmentation







Online evaluation

66 O. Bernard et al.


. Fig. 4.4 Architecture of one of the best current deep learning solutions for the quantification of cardiac function from CMR [39]. The U shape of the 2-D and 3-D sub-networks is composed of an encoder part to extract high level information and a decoder part to gradually transform this information to segmentation maps. The skip connections (grey arrows) between the encoder and decoder enable the recovery of details and stabilize the learning process. The use of two sub-networks allows a better exploitation of 2-D and 3-D features of the images. The outputs of the sub-networks are finally merged to compute the final segmentation mask

in order to have the greatest impact. The richness of a dataset is defined by the following aspects: total number of patients, number of annotated structures, number of involved pathologies, data collected from several hospitals, data acquired from different vendors and the presence of an evaluation platform to provide an objective framework for analysis. Based on these tables, we propose to detail the current best performing methods on the ACDC dataset for CMR [13] and on the CAMUS dataset for echocardiography imaging [12]. CMR imaging analysis: One of the currently best performing methods on the ACDC dataset was proposed by Isensee et al. [39]. The network architecture is illustrated in . Fig. 4.4 and corresponds to an ensemble (see Technical Note, below) of two U-Net architectures, one designed to process 2-D images, the other to process the full stack of slices as a 3-D volume. The segmentation results of the two sub-networks are then merged by a simple averaging operation to compute the final output. Recently, a more complete version of this network named nnU-net has been proposed by the same authors [40]. This network is currently one of the best performing methods across a large number of public challenges in medical image segmentation [41]. Technical Note Ensemble methods involve training multiple models in slightly different ways (e.g. different hyperparameters, different weight initializations, etc.). At inference time,


all models are applied and their outputs combined in some way, e.g. averaging the softmax probabilities of the models for each class in a classification problem. This strategy usually provides better performance and/or higher robustness compared to a single model.
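A minimal sketch of this combination step, assuming a list of trained models that each expose an sklearn-style predict_proba method returning per-class softmax probabilities:

import numpy as np

def ensemble_predict(models, X):
    """Average the softmax outputs of several trained models and
    return, for each input, the class with the highest mean probability."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(probs, axis=-1)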


. Figure 4.5 provides an example of a segmentation obtained by this method on a patient with a myocardial infarction with altered LV EF. This figure allows the appreciation of the segmentation accuracy of the method compared to the annotation drawn by an expert. . Tables 4.3 and 4.4 give the average geometrical and clinical scores obtained from 50 subjects and patients with several diseases (heart failure

. Fig. 4.5 Automatic CMR segmentation in a patient with a myocardial infarction. Each row corresponds to a given slice of the same patient at end diastole. From top to bottom: basal slice towards apical slice. From left to right: input image, ground truth segmentation, automatic segmentation results obtained using the method of Isensee et al. [39]. This method obtained a Hausdorff distance of 2.7 mm, 3.9 mm and 4.9 mm, respectively for the LV blood pool, LV myocardium and RV blood pool


. Table 4.3 CMR-based segmentation accuracy of the deep learning method proposed by Isensee et al. assessed on 50 patients [39]

Methods            | LV Hausdorff dist. (mm) | RV Hausdorff dist. (mm) | Myocardium Hausdorff dist. (mm)
Inter-observer     | 7.1                     | 13.2                    | 7.4
Intra-observer     | 4.7                     | 8.4                     | 5.6
Isensee et al. [39]| 6.2                     | 9.9                     | 7.2

. Table 4.4 CMR-based clinical metric accuracy of the deep learning method proposed by the group of Isensee et al. assessed on 50 patients [39]

Method       | LV volume correlation | RV volume correlation | LV EF correlation | RV EF correlation
Isensee [39] | 0.997                 | 0.988                 | 0.991             | 0.901

with infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, abnormal RV). The segmentation accuracy was assessed among other indices by the Hausdorff distance, which corresponds to the maximum distance between the reference and the estimated contours (see . Chap. 2, Model Validation). A lower Hausdorff distance thus indicates a higher accuracy. Moreover, the inter- and intra-observer variability were also provided, to give performance ranges to be matched by automated solutions. The clinical results were assessed through a correlation coefficient, the best scores corresponding to a value of one. Regarding the segmentation accuracy, it is interesting to note that the Hausdorff distances obtained by the ensemble U-Net model are lower than the inter-observer variability and close to, but still slightly higher (by less than 2 mm) than the intra-observer variability. Although further investigations should be made to validate these findings (especially for images acquired under more heterogeneous settings), these results tend to show that, when properly trained, deep learning techniques are able to reproduce the annotations of experts with a high level of accuracy. Regarding clinical indices, the method obtained highly accurate results with correlation values around 0.99 for the estimation of the volumes and EF of the LV. The most difficult clinical metric to estimate is the EF of the RV with a correlation score of 0.90. Finally, the percentage of patients for which the predicted EF is less than 5% away from the ground truth was also assessed (5% being classically considered as an acceptable error margin [42]). The method accurately predicted the LV EF for 92% of the patients. These results suggest that we might be on the verge of fully automated quantification of cardiac function in routine CMR imaging. Indeed, a number of vendors now provide deep learning-based segmentation tools as part of their CMR image analysis packages [43]. Full automation of cardiac quantification would allow much faster and less labour-intensive analysis of CMR images, so that conclusions of the examination could be provided to the patient before leaving the radiology department.
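The Hausdorff distance itself is simple to compute; SciPy provides a directed version from which the symmetric distance reported in these tables can be assembled (the two contours below are random point sets, used purely as placeholders):

import numpy as np
from scipy.spatial.distance import directed_hausdorff

reference = np.random.rand(100, 2)   # reference contour points (illustrative)
estimated = np.random.rand(100, 2)   # automatically estimated contour points

# Symmetric Hausdorff distance: the maximum of the two directed distances.
hd = max(directed_hausdorff(reference, estimated)[0],
         directed_hausdorff(estimated, reference)[0])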


In current clinical practice, the latest systems provide pre-filled radiological reports with an integrated automatic speech recognition technology so doctors can dictate the various physiological and technical parameters. An automatic CMR analysis tool could thus easily be integrated within this framework. That being said, further investigations (the most important of which are the multi-centre and multi-vendor studies at larger scales) are still required before such software gets approved by accreditation agencies (CE mark, FDA or ISO) and gets integrated into CMR consoles. To this end, recent work has shown how training a nnU-Net model with CMR images and ground truth segmentations from more than 4,000 UK Biobank and 3,000 clinical CMR scans can lead to excellent quantification performance on external databases featuring data from previously unseen centres, vendors and MR field strengths [44].

Echocardiography image analysis: The network that currently performs the best on the CAMUS dataset is the one illustrated in . Fig. 4.6 and proposed by the group of Wei et al. [45]. This method is also based on a 3-D U-Net and takes as input

. Fig. 4.6 Architecture of one of the best current deep learning solutions for the quantification of cardiac function in echocardiography imaging [45]. The network takes as input the full 2-D temporal sequence of a patient. The last layer of this network is used to optimize two tasks in parallel: the segmentation of each frame of the sequence and the bi-directional estimation of the motion field between two consecutive frames in the sequence


the full 2-D sequence of a patient. The last layer of the network is used to perform two tasks in parallel: the segmentation of each frame of the sequence and the bi-directional estimation of the motion field between two consecutive frames in the sequence. The authors show that the joint optimization of the two tasks results in better performance than the other deep learning methods that were applied on the same CAMUS dataset. . Figure 4.7 shows an example of a typical segmentation result obtained by the network proposed by Wei et al. on a patient from the CAMUS dataset. From this figure, one can appreciate the quality of the segmentation results in terms of shape accuracy and temporal consistency. . Tables 4.5 and 4.6 give the average geometrical and clinical scores obtained from 50 subjects and patients with several diseases (half of the CAMUS population had an EF value lower than 45%). The same geometrical and clinical metrics as for the ACDC dataset were used to enable comparisons between both modalities. Regarding segmentation accuracy, it is important to note that the Hausdorff distances obtained by Wei's network are lower than the

Fig. 4.7 Illustration of the results obtained with the network proposed by Wei et al. on the same patient over different time points in the sequence from end diastole (ED) to end systole (ES) [45]

Table 4.5 Echocardiography-based segmentation accuracy of the deep learning method proposed by Wei et al., assessed on 50 patients [45]

Method          | LV endocardium Hausdorff dist. (mm) | LV epicardium Hausdorff dist. (mm) | Left atrium Hausdorff dist. (mm)
Inter-observer  | 7.1                                 | 7.5                                | —
Intra-observer  | 4.6                                 | 5.0                                | —
Wei et al. [45] | 4.6                                 | 4.9                                | 5.0

Table 4.6 Echocardiography-based clinical metric accuracy of the deep learning method proposed by Wei et al., assessed on 50 patients [45]. ED and ES stand for end diastole and end systole

Method          | LV volume ED correlation | LV volume ES correlation | LV EF correlation
Inter-observer  | 0.940                    | 0.956                    | 0.801
Intra-observer  | 0.978                    | 0.981                    | 0.896
Wei et al. [45] | 0.958                    | 0.979                    | 0.926

Regarding segmentation accuracy, it is important to note that the Hausdorff distances obtained by Wei's network are lower than the inter-observer variability and of the same order as the intra-observer variability. This is very impressive given the low spatial resolution of echocardiography compared to the in-plane resolution of CMR images, and the lower quality of myocardial contours due to speckle and artefacts. These results suggest that deep learning techniques are also able to reproduce manual annotations with high accuracy in echocardiography imaging. In terms of clinical results, Wei's method also reaches correlation coefficients very close to one for the estimation of LV volumes and EF. Although these results come from a pilot study, they are additional evidence of how valuable deep learning techniques can be in automating the estimation of cardiac functional indices in clinical routine. As with CMR imaging, further experiments addressing the variability across multiple clinical centres and vendors are needed to verify these preliminary findings before applying this type of method in the clinic.
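The clinical agreement measures used in these evaluations can be reproduced with a few lines of NumPy. The sketch below computes the correlation between reference and predicted EF values and the percentage of patients within the 5% error margin; the EF values are made up for illustration.

```python
# A minimal sketch of the clinical-agreement metrics discussed above:
# Pearson correlation between predicted and reference EF, and the
# percentage of patients whose predicted EF falls within a 5% margin.
import numpy as np

ef_reference = np.array([62.0, 55.0, 38.0, 71.0, 45.0, 30.0])  # illustrative
ef_predicted = np.array([60.5, 57.0, 40.0, 69.0, 44.0, 33.5])

correlation = np.corrcoef(ef_reference, ef_predicted)[0, 1]
within_margin = np.mean(np.abs(ef_predicted - ef_reference) < 5.0) * 100

print(f"EF correlation: {correlation:.3f}")
print(f"Patients within the 5% margin: {within_margin:.0f}%")
```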

4.5 Quantification of Calcium Scoring From CT Imaging

A very specific quantification problem associated with cardiac imaging is the calcium scoring of CT images. Here, as in other applications (e.g. lung or breast imaging), deep learning has disrupted the field of screening. Indeed, human reading of high-resolution volumes is prone to errors, and AI can improve current manual specificity and sensitivity values by guiding the attention of the specialist to certain image regions. Nonetheless, the task of screening for pathological subjects is challenging for several reasons. First, screening may require imaging and reading large areas, possibly exceeding the memory available to most GPUs.² Second, most images will (hopefully) come from normal rather than pathological subjects, which requires a classification algorithm able to deal with imbalanced data. Third, as screening is meant to be performed on a high number of subjects, it should be performed at low dose for X-ray or CT imaging; this requires algorithms able to work at low contrast, with a variety of image qualities and reconstruction parameters.

2 Graphics Processing Units. Training modern deep learning models typically has high computational demands and is often only made feasible by exploiting dedicated hardware on GPUs.


A good example of addressing these challenges for calcium scoring is described in [8]. The authors jointly detect and quantify calcifications in the coronary and aortic trees. To process volumes imaged at high resolution, the authors opted for a hybrid 2-D/3-D approach and extracted three orthogonal patches centred on the voxel of interest in the sagittal, axial and coronal directions. To address the class imbalance, the authors generated batches comprising equal numbers of healthy and pathological regions. Finally, to deal with the challenge of low-contrast images, the authors worked on two types of reconstructed images—with soft and sharp filters—and evaluated the performance for all combinations of soft and sharp reconstructions at training and test time. Figures 4.8 and 4.9 show the features of the two CNNs that are cascaded for the detection of calcifications. The two CNNs are trained on a large dataset (>1,700 CT scans) for which all calcifications were manually delineated. The role of the first CNN (CNN1) is to propose candidates to the second CNN (CNN2); CNN1 is thus expected to be highly sensitive, leaving CNN2 with the role of rejecting false positives. For CNN1, the authors designed a fully convolutional network that captures as much spatial information as possible, even in deep layers. The network used dilated convolutions (see Technical Note, below), which guarantee large receptive fields, and dilation factors were gradually increased from 1 to 32, as illustrated in Fig. 4.8. Another feature of the architecture is the balancing of several terms in the loss. The first loss term is defined using the output of the entire network for the voxel at the intersection of the three orthogonal slices only.
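The following sketch illustrates two of the strategies described above—extracting three orthogonal patches around a voxel of interest and building class-balanced batches—under simplified assumptions (a NumPy volume and hypothetical voxel lists); it is not the authors' implementation.

```python
# A sketch of the 2.5-D sampling strategy described above: for a voxel of
# interest, extract three orthogonal 2-D patches, and build batches
# balanced between positive and negative voxels. Purely illustrative.
import numpy as np

def orthogonal_patches(volume, z, y, x, half=16):
    """Return the three orthogonal 2-D patches centred on voxel (z, y, x)."""
    axial    = volume[z, y - half:y + half, x - half:x + half]
    coronal  = volume[z - half:z + half, y, x - half:x + half]
    sagittal = volume[z - half:z + half, y - half:y + half, x]
    return axial, coronal, sagittal

def balanced_batch(pos_voxels, neg_voxels, batch_size, rng):
    """Sample equal numbers of pathological and healthy candidate voxels."""
    half = batch_size // 2
    pos = rng.choice(len(pos_voxels), half)
    neg = rng.choice(len(neg_voxels), half)
    return [pos_voxels[i] for i in pos] + [neg_voxels[j] for j in neg]

rng = np.random.default_rng(0)
volume = rng.normal(size=(64, 128, 128))          # dummy CT volume
patches = orthogonal_patches(volume, 32, 64, 64)  # three 32 x 32 patches
```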

Fig. 4.8 Detection of calcifications from CT [8]—CNN1. The role of this network is to propose candidates that will be refined in a subsequent step


Fig. 4.9 Detection of calcifications from CT [8]—CNN2. This network takes image patches around regions proposed by CNN1 (see Fig. 4.8) and classifies them as pathological or not. Labelling of the lesions was performed by the previous network

The other loss terms consider in turn the entirety of each of the three slices, measuring the overlap between the probabilities output by the network and the manual contouring of the experts. The different loss terms are balanced by weights, giving the same importance to the first term (central voxel) as to all slice contributions.

Technical Note
Dilated convolutions, also known as atrous convolutions, involve dilating, or spreading out, the pixels that will be processed in the convolution. Note that this is not the same as increasing the size of the convolution kernel: a convolution may still be 3 × 3, but the pixels involved are spread apart, leaving gaps of unprocessed pixels between them. This leads to an increased receptive field (i.e. the spatial extent of processed pixels) for the convolution.
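As an illustration of the Technical Note above, the following TensorFlow/Keras sketch (the library used in this book's tutorials) stacks 3 × 3 convolutions with gradually increasing dilation factors, in the spirit of CNN1. The filter counts and number of output classes are illustrative choices, not the configuration of [8].

```python
# A minimal sketch of a stack of dilated convolutions: the dilation
# factor grows from 1 to 32, rapidly enlarging the receptive field
# without pooling and without enlarging the 3 x 3 kernels.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(None, None, 1))  # one 2-D patch, any size
x = inputs
for dilation in (1, 2, 4, 8, 16, 32):  # gradually increasing dilation
    x = layers.Conv2D(32, kernel_size=3, dilation_rate=dilation,
                      padding="same", activation="relu")(x)
# Per-pixel class probabilities (illustrative two-class output).
outputs = layers.Conv2D(2, kernel_size=1, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```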

CNN2 has a simpler architecture than CNN1 (see Fig. 4.9). As pointed out earlier, its role is to distinguish between true calcifications and false positive voxels, based on local differences in the appearance of intensity profiles. Unlike CNN1, CNN2 thus used non-dilated convolutions with max pooling (see Technical Note, below). A simple cross-entropy loss was used for optimizing CNN2. Also, the end part of CNN2 used fully connected layers and provided a binary output, because the purpose was false positive reduction and not classification into multiple classes.

Technical Note
Max pooling is a downsampling operation commonly used in encoder CNNs. It involves taking the maximum value of a patch of a feature map. For example, a 2 × 2 max pooling operation is typically applied, resulting in a downsampling factor of 4 (2 in each dimension). Average pooling is a similar operation but uses the average rather than the maximum value. Another term of interest here is stride, which refers to the steps taken (in terms of number of pixels in the image) between convolution operations. For example, a convolution with stride 2 means that the convolution kernel moves by 2 pixels each time (in each direction) when computing the feature maps.
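The short TensorFlow/Keras sketch below makes these downsampling operations concrete by printing the shapes produced by max pooling, average pooling and a stride-2 convolution on a dummy feature map; all values and sizes are illustrative.

```python
# A sketch contrasting max pooling, average pooling and strided
# convolution as downsampling operations; the printed shapes make the
# downsampling factors explicit.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 64, 64, 8))          # batch of one feature map

pooled = layers.MaxPool2D(pool_size=2)(x)     # 2 x 2 max pooling
averaged = layers.AveragePooling2D(pool_size=2)(x)
strided = layers.Conv2D(8, kernel_size=3, strides=2, padding="same")(x)

print(pooled.shape)    # (1, 32, 32, 8): 4x fewer spatial values
print(averaged.shape)  # (1, 32, 32, 8)
print(strided.shape)   # (1, 32, 32, 8): stride 2 halves each dimension
```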

4.6 Quantification of Coronary Occlusion From SPECT

Betancur et al. [46] addressed the problem of predicting an occlusion of a coronary artery from perfusion SPECT images. They compared the performance of a deep learning algorithm to the standard total perfusion deficit, as measured per coronary territory or as a whole per patient. The deep learning network was trained on a large dataset (>1,500 patients) from 9 international hospitals. All patients underwent myocardial perfusion imaging and invasive coronary angiography within a 180-day interval, the angiograms being used to manually identify and locate coronary occlusions. The input of the deep learning algorithm was not the SPECT images in a Cartesian system of coordinates but a polar resampling (in the form of a Bull's eye plot). The authors compared the performance of different Bull's eye maps as inputs. First, they took the raw result of a SPECT quantification software. They then augmented this input by adding channels showing where the raw input deviated from a healthy distribution (perfusion deficit). The network was trained by minimizing the average error between the predicted per-vessel probabilities and the disease locations identified on the invasive coronary angiograms. The architecture of the network is illustrated in Fig. 4.10. The role of the fully connected part is to encode features that take into account spatial patterns present in the perfusion maps. This is especially interesting in the context of perfusion territories, as the standard segments divide the LV using geometrical criteria that only approximate the anatomy. Finding a correspondence between the complex coronary territories and the actual coronary damage could therefore outperform the simplification of the standard AHA-based partition, if actual perfusion territories are learned by the network. Another potential interest of deep learning-based quantification in this context is to learn pathological patterns without making assumptions about the healthy distribution. In the SPECT quantification context, as is quite common in clinical practice, the average and standard deviation are used to characterize normal values and detect perfusion deficits, per vessel or over all vessels per patient. In their results, the authors showed that the deep learning classifier had a higher area under the ROC curve (see Chap. 2, footnote 7) than the standard perfusion deficit assessment.


Fig. 4.10 Architecture proposed in [46] for the prediction of coronary occlusion from polar maps extracted from SPECT images

The deep learning network performed better when both the perfusion deficit maps and a thresholded version of those maps were added as input channels. However, this could bias the comparison, as the deep learning model then makes direct use of the very maps on which the standardized diagnosis it is compared against is based. The authors also looked at the gain in sensitivity when the specificity is set to the same value as for the standard method. When looking at this sensitivity gain per vessel, they obtained a significant improvement for the left anterior descending (LAD) coronary artery territory only.
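This style of comparison can be illustrated with scikit-learn: the sketch below computes the area under the ROC curve for a set of per-vessel scores and the sensitivity achieved at a fixed target specificity. The labels, scores and target specificity are fabricated for illustration.

```python
# A sketch of ROC-based evaluation: overall AUC, then the sensitivity of
# the classifier at (at least) the specificity reached by a reference
# standard method.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])       # occlusion present?
y_score = np.array([.1, .3, .8, .6, .2, .9, .4, .7, .3, .5])

print(f"AUC: {roc_auc_score(y_true, y_score):.2f}")

fpr, tpr, _ = roc_curve(y_true, y_score)
target_specificity = 0.80                                 # reference method
# Best sensitivity among operating points with specificity >= target:
sensitivity = tpr[fpr <= 1 - target_specificity].max()
print(f"Sensitivity at {target_specificity:.0%} specificity: {sensitivity:.2f}")
```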

4.7 Leveraging Clinical Reports as a Base of Annotations

As a last example, we briefly describe the use of medical reports to automatically annotate data with pathological labels. In practice, clinical data are often only partially annotated, and the link between the original data and the annotation is not necessarily consistent. This does not fit the need of deep learning algorithms for standardized data and annotations, which in many applications necessitates re-annotation of all data. To cope with disconnected data and annotations, Moradi et al. [47] proposed a deep learning-based architecture to embed images and clinical reports into low-dimensional spaces. Clinical reports are encoded using a Doc2Vec [48] approach, converting short sequences of consecutive words and paragraphs to embedded vectors. The training task for the report embedding consisted of predicting the next word from the embeddings of the paragraph and the previous words. Image embedding, on the other hand, was implemented through a pre-trained VGG network.³ If, for a small portion of the data, reports and images are paired, the authors proposed to train a transform network to predict the paragraph vector of the report corresponding to an input image. After training the transform network, for any new given input image, one can look for its encoding in the low-dimensional space. Then, using the transform network, this image can be mapped to paragraphs in the whole training database of clinical reports (see Fig. 4.11).

3 VGG is a widely used CNN architecture for classification problems [49].


Fig. 4.11 Retrieving similar reports from an input image. This process involves: (1) using the image embedding to project the image into the low-dimensional space; (2) transforming these projected coordinates using the transform network; (3) retrieving the nearest neighbours among the embeddings of all reports from the full training database

From the predicted paragraphs, the pathology and its severity can be inferred. Although not strictly a quantification scenario, this concept is interesting for the pre-annotation (in an unsupervised way) of large datasets from clinical reports. Training networks to reproduce these automatically generated annotations can provide a pre-trained network that can be taken as an initialization for quantification tasks on more controlled data. Starting from pre-trained networks is indeed an excellent strategy to improve the convergence of deep learning training compared to a trivial random initialization. We return to the subject of using machine learning to analyse and generate clinical reports in Chap. 10 (Electronic Health Records).
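To give a flavour of the report-embedding step, the sketch below trains a small Doc2Vec model with the gensim library and retrieves the nearest reports for a query vector. In [47] the query vector would come from the image transform network; here it is simply inferred from a fabricated text, and the three example reports are invented.

```python
# A minimal sketch of report embedding with gensim's Doc2Vec [48],
# followed by nearest-neighbour retrieval in the embedding space.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

reports = [  # fabricated mini-corpus of clinical report snippets
    "left ventricle dilated with severely reduced ejection fraction",
    "normal biventricular size and systolic function",
    "septal hypertrophy consistent with hypertrophic cardiomyopathy",
]
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(reports)]

model = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=50)

# Stand-in for the output of the image transform network:
query = model.infer_vector("dilated left ventricle poor function".split())
print(model.dv.most_similar([query], topn=2))  # closest reports
```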

4.8 Closing Remarks

We have considered the use of deep learning to automate the most common quantification tasks in cardiac imaging. As we saw in the Clinical Introduction, these tasks often form an integral part of clinical workflows but are labour-intensive and prone to intra- and inter-observer variability. Much progress has been made in using deep learning to automate these tasks, and such techniques are now starting to alter clinical workflows. A common theme is the need for thorough clinical validation, in order to demonstrate that these new techniques can be robust to the variation in data that is commonly encountered in the real world. In the following tutorial you will learn how to train and use a deep learning model for a segmentation task using CMR images.

4.9 Exercises

Exercise 1 A range of different deep learning architectures have been employed for quantification and measurement from cardiac imaging, and CNN-based models have tended to dominate. Why do you think this is?


Exercise 2 When developing a machine learning method for cardiac quantification, what are the benefits of validating it on a publicly available database? Can you think of any potential drawbacks?


Exercise 3 For what type of cardiac measurement/quantification problem might a recurrent neural network (RNN) be considered?

Exercise 4 The U-net CNN architecture has been one of the most widely used in cardiac imaging segmentation. Explain the nature and role of the ‘skip connections’ in the U-net.

Exercise 5 Cardiac functional biomarkers such as LV EF can, in principle, be estimated using deep learning directly from the images, i.e. without first segmenting the structures of interest. Speculate as to the advantages and disadvantages of this approach.

Exercise 6 Deriving measurements from echocardiography imaging is generally considered to be more challenging than the same task from CMR. Why do you think this is?

4.10 Tutorial—Cardiac MR Image Segmentation With Deep Learning

Tutorial 3
As for the other notebooks, the contents of this notebook are accessible as Electronic Supplementary Material.

Overview
In this hands-on tutorial, you will study in depth and apply the very popular deep learning U-Net architecture to perform segmentation of 2-D multi-slice CMR images.

This is a 30-year-old problem which has taken an important step forward thanks to deep learning techniques. The material for this hands-on tutorial is based on a recently published study [13] and CMR images from the training set of the open access ACDC dataset [35]. The tutorial has been split into two parts, to better focus on the evaluation of a pre-trained model (first part) and then on the actual training (second part). Both parts can be run independently of each other. The figure below gives an overview of the U-net architecture used in this two-part hands-on tutorial for the segmentation of CMR images.

Objectives
• Consolidate the knowledge you've gained on quantification techniques from this chapter.
• Use a pre-trained model based on the classical U-Net architecture to segment CMR images and assess its performance (part 1 of this hands-on tutorial).
• Train your own U-Net (part 2 of this hands-on tutorial).

Computing Requirements
As for the other hands-on tutorials, this notebook starts with a brief "System setting" section, which imports the necessary packages, installs the potentially missing ones, and imports our own modules. Specific to this tutorial, you will need the tensorflow library, tailored for numerical computing with deep learning models.

4.11 Opinion

AI has the potential to play an important role in automating biomarker estimation in cardiology. It can improve the whole process, from the acquisition of medical images to the analysis and diagnosis of disease.


In image acquisition, AI-based image reconstruction technology can bolster the ability to reconstruct images from the acquired signals, allowing a reduction of scanning times or higher image resolution. In CT imaging, AI-powered reconstruction algorithms can reduce radiation dose, which is still a major factor restricting serial CT scanning, in particular in children with congenital heart diseases. Image classification algorithms can be used for automated plane detection in echocardiography and CMR, allowing a standardization of the acquired images that will improve the comparability of biomarkers between patients, as well as within a patient over time. This standardization can significantly increase the ability to detect diseases early or to follow the impact of treatments on cardiac function over serial acquisitions. All major vendors of cardiac imaging technology are currently rolling out AI-enabled reconstruction methods (Siemens: Deep Resolve, Philips: Precise Suite, GE: AIR™ Recon DL, etc.), while real-time image plane detection during echocardiography exams is close to clinical translation [50].

In image analysis, the use of AI-based, automated image processing frameworks is likely to speed up and standardize biomarker extraction from cardiac imaging and ECG traces. Provided robust quality controls are in place, AI-based analysis could eliminate the inter- and intra-observer variability of measurements. For CMR, all major suppliers of commercial CMR analysis packages have now implemented AI-based (semi-)automated segmentation methods.

In addition to automating conventional image analysis, the use of AI can also provide access to biomarkers that are currently not routinely used. Image analysis is a laborious task and, due to time constraints, a limited set of biomarkers is typically obtained in a clinical setting. Biomarkers such as 3D morphology [51] and dynamic changes of the heart over the cardiac cycle can be sensitive measures of change in myocardial function [52]. With AI-based automated analysis, these measurements could become readily available in clinical practice. In addition, subtle differences in image characteristics, such as grey scale intensity patterns on cine CMR or CT, could potentially reveal important, but not readily observed, information about myocardial structure. AI-based pattern recognition techniques can be used to explore such radiomics-style biomarkers [53]. Altogether, AI-based analysis can provide a wide range of accurate, reproducible measurements of biomarkers from various imaging sources [27]. This development will significantly improve the ability of clinicians to detect diseases earlier and allow tracking of the adaptations in cardiac function after initiating treatments.

Lastly, AI can significantly improve how cardiologists assess cardiac health. The heart is a complex pump; its function is influenced by haemodynamics, the function of the heart valves and the peripheral vasculature. Most commonly used biomarkers are influenced by more than one such factor, while none represents all of these factors at once. As a result, it remains challenging to provide a good assessment of the impact of dysfunction of individual parts of the system on overall cardiac health. AI clustering and prediction algorithms, as well as methods based upon representation disentanglement [54], could potentially be very useful in modelling the complex relationships in the heart using biomarkers obtained from automated imaging and non-imaging sources. In combination with AI-enabled high quality imaging and biomarker extraction, these algorithms can provide a powerful method for better prediction of diagnosis, prognosis and treatment effects. Ultimately, this will empower cardiologists to make better decisions about treatments for individual patients, helping to improve survival and quality of life for those suffering from heart diseases.

Acknowledgements BR was supported by the NIHR Cardiovascular MedTech Co-operative award to the Guy's and St Thomas' NHS Foundation Trust and the Wellcome/EPSRC Centre for Medical Engineering at King's College London (WT 203148/Z/16/Z).

References
1. Litjens G, Ciompi F, Wolterink JM, de Vos BD, Leiner T, Teuwen J, Išgum I. State-of-the-art deep learning in cardiovascular image analysis. JACC: Cardiovasc Imaging. 2019;12(8), Part 1:1549−65.
2. Sherstinsky A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena. 2020;404:132306.
3. Taheri Dezaki F, Liao Z, Luong C, Girgis H, Dhungel N, Abdi AH, Behnami D, Gin K, Rohling R, Abolmaesumi P, Tsang T. Cardiac phase detection in echocardiograms with densely gated recurrent neural networks and global extrema loss. IEEE Trans Med Imaging. 2019;38(8):1821−32.
4. Choi E, Schuetz A, Stewart WF, Sun J. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inf Assoc. 2016;24(2):361−70.
5. Wu D, Wang X, Bai J, Xu X, Ouyang B, Li Y, Zhang H, Song Q, Cao K, Yin Y. Automated anatomical labeling of coronary arteries via bidirectional tree LSTMs. Int J Comput Assist Radiol Surg. 2019;14(2):271−80.
6. Østvik A, Smistad E, Aase SA, Haugen BO, Lovstakken L. Real-time standard view classification in transthoracic echocardiography using convolutional neural networks. Ultrasound Med Biol. 2019;45(2):374−84.
7. Abdi AH, Luong C, Tsang T, Allan G, Nouranian S, Jue J, Hawley D, Fleming S, Gin K, Swift J, Rohling R, Abolmaesumi P. Automatic quality assessment of echocardiograms using convolutional neural networks: feasibility on the apical four-chamber view. IEEE Trans Med Imaging. 2017;36(6):1221−30.
8. Lessmann N, van Ginneken B, Zreik M, de Jong PA, de Vos BD, Viergever MA, Išgum I. Automatic calcium scoring in low-dose chest CT using deep neural networks with dilated convolutions. IEEE Trans Med Imaging. 2017;37(2):615−25.
9. Trullo R, Petitjean C, Nie D, Shen D, Ruan S. Joint segmentation of multiple thoracic organs in CT images with two collaborative deep architectures. In: Deep learning in medical image analysis and multimodal learning for clinical decision support. 2017. p. 21−9.
10. Bank D, Koenigstein N, Giryes R. Autoencoders. arXiv. 2020.
11. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. Medical image computing and computer-assisted intervention – MICCAI. Springer; 2015. p. 234−41.
12. Leclerc S, Smistad E, Pedrosa J, Østvik A, Cervenansky F, Espinosa F, Espeland T, Berg EAR, Jodoin PM, Grenier T, Lartizien C, D'hooge J, Lovstakken L, Bernard O. Deep learning for segmentation using an open large-scale dataset in 2D echocardiography. IEEE Trans Med Imaging. 2019;38(9):2198−210.
13. Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng P, Cetin I, Lekadir K, Camara O, Ballester MAG, Sanroma G, Napel S, Petersen SE, Tziritas G, Grinias E, Khened M, Varghese A, Krishnamurthi G, Rohé M, Pennec X, Sermesant M, Isensee F, Jaeger P, Maier-Hein KH, Full PM, Wolf I, Engelhardt S, Baumgartner CF, Koch LM, Wolterink JM, Isgum I, Jang Y, Hong Y, Patravali J, Jain S, Humbert O, Jodoin P. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans Med Imaging. 2018;37(11):2514−25.


14. Bai W, Sinclair M, Tarroni G, Oktay O, Rajchl M, Vaillant G, Lee A, Aung N, Lukaschuk E, Sanghvi M, et al. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. J Cardiovasc Magn Reson. 2018;20(1):65.
15. Zreik M, Lessmann N, van Hamersvelt RW, Wolterink JM, Voskuil M, Viergever MA, Leiner T, Išgum I. Deep learning analysis of the myocardium in coronary CT angiography for identification of patients with functionally significant coronary artery stenosis. Med Image Anal. 2018;44:72−85.
16. Wolterink JM, Leiner T, Viergever MA, Išgum I. Generative adversarial networks for noise reduction in low-dose CT. IEEE Trans Med Imaging. 2017;36(12):2536−45.
17. Zhang Z, Yang L, Zheng Y. Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. In: IEEE/CVF conference on computer vision and pattern recognition. 2018. p. 9242−51.
18. Spier N, Nekolla S, Rupprecht C, Mustafa M, Navab N, Baust M. Classification of polar maps from cardiac perfusion imaging with graph-convolutional neural networks. Sci Rep. 2019;9(1):1−8.
19. Lu P, Bai W, Rueckert D, Noble JA. Multiscale graph convolutional networks for cardiac motion analysis. In: Ennis DB, Perotti LE, Wang VY, editors. Functional imaging and modeling of the heart. Cham: Springer; 2021. p. 264−72.
20. Vigneault DM, Xie W, Ho CY, Bluemke DA, Noble JA. Omega-Net: fully automatic, multi-view cardiac MR detection, orientation, and segmentation with deep neural networks. Med Image Anal. 2018;48:95−106.
21. Xiong Z, Fedorov VV, Fu X, Cheng E, Macleod R, Zhao J. Fully automatic left atrium segmentation from late gadolinium enhanced magnetic resonance imaging using a dual fully convolutional neural network. IEEE Trans Med Imaging. 2019;38(2):515−24.
22. Duan J, Bello G, Schlemper J, Bai W, Dawes TJW, Biffi C, de Marvao A, Doumoud G, O'Regan DP, Rueckert D. Automatic 3D bi-ventricular segmentation of cardiac images by a shape-refined multi-task deep learning approach. IEEE Trans Med Imaging. 2019;38(9):2151−64.
23. Smistad E, Østvik A, Salte IM, Melichova D, Nguyen TM, Haugaa K, Brunvand H, Edvardsen T, Leclerc S, Bernard O, Grenne B, Løvstakken L. Real-time automatic ejection fraction and foreshortening detection using deep learning. IEEE Trans Ultrason Ferroelectr Freq Control. 2020;67(12):2595−604.
24. Østvik A, Salte IM, Smistad E, Nguyen TM, Melichova D, Brunvand H, Haugaa K, Edvardsen T, Grenne B, Lovstakken L. Myocardial function imaging in echocardiography using deep learning. IEEE Trans Med Imaging. 2021;1.
25. Oktay O, Ferrante E, Kamnitsas K, Heinrich M, Bai W, Caballero J, Cook SA, de Marvao A, Dawes T, O'Regan DP, Kainz B, Glocker B, Rueckert D. Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE Trans Med Imaging. 2018;37(2):384−95.
26. Simantiris G, Tziritas G. Cardiac MRI segmentation with a dilated CNN incorporating domain-specific constraints. IEEE J Sel Top Signal Process. 2020;14(6):1235−43.
27. Ruijsink B, Puyol-Antón E, Oksuz I, Sinclair M, Bai W, Schnabel JA, Razavi R, King AP. Fully automated, quality-controlled cardiac analysis from CMR: validation and large-scale application to characterize cardiac function. JACC: Cardiovasc Imaging. 2020;13(3):684−95.
28. Leclerc S, Smistad E, Østvik A, Cervenansky F, Espinosa F, Espeland T, Rye Berg EA, Belhamissi M, Israilov S, Grenier T, Lartizien C, Jodoin PM, Lovstakken L, Bernard O. LU-Net: a multistage attention network to improve the robustness of segmentation of left ventricular structures in 2-D echocardiography. IEEE Trans Ultrason Ferroelectr Freq Control. 2020;67(12):2519−30.
29. Ye C, Wang W, Zhang S, Wang K. Multi-depth fusion network for whole-heart CT image segmentation. IEEE Access. 2019;7:23421−9.
30. Chen C, Qin C, Qiu H, Tarroni G, Duan J, Bai W, Rueckert D. Deep learning for cardiac image segmentation: a review. Front Cardiovasc Med. 2020;7:25.
31. Sunnybrook challenge website. https://www.cardiacatlas.org/studies/sunnybrook-cardiacdata/.


32. STACOM challenge 2011 website. https://www.cardiacatlas.org/challenges/lvsegmentation-challenge/.
33. MICCAI RV challenge website. https://rvsc.projets.litislab.fr/.
34. Kaggle challenge website. https://www.kaggle.com/c/second-annual-data-science-bowl.
35. ACDC challenge website. https://www.creatis.insa-lyon.fr/Challenge/acdc/.
36. M&Ms challenge website. https://www.ub.edu/mnms/.
37. Bernard O, Bosch JG, Heyde B, Alessandrini M, Barbosa D, Camarasu-Pop S, Cervenansky F, Valette S, Mirea O, Bernier M, Jodoin P-M, Domingos JS, Stebbing RV, Keraudren K, Oktay O, Caballero J, Shi W, Rueckert D, Milletari F, Ahmadi S-A, Smistad E, Lindseth F, van Stralen M, Wang C, Smedby O, Donal E, Monaghan M, Papachristidis A, Geleijnse ML, Galli E, D'hooge J. Standardized evaluation system for left ventricular segmentation algorithms in 3D echocardiography. IEEE Trans Med Imaging. 2016;35(4):967−77.
38. EchoNet website. https://echonet.github.io/dynamic/.
39. Isensee F, Jaeger PF, Full PM, Wolf I, Engelhardt S, Maier-Hein KH. Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features. In: Statistical atlases and computational models of the heart. 2017. p. 120−9.
40. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203−11.
41. Antonelli M, Reinke A, Bakas S, Farahani K, Kopp-Schneider A, Landman BA, Litjens G, Menze B, Ronneberger O, Summers RM, van Ginneken B, Bilello M, Bilic P, Christ PF, Do RKG, Gollub MJ, Heckers SH, Huisman H, Jarnagin WR, McHugo MK, Napel S, Pernicka JSG, Rhode K, Tobon-Gomez C, Vorontsov E, Huisman H, Meakin JA, Ourselin S, Wiesenfarth M, Arbelaez P, Bae B, Chen S, Daza L, Feng J, He B, Isensee F, Ji Y, Jia F, Kim N, Kim I, Merhof D, Pai A, Park B, Perslev M, Rezaiifar R, Rippel O, Sarasua I, Shen W, Son J, Wachinger C, Wang L, Wang Y, Xia Y, Xu D, Xu Z, Zheng Y, Simpson AL, Maier-Hein L, Cardoso MJ. The medical segmentation decathlon. Nat Commun. 2022;13:4128.
42. Bogaert J, Dymarkowski S, Taylor A, Muthurangu V. Cardiac function. In: Clinical cardiac MRI. Springer; 2012. p. 109−68.
43. Wang S, Patel H, Miller T, Ameyaw K, Narang A, Chauhan D, Anand S, Anyanwu E, Besser SA, Kawaji K, Liu XP, Lang RM, Mor-Avi V, Patel AR. AI based CMR assessment of biventricular function: clinical significance of intervendor variability and measurement errors. JACC: Cardiovasc Imaging. 2021.
44. Mariscal Harana J, Vergani V, Asher C, Razavi R, King A, Ruijsink B, Puyol Anton E. Large-scale, multi-vendor, multi-protocol, quality-controlled analysis of clinical cine CMR using artificial intelligence. Eur Heart J Cardiovasc Imaging. 2021;22(Supplement_2).
45. Wei H, Cao H, Cao Y, Zhou Y, Xue W, Ni D, Li S. Temporal-consistent segmentation of echocardiography with co-learning from appearance and shape. In: International conference on medical image computing and computer-assisted intervention. Springer; 2020. p. 623−32.
46. Betancur J, Commandeur F, Motlagh M, Sharir T, Einstein AJ, Bokhari S, Fish MB, Ruddy TD, Kaufmann P, Sinusas AJ, Miller EJ, Bateman TM, Dorbala S, Di Carli M, Germano G, Otaki Y, Tamarappoo BK, Dey D, Berman DS, Slomka PJ. Deep learning for prediction of obstructive disease from fast myocardial perfusion SPECT: a multicenter study. JACC: Cardiovasc Imaging. 2018;11(11):1654−63.
47. Moradi M, Guo Y, Gur Y, Negahdar M, Syeda-Mahmood T. A cross-modality neural network transform for semi-automatic medical image annotation. In: Ourselin S, Joskowicz L, Sabuncu MR, Unal G, Wells W, editors. Medical image computing and computer-assisted intervention – MICCAI 2016. Cham: Springer; 2016. p. 300−7.
48. Le Q, Mikolov T. Distributed representations of sentences and documents. In: International conference on machine learning. PMLR; 2014. p. 1188−96.
49. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2015.
50. Madani A, Arnaout R, Mofrad M, Arnaout R. Fast and accurate view classification of echocardiograms using deep learning. npj Digit Med. 2018;1:6.


51. Bruse JL, Ntsinjana H, Capelli C, Biglino G, McLeod K, Sermesant M, Pennec X, Hsia T-Y, Schievano S, Taylor A. CMR-based 3D statistical shape modelling reveals left ventricular morphological differences between healthy controls and arterial switch operation survivors. J Cardiovasc Magn Reson. 2016;18.
52. Chamsi-Pasha MA, Zhan Y, Debs D, Shah DJ. CMR in the evaluation of diastolic dysfunction and phenotyping of HFpEF: current role and future perspectives. JACC: Cardiovasc Imaging. 2020;13(1), Part 2:283−96. Special issue: Noninvasive Assessment of Left Ventricular Diastolic Function.
53. Juarez-Orozco L, Yeung M, Knol R, Benjamins J, Ruijsink B, Martinez-Manzanera O, Knuuti J, Asselbergs F, Van Der Zant F, Van Der Harst P. Predicting cardiovascular risk traits from PET myocardial perfusion imaging with deep learning. Eur Heart J. 2020;41(S2).
54. Chartsias A, Joyce T, Papanastasiou G, Semple S, Williams M, Newby DE, Dharmakumar R, Tsaftaris SA. Disentangled representation learning in cardiac image analysis. Med Image Anal. 2019;58:101535.

5 Diagnosis

Daniel Rueckert, Moritz Knolle, Nicolas Duchateau, Reza Razavi and Georgios Kaissis

Contents
5.1 Clinical Introduction
5.2 Overview
5.3 Classical Machine Learning Pipeline for Diagnosis
5.4 Deep Learning Approaches for Diagnosis
5.5 Machine Learning Applications for Diagnosis
5.6 Machine Learning Approaches Based on Radiomics
5.7 Machine Learning Approaches for Large-Scale Population Studies
5.8 Challenges
5.9 Closing Remarks
5.10 Exercises
5.11 Tutorial—Two-Class and Multi-class Diagnosis
5.12 Opinion
References

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-05071-8_5.
Authors' contribution: • Introduction, Opinion: RR. • Main chapter: DR, MK, GK. • Tutorial: ND.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_5


Learning Objectives
At the end of this chapter you should be able to:
O5.A Explain the classical machine learning pipeline for medical diagnosis problems
O5.B Describe the key characteristics of commonly used classical machine learning models in diagnosis, such as SVMs and decision trees/forests
O5.C List the types of deep learning architecture that can be applicable to diagnosis problems
O5.D Describe some specific applications for machine learning-based diagnosis in cardiology
O5.E Explain the key challenges involved in the use of machine learning in cardiac diagnosis

5.1 Clinical Introduction

The use of AI in medicine is gaining traction, with many examples moving from research into clinical prototypes and products. Examples include medical record mining [1], predictive clinical decision support systems [2], and, its widest application, the interpretation of medical imaging to help with improving both diagnosis and prognosis of disease [3, 4]. Because of the increasing wealth of digital data that is generated, clinicians need to be able to find more efficient ways of meaningfully combining these data to deliver precision-based medicine. AI can not only enable routine tasks to be performed more efficiently but also provide new insights into disease processes that were previously not achievable by manual review and analysis due to time and labour constraints [5]. Diagnosis and treatment planning of cardiovascular disease is now increasingly reliant on imaging methods such as echocardiography [6], CT [7] and CMR [8]. These generate large amounts of data and yet clinical decision making can often come down to a small number of derived parameters such as the LV EF that use a limited amount of the available acquired imaging information. The application of AI methods to better utilize the available imaging data, overcome challenges with less-than-optimal reproducibility of some of the key biomarkers and reduce the manual workload and time taken to analyse the data is looking promising. AI methods are now being integrated into many clinical products particularly in image analysis [6−8] but also in image acquisition [7, 8]. Other diagnostic methods such as retinal scanning are also amenable to AI methods. Researchers are now looking to combine the power of AI with the non-invasive ease of retinal scanning to examine the workings of the heart and predict changes in the macrovasculature based on microvascular features and function [9]. In addition to addressing the variability associated with subjective image interpretation, AI can address the spatial and temporal pathologic heterogeneity of cardiovascular clinical phenotypes by allowing more detailed feature extraction around regions of interest [10]. This allows clinicians to use additional quantifiable features that relate more objectively and in more detail to the underlying clinical


condition [11]. By extracting a multitude of information generated from images and non-imaging data, AI methods also provide the essential link to uncovering associations between clusters of patients in a fully automated manner [12]. Examples of the use of AI clustering in patients with heart failure have shown that it is possible to identify patient groups with different outcomes (for example, median 21-month survival of 26% versus 63% in patients with heart failure with preserved EF [13]), and even different responses to treatment in a larger heart failure cohort [14]. These capabilities will not replace but rather augment the clinical decision process in a more efficient, user-friendly way, which should translate into improved patient care. Recent applications of AI in medical imaging provide proof of concept of its utility and, on the whole, high performance, with an accuracy paralleling that of human expertise [15, 16].

5.2 Overview

Over the last decade, AI and machine learning techniques have made significant advances. In particular, as we have seen in Chap. 3 and elsewhere in the book, deep learning [17] has emerged as a powerful framework for solving perceptual tasks across many different application domains, including medicine [18]. Often, deep learning can achieve a level of performance that is comparable to humans (and in some cases even outperforming them) [19, 20]. In this chapter, we will first introduce some machine learning approaches that have been proposed for use in the context of automated medical diagnosis. We begin with classical machine learning approaches before reviewing deep learning approaches. In the subsequent sections we will review their application to diagnosis problems in cardiology as well as discuss challenges for clinical deployment.

5.3 Classical Machine Learning Pipeline for Diagnosis

Traditionally, the process of building a system for diagnosis in medicine consists of two stages (see Fig. 5.1). In the first stage, information is extracted from the data (e.g. images, signals or clinical data) and in the second stage this information is used to build a statistical model that can perform classification. In machine learning, the information that is extracted from the data and used as input to the statistical model is typically referred to as the features. In the context of clinical decision making, these features are often referred to as biomarkers, which serve as measurable indicators of the biological state or condition of the patient. For example, left ventricular myocardial mass or LV EF may be important characteristics when building a diagnosis system for cardiovascular diseases. Additionally, clinical data such as laboratory results (creatine kinase, lactate dehydrogenase, troponin, etc.) or results from other examinations (stress echocardiography, ECG) can be included. In the following we briefly review some of the most commonly used machine learning models for performing classification using such features.
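A minimal scikit-learn sketch of this two-stage pipeline is shown below, with a handful of fabricated biomarker values standing in for the feature-extraction stage; it is purely illustrative.

```python
# A sketch of the two-stage pipeline in Fig. 5.1: pre-extracted
# biomarkers (stage 1) feed a statistical classifier (stage 2).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Each row: [LV EF (%), LV myocardial mass (g), troponin (ng/L)] -- made up.
X = np.array([[62, 110, 10], [35, 180, 60], [58, 120, 12], [28, 210, 90]])
y = np.array([0, 1, 0, 1])  # 0 = normal, 1 = disease (illustrative labels)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(clf.predict([[40, 170, 55]]))  # predicted class for a new patient
```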


Fig. 5.1 Classical machine learning pipeline for diagnosis from cardiac data, with the two distinct stages of feature extraction and model learning

Fig. 5.2 Comparison of the resulting decision boundaries for different supervised classifiers introduced in this chapter

An overview of the decision boundaries they produce can be found in Fig. 5.2.

Support Vector Machines: The support vector machine (SVM) [21] model, which we first mentioned in Chap. 2, is a very popular algorithm for supervised learning that was first proposed in the 1990s. It offers robustness and easy applicability to a wide range of problems, domains and types of data without the need for expert prior knowledge. SVMs, which can be used for classification and regression tasks, construct a maximum margin separator that defines a decision boundary with maximum distance to its support vectors. Specifically, SVMs construct a so-called soft decision boundary, which is less sensitive to outliers in the data than other approaches. The decision boundary is learned from the training data and assigns classes to data points based on their position in feature space with respect to the maximum margin separator. This approach incorporates ideas from statistical learning theory [22] to address a common practical problem, namely that for a given dataset (of limited size) there often exist many solutions that split the training data perfectly.

For non-linearly separable data, a so-called kernel function (see also the Technical Note in Chap. 3, Sect. 3.4) can be used to transform data points into a higher-dimensional feature space where they become linearly separable (this is often referred to as the kernel trick). SVMs represent a non-parametric classification method, meaning that no explicit parameters are learned to define (parametrize) the decision boundary. Instead, a set of data points (the support vectors) is used to construct the separating hyperplane in a way that maximizes the distance between the support vectors of the two classes. Of note, the original mathematical formulation of the SVM is only defined for the binary case; however, it can easily be extended to the multi-class case by performing one-against-rest classification with multiple binary SVMs [23], albeit at a much increased computational cost. For instance, a binary kernel SVM has a worst-case time complexity of O(n³ × m) (see Technical Note, below), where n is the number of training examples and m the number of features. This difficulty of scaling SVMs to large datasets and multi-class prediction, as well as the fact that deep neural networks have been shown to outperform SVMs in most applications, has led to a drop in their popularity in the more recent machine learning literature. Despite this, SVMs can still be an attractive option for certain use cases (online learning, outlier detection, etc.), especially when only a small or intermediate-sized training dataset is available. Furthermore, SVM approaches can also be adapted to regression problems.

Technical Note
The notation O(...) seen above is known as "Big-O notation". It is commonly used to indicate the computational complexity of a task. For example, if n is the number of training samples, O(n) means that the algorithm takes a time that is proportional to n, O(n²) means that the time increases quadratically with n, etc.
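The following scikit-learn sketch illustrates these concepts: a binary SVM with an RBF kernel (the kernel trick) and a one-against-rest extension to three classes. The data are synthetic stand-ins for biomarker features.

```python
# A sketch of a kernel SVM (binary) and its one-against-rest multi-class
# extension with scikit-learn; the data are synthetic and illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = np.repeat([0, 1, 2], 30)    # three diagnostic classes
X += y[:, None]                 # shift each class apart in feature space

binary_svm = SVC(kernel="rbf", C=1.0)  # kernel trick for non-linear data
mask = y < 2
binary_svm.fit(X[mask], y[mask])       # original binary formulation

multi_svm = OneVsRestClassifier(SVC(kernel="rbf"))  # one-against-rest
multi_svm.fit(X, y)
print(multi_svm.predict(X[:5]))
```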

Decision Trees and Forests: Decision trees and forests were also mentioned as types of machine learning model in Chap. 2. A decision tree is a fundamental data structure that can be used to make predictions. While decision trees are commonly used in machine learning, they are also used outside of it, e.g. in operations research, and even in clinical guidelines, to help identify the strategy most likely to reach a goal. In the context of machine learning [24], a decision tree is a tree in which each internal (or non-leaf) node corresponds to a split into sub-trees according to an input feature. Each leaf node is labelled with a prediction or a probability distribution over multiple predictions. Depending on the type of predictions stored at the leaf nodes, one can differentiate between two types of decision tree: decision trees where the predicted variable takes continuous values (typically real numbers) are called regression trees, and decision trees where the predicted variable takes categorical values (typically class labels) are called classification trees.


Each of the internal nodes corresponds to a split of the training data according to an input feature. Such a split can be thought of as a weak learner, since a single split of the training data is unlikely to produce a very accurate prediction; by creating a set of splits that are hierarchically organized (in the form of a tree), a better prediction can be obtained. Hence, a key step in creating a good set of hierarchical splits is to determine how to split the training data at each node. These splits are typically determined in a top-down fashion by choosing, at each step, the input feature that "best" splits the training set. Different criteria such as the Gini impurity or information gain can be used to determine the optimal split [25], but in general these approaches aim to measure the homogeneity of the target prediction within the subsets after the split. While decision trees are a simple and elegant way of building predictors, their performance is often limited in real-world applications. One way to build stronger predictors is to combine multiple decision trees into so-called decision forests. Such decision forests belong to the class of so-called ensemble machine learning methods (see Technical Note, Sect. 4.4), and one can differentiate between different approaches to constructing the ensemble. In random forests, multiple decision trees are built using a technique called bagging, where the training data are repeatedly resampled with replacement and the final prediction is obtained by integrating the predictions across the different trees using voting schemes. An alternative approach is based on a technique called boosting, which builds an ensemble classifier by training each new instance to emphasize the training instances that were previously misclassified; the different classifiers are then combined in a weighted voting scheme, as in AdaBoost [26].
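The two ensemble strategies can be contrasted in a few lines of scikit-learn, as in the sketch below; the dataset is synthetic and the hyperparameters are illustrative.

```python
# A sketch contrasting bagging (random forest) and boosting (AdaBoost)
# as ways of combining many weak decision trees into a strong predictor.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

bagged = RandomForestClassifier(n_estimators=100, random_state=0)
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, model in [("Random forest", bagged), ("AdaBoost", boosted)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```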

5.4 Deep Learning Approaches for Diagnosis

Deep learning, as introduced in Chap. 3, is based upon the concept of artificial neural networks and offers several advantages for visual information processing, including the ability to learn feature representations with multiple layers of abstraction as well as the ability for end-to-end learning (see Fig. 5.3) [17]. Furthermore, deep learning approaches eliminate the need for hand-crafted features and classifiers that otherwise have to be tuned by experts to specific tasks. Instead, they enable end-to-end learning in which both the features and the classifiers are learned directly from the available training data. This ensures that the features and classifiers are optimally suited to the task at hand, although it sometimes comes at the cost that the learnt features are not as meaningful (or interpretable) to end-users, i.e. clinicians. Supervised deep learning approaches often employ convolutional neural networks (CNNs) [27, 28] (see Chap. 3, Sect. 3.5). As we have seen, CNNs consist of many layers that transform their input via convolutions with filters that are learned from the data, making them well suited to images. In contrast, supervised approaches applied to temporal or sequential data (e.g. audio or text) often use recurrent neural networks (RNNs, see Chap. 4, Sect. 4.3) or long short-term memory (LSTM) networks [29].

Fig. 5.3 Deep learning pipeline with end-to-end trainable feature learning and model learning

In unsupervised deep learning approaches, neural networks based on autoencoders [30] or variational autoencoders [31] are frequently used to reduce the dimensionality of the data (see Chap. 4, Encoder-decoder Networks). In fact, autoencoders also often make use of convolutions in their neural network architecture. Such dimensionality reduction techniques can "simplify" the representation of the data in a way that renders it more conducive to processing by other algorithms, e.g. by subsequent supervised learning algorithms. However, the utilization of unsupervised architectures is not limited to pre-processing, as their output can also be used for diagnosis or scientific discovery. For instance, clustering (see Chap. 2, Sect. 2.3) can uncover underlying subgroups in datasets, such as groups of patients with similar characteristics, which might, for example, exhibit a common response to a certain medication. Another approach to unsupervised learning is based on generative adversarial networks (GANs, see Chap. 4, Sect. 4.3) [32] and its variations [33]. As we discussed in Chap. 4, in GANs two neural networks (the generator and discriminator) compete with each other to generate new data with the same statistics as the training set. In the following sections, we review the most common architectures for deep learning applied to diagnosis problems in more detail.¹

1 Editors' note: There is some overlap in content between these descriptions and those provided in Chap. 4 but we choose to include both as we believe they act as complementary perspectives on these important concepts.

Encoder-decoder networks: Some tasks, such as semantic image segmentation, require dense (pixelwise) predictions. Applying standard CNN classification architectures to such a problem means that the associated computational complexity scales with the image size. Encoder-decoder networks (EDNs), such as the fully convolutional network [34], are a much more efficient approach to addressing dense prediction tasks. Such networks consist of two parts: the encoder, in which features are extracted and progressively downsampled, and the decoder, in which features (from the encoder) are progressively upsampled to produce an output at the end of the network with identical shape to the input. The rationale behind this technique is the progressive distillation of a large number of complex image features in the encoder and their recombination to form new features in the decoder.

Label targets used to train an EDN are often segmentation masks; hence, during training, the network's output is compared to the ground truth label map via overlap measures such as the commonly used Dice coefficient [35]. Parameter updates are then applied iteratively using a gradient-based optimization method such as stochastic gradient descent (see Chap. 3, Two-Class Prediction). A fully convolutional network is an EDN in which only convolutional and pooling layers (see Technical Note, Sect. 4.5) are used to extract and process features. A commonly used EDN architecture for medical image segmentation is the U-net [36], in which skip-connections from encoder to decoder were added to the fully convolutional network architecture, improving the convergence, performance and robustness of the original architecture.

Generative adversarial networks (GANs): Generative adversarial networks (GANs) [32] are a relatively recent neural network architecture and training paradigm, whereby two competing neural networks (generator and discriminator) are trained simultaneously to produce a powerful generative model. More precisely, the generator, conditioned on random noise vectors, generates artificial data samples, while the discriminator tries to determine whether samples are fake or belong to the target population. While, in theory, GANs are capable of approximating any data-generating distribution given enough training data, the training process of GANs is often unstable and sensitive to the choice of hyperparameters. This can be caused by a multitude of reasons, but most of them relate to a disparity in the learning progress of the generator and the discriminator. As a result, countless variations and improvements of the original implementation have been proposed to tackle these problems (an overview and applications for medical imaging can be found in [37]). Conditional GANs (cGANs), such as the Pix2Pix GAN [38], are a particularly interesting GAN (re)formulation, where the generator can be conditioned on additional input information (e.g. images). This more sophisticated sampling method allows them to transfer features from one image to another. Conditional GANs can be used for so-called domain adaptation, whereby, for example, CT images can be transformed into virtual MR images. Another potential medical use lies in the generation of training data in scenarios where data are scarce [39] and/or there are associated privacy concerns, as GANs can be used to generate realistic-looking, yet fake and thus private, medical image data [40]. Additionally, GANs can be used for classification tasks by using part of either the generator or discriminator as a feature extractor, or alternatively by using the discriminator as a classifier. These GAN-based classification approaches have been shown to perform on a par with supervised neural network architectures, but require much less data while also potentially limiting the effect of domain overfitting [41].

Autoencoders: As mentioned earlier, an autoencoder [30] is a type of EDN that transforms (often high-dimensional) input data into a lower-dimensional latent vector representation in an unsupervised fashion. An autoencoder learns the optimal latent representation of the training data by attempting to reconstruct the original input data solely based on the encoded latent information. While autoencoders are an excellent dimensionality reduction technique, the resultant latent space representation is often incomplete and not optimally suited for generative sampling purposes.
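A minimal TensorFlow/Keras sketch of an autoencoder used for dimensionality reduction is given below; the layer sizes and the latent dimension of 8 are arbitrary illustrative choices.

```python
# A minimal fully connected autoencoder: the encoder compresses inputs
# to a low-dimensional latent vector, the decoder reconstructs them.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(784,))          # e.g. a flattened image
z = layers.Dense(64, activation="relu")(inputs)
latent = layers.Dense(8, name="latent")(z)   # low-dimensional code
z = layers.Dense(64, activation="relu")(latent)
outputs = layers.Dense(784)(z)               # reconstruction of the input

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=10)  # trained to reproduce its own input

# The trained encoder alone maps data to the latent space:
encoder = tf.keras.Model(inputs, latent)
```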

Variational autoencoders (VAEs) [31] explicitly learn to parametrize a Gaussian distribution from which sampling is performed. This more principled approach makes VAEs much better suited to generative tasks than conventional autoencoders. However, the images produced by VAEs usually look much less realistic than GAN-produced images, especially when high-resolution images are desired.

Bayesian deep learning: The Bayesian inference framework, based around Bayes' theorem (the principle that prior assumptions influence posterior beliefs), offers the most complete approach for reasoning under uncertainty and is therefore a key component in building real-world systems for safety-critical applications such as self-driving cars or computer-aided diagnosis tools. Quantifying predictive confidence, as well as uncertainty-based referral to human experts [42], is thus crucial for establishing and promoting trust in an automated diagnosis system deployed in a real-world clinical setting. In practice, traditional (point-estimate) neural networks are often over-confident about their predictions, highlighting the need for a sound approach to modelling uncertainty in deep learning. In Bayesian neural networks (BNNs) [43], each parameter is represented using a (posterior) probability distribution to model uncertainty. The computation of this distribution is, however, usually intractable due to the requirement to calculate high-dimensional integrals for its precise specification. Two main techniques are used to avoid this intractable computation: variational inference and Markov Chain Monte Carlo (MCMC) [44]. In variational inference, the posterior distribution is approximated by a simpler (variational) distribution, which is learned during training by optimizing the distributional similarity between the variational distribution and the true posterior. MCMC methods, on the other hand, take a sampling-based approach to computing the posterior by randomly drawing samples from areas of the posterior with high probability density. Each method has its own benefits and drawbacks: in general, variational inference tends to be significantly faster but produces a biased estimate of the posterior distribution, whereas MCMC is capable of exactly representing the posterior but is both slower and computationally more expensive. In recent years, these methods have been complemented by newly proposed approximate methods, empirically shown to provide reasonable uncertainty estimates without requiring the above-mentioned inference techniques. For example, Monte Carlo dropout [45] utilizes a technique originally proposed for regularization, and DeepEnsembles [46] leverages ensembles of neural networks to quantify predictive uncertainty. These two methods are performant and easy-to-implement approaches to making any neural network Bayesian.
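As an illustration of these approximate methods, the following TensorFlow/Keras sketch implements Monte Carlo dropout: dropout is kept active at inference time and the spread of repeated stochastic predictions is used as an uncertainty estimate. The architecture and data are fabricated.

```python
# A sketch of Monte Carlo dropout: keep dropout on at test time
# (training=True) and summarize repeated stochastic predictions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(16,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dropout(0.5)(x, training=True)   # dropout stays active always
outputs = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

x_test = np.random.normal(size=(1, 16)).astype("float32")
samples = np.stack([model(x_test).numpy() for _ in range(50)])
# Mean = prediction; standard deviation = uncertainty estimate.
print(f"prediction {samples.mean():.2f} +/- {samples.std():.2f}")
```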

5.5 Machine Learning Applications for Diagnosis

As described above, the computer-aided diagnosis of cardiovascular diseases plays an increasingly important role in clinical routine. A task that is commonly addressed using machine learning approaches is the classification of different cardiac pathologies. For example, the Automatic Cardiac Diagnosis Challenge (ACDC) [47], which primarily focuses on cardiac image segmentation, also proposes to diagnose different diseases with abnormal myocardial shape: in addition to normal subjects, it includes (1) patients with systolic heart failure with infarction, (2) patients with dilated cardiomyopathy, (3) patients with hypertrophic cardiomyopathy and (4) patients with an abnormal RV. Many machine learning approaches use this dataset as a benchmark for cardiac disease classification [48−51]. The majority of recent approaches use deep learning for disease classification, using information about cardiac morphology as well as cardiac function. However, these approaches often do not allow for easy interpretation of the classification results. In [52], the authors tackle this problem by developing an interpretable deep learning model for disease classification using cardiac shape information. They exploit deep generative networks to model a population of anatomical shapes through a hierarchy of conditional latent variables. The approach has been shown to provide high classification accuracy as well as visualization of both global and regional anatomical features that discriminate between different pathologies. The interpretability of deep learning approaches is also the focus of the work in [53], in which a CNN model is used together with a VAE to learn a discriminative latent space for classification. Using the idea of 'concept activation vectors' [54], the latent space is then visualized in terms of diagnostically meaningful clinical parameters. In [55], the authors classify different cardiac pathologies by combining features derived from segmentations of the cardiac anatomy, their shapes and motion patterns. A similar approach is pursued in [56], which uses a multi-modal database of CMR and echocardiography images to learn cardiac motion patterns. During inference, only motion from the echocardiography images is used to discriminate between normal subjects and patients with dilated cardiomyopathy. Other approaches that focus on the analysis of cardiac function from echocardiography images [57] have gained a lot of attention due to the wide availability of this modality. However, not all approaches focus on using image data as the primary source of information. For example, ECG data is a widely available source of important physiological information about cardiac abnormalities, and the analysis of ECG signals using machine learning approaches can provide powerful diagnostic tools [58].

5.6 Machine Learning Approaches Based on Radiomics

In the context of cardiovascular imaging, so-called radiomics approaches also play an important role in diagnosis. Radiomics approaches aim to extract a large number of shape- or texture-based features from images, which may then be used as predictor variables in statistical models for diagnosis. Radiomics has been successfully used in oncology [59] and more recently also in cardiology [50, 60]. The success of radiomics approaches depends heavily on the type of images used, as the reproducibility of the extracted shape and texture information is critical to their reliability. Furthermore, standardization of the imaging data is crucial when data from multiple hospitals or imaging centres are used.
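As an illustration of the kind of feature extraction involved, the sketch below uses the open-source pyradiomics package; the file names are hypothetical placeholders, and the default extraction settings are assumed rather than taken from any study cited above.

# pip install pyradiomics
from radiomics import featureextractor

# With default settings, shape, first-order and texture (GLCM, GLRLM, ...)
# features are computed inside the region defined by the segmentation mask
extractor = featureextractor.RadiomicsFeatureExtractor()
features = extractor.execute("patient001_image.nii.gz", "patient001_myo_mask.nii.gz")

# Entries prefixed 'original_' hold the feature values; the rest are diagnostics
values = {k: v for k, v in features.items() if k.startswith("original_")}
print(f"{len(values)} radiomic features extracted")

The resulting feature vector can then be fed to any of the classical machine learning classifiers discussed in Sect. 5.3.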

5.7 Machine Learning Approaches for Large-Scale Population Studies

Machine learning approaches also play an important role in discovering quantitative and clinically relevant phenotypes from population studies, which can in turn promote the discovery of novel diagnostic biomarkers. In [61], the authors used a deep learning pipeline to extract 82 quantitative phenotypes of the heart and aorta from CMR in a large population study with over 25,000 participants from the UK Biobank [62]. They identified 2,617 significant associations between imaging phenotypes and non-imaging phenotypes of the participants, describing relationships between risk factors and cardiovascular diseases. While the large-scale extraction of biomarkers and phenotypes from population studies is challenging, it is also important to perform quality control of the information extracted from such studies. For example, in CMR studies, the extraction of biomarkers may fail because of poor image quality or image artefacts (e.g. respiratory motion), or the image analysis pipeline may fail, thereby affecting downstream tasks such as diagnosis. To address this problem, it is possible to use machine learning techniques to classify whether the image quality is sufficient for automated analysis [63] or whether the extracted parameters are likely to be correct [10, 64]. We return to the topic of automated quality control in cardiac image analysis in Chap. 7.
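A minimal sketch of such an association analysis is given below, using synthetic data and the statsmodels package; the phenotype names, confounders and Bonferroni correction are illustrative assumptions, not the actual pipeline of [61].

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for a population cohort (the real analysis in [61]
# uses 82 imaging phenotypes and thousands of non-imaging phenotypes)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.uniform(45, 75, n),
    "sex": rng.integers(0, 2, n),
    "systolic_bp": rng.normal(130, 15, n),
    "bmi": rng.normal(27, 4, n),
})
df["lv_mass"] = 80 + 0.3 * df["systolic_bp"] + 0.2 * df["age"] + rng.normal(0, 10, n)
df["lv_edv"] = 140 - 0.5 * df["bmi"] + rng.normal(0, 15, n)

def associate(imaging, risk_factor, confounders=("age", "sex")):
    """Linear association between one imaging and one non-imaging phenotype,
    adjusted for confounders; returns the effect size and p-value."""
    X = sm.add_constant(df[[risk_factor, *confounders]])
    fit = sm.OLS(df[imaging], X).fit()
    return fit.params[risk_factor], fit.pvalues[risk_factor]

pairs = [(i, r) for i in ("lv_mass", "lv_edv") for r in ("systolic_bp", "bmi")]
results = {p: associate(*p) for p in pairs}
alpha = 0.05 / len(pairs)  # Bonferroni correction for multiple testing
significant = {p: v for p, v in results.items() if v[1] < alpha}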

5.8 Challenges

Despite the significant advances in the development of machine learning approaches in cardiology, a number of challenges remain. One of these is that deep learning approaches tend to require significant amounts of training data. In general, the more data are available for training, the more accurate and robust the resulting machine learning models become. The need for large datasets and high-quality annotations makes data sharing even more important, not only for training but also for evaluating machine learning solutions in multi-institutional/multi-national trials. One solution to this challenge has been found in the availability of large datasets from prospective volunteer trials (such as the UK Biobank [65]) or from curated clinical databases such as PhysioNet [66]. However, in practice data sharing is often hampered by technical, legal and ethical challenges. In particular, legal and regulatory requirements represent difficulties for data sharing. An alternative to data sharing is the use of decentralized machine learning or federated learning approaches [67, 68]. In contrast to centralized approaches, in which datasets are marshalled in one central location to train one machine learning model, federated learning uses collaborative training algorithms that do not require the exchange of the training datasets with a central instance. It has been shown that these federated learning approaches can achieve similar performance to conventional centralized approaches and outperform approaches that are trained using data from one site only.
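The sketch below illustrates the core aggregation step of one widely used federated scheme, federated averaging (FedAvg); it is a toy numpy illustration under our own assumptions, not the implementation of [67, 68].

import numpy as np

def federated_averaging(site_weights, site_sizes):
    """One FedAvg aggregation round: combine locally trained model weights,
    weighted by local sample counts; raw patient data never leaves a site.
    site_weights: one list of np.ndarrays (layer weights) per hospital."""
    total = sum(site_sizes)
    return [
        sum(w[k] * (n / total) for w, n in zip(site_weights, site_sizes))
        for k in range(len(site_weights[0]))
    ]

# Two hospitals with 1200 and 300 training cases; toy single-layer 'models'
site_a = [np.array([[0.2, -0.1], [0.4, 0.3]])]
site_b = [np.array([[0.6, 0.1], [0.0, 0.5]])]
global_weights = federated_averaging([site_a, site_b], [1200, 300])

In a full system this aggregation would be repeated over many communication rounds, with each site retraining locally on the updated global weights.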


Another challenge for the clinical adoption of machine learning-based approaches is the perceived black-box nature of many of these approaches. This means that the output of a diagnosis by a machine learning model can be difficult for humans to understand and interpret. Recent guidelines of the European Union emphasize the importance of explainability and interpretability of AI-based approaches, especially if they affect humans directly. However, there is a lack of consensus as to precisely what explainability and interpretability mean in this context. Related to this challenge is the fairness of decision-making algorithms. Fairness can be defined as the absence of any prejudice or bias toward an individual or a group based on a set of protected characteristics such as race, sex or age. It can be difficult to detect biases and unfairness in machine learning approaches that learn from data. The source of such problems is often (but not always) related to biases and/or imbalance in the data used to train the machine learning models. Identifying these biases is a first step towards mitigating them and developing "fair" machine learning approaches.
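As a first practical step, performance can simply be reported per subgroup. The sketch below is a minimal illustration with scikit-learn metrics; the subgroup variable and toy data are our own assumptions.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

def per_group_report(y_true, y_pred, group):
    """Performance broken down by a protected characteristic (e.g. sex, race)."""
    report = {}
    for g in np.unique(group):
        idx = group == g
        report[g] = {
            "n": int(idx.sum()),
            "accuracy": accuracy_score(y_true[idx], y_pred[idx]),
            "sensitivity": recall_score(y_true[idx], y_pred[idx]),
        }
    return report

# Toy example: a large gap between subgroups is a first signal of bias
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 0])
sex = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
print(per_group_report(y_true, y_pred, sex))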

5.9 Closing Remarks

Whilst techniques for automated machine learning-based diagnosis in cardiac imaging are less mature than those for measurement and quantification, significant progress has been made in recent years, partly due to the availability of public databases for some cardiac diagnostic tasks. As in most applications, deep learning models are currently the best-performing techniques for diagnosis in terms of classification accuracy, although classical machine learning models likely still have a role to play, especially in less complex tasks with a limited amount of annotated data. Deep learning models are seen as being less interpretable than some classical machine learning models, but researchers have taken note of the need for interpretability in diagnostic tools and have proposed methodological advances to address this issue. These interpretability techniques need further evaluation on real-world clinical data, their role in clinical workflows needs to be carefully considered, and their impact must be well validated. Of particular concern is the possibility of bias, for example based on the sex or race of the subject: recent work [69] has shown the potential for racial bias in the diagnosis of heart failure based on deep learning-derived LV EF measurements made from CMR imaging. Open and complete reporting of performance across such subgroups is therefore of paramount importance [70]. In the following tutorial you will gain practical experience of using deep learning for a simple diagnostic task.

5.10 Exercises

Exercise 1 Explain the main difference between classical machine learning and deep learning approaches with regard to the features used for automated diagnosis.

Exercise 2 Explain what is meant by “end-to-end learning” in the context of deep learning-based diagnosis.

Exercise 3 A colleague argues that machine learning-based diagnosis tools will never be completely trusted by cardiologists. Therefore, we should consider their use as decision support tools rather than automated diagnosis tools. Do you agree? What implications would this have for the design of such tools?

Exercise 4 What role do you see Bayesian deep learning playing in automated diagnosis?

Exercise 5 A colleague argues that dealing with possible bias in machine learning-based diagnosis is less important than optimizing overall performance. Do you agree?

Exercise 6 A research group is working with cardiologists to develop a tool for automated diagnosis of some rare cardiovascular diseases from cine CMR imaging. Advise the group on what type(s) of machine learning approach might be applicable and what issues they should be aware of.


Exercise 7 Supervised machine learning might seem the preferred approach for automated diagnosis, since the task is to predict a label (diagnosis) given the available data. What role could unsupervised machine learning have in automated diagnosis?

5.11 Tutorial—Two-Class and Multi-class Diagnosis

Tutorial 4
As for the other notebooks, the contents of this notebook are accessible as Electronic Supplementary Material.

Overview
In this hands-on tutorial, you will again use data from the ACDC open access dataset [71], this time not for segmentation but for diagnosis, based on characteristics extracted from the image segmentations and additional patient characteristics. This corresponds to the task targeted in the second part of the paper reporting on the ACDC challenge [47]. You will use data from the 100 patients of the training set, which are equally distributed into 5 (ab)normal subgroups. The tutorial will guide you through the classification of these subjects, starting from two-class diagnosis (e.g. normal versus dilated hearts) and moving to the more complete multi-class diagnosis. The tutorial emphasizes carefully examining the performance against the complexity of the machine learning model, keeping in mind the data used as input.

Objectives
• Consolidate the knowledge you've gained on classification from the toy examples in the hands-on tutorial from Chap. 3.
• Conduct a complete classification task on real-life data.
• Get used to a wider variety of scikit-learn models and be critical about their output.


Computing Requirements
As for the other hands-on tutorials, this notebook starts with a brief "System setting" section, which imports the necessary packages, installs the potentially missing ones, and imports our own modules.
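To give a flavour of what the notebook covers, the sketch below compares two scikit-learn classifiers with cross-validation; it uses random placeholder features rather than the actual ACDC-derived characteristics, so it illustrates the workflow only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholders for segmentation-derived features (volumes, EF, myocardial
# mass, ...) and two-class labels (0 = normal, 1 = dilated cardiomyopathy)
rng = np.random.default_rng(0)
X = rng.random((40, 6))
y = rng.integers(0, 2, 40)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # stratified 5-fold by default
    print(f"{name}: accuracy {scores.mean():.2f} ± {scores.std():.2f}")

Comparing a simple linear model against a more complex one in this way is exactly the kind of performance-versus-complexity analysis the tutorial asks you to be critical about.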

5.12 Opinion

There are ethical, regulatory and practical challenges that need to be addressed to ensure reliability, quality of care and safety before wide-scale adoption into clinical prime time [72−74]. A further challenge is the difficulty of replicating, in other clinical settings, the performance of often complex models built on local databases. The use of federated learning techniques and infrastructure to build models with much wider and more varied datasets across multiple clinical settings and geographies could be a good way of addressing this [67, 68] (see also Chap. 10, Hospital and Patient Perspective). The infrastructure needed to easily deploy models into clinical settings across clinical service delivery organizations with different IT systems also needs to be addressed. Finally, the health economic case will need to be made for individual applications, alongside their clinical utility, as pressure on healthcare budgets will otherwise make commercial success and wide procurement of diagnostic and decision support systems that use AI difficult. Nevertheless, the application of AI has the potential for reproducible clinical assessments through automated measurements, more efficient diagnostic support, improved phenotyping and better risk stratification through the mining of large datasets to uncover clinically relevant information [75]. Its application to the care of patients with cardiovascular disease will be transformative and bring substantial benefit.

Acknowledgements ND was supported by the French ANR (LABEX PRIMES of Univ. Lyon [ANR-11-LABX-0063] within the program "Investissements d'Avenir" [ANR-11-IDEX-0007], and the JCJC project "MIC-MAC" [ANR-19-CE45-0005]).

References
1. Bendayan R, Mascio A, Stewart R, Roberts A, Dobson RJ. Cognitive trajectories in comorbid dementia with schizophrenia or bipolar disorder: The South London and Maudsley NHS foundation trust biomedical research centre (SLaM BRC) case register. Am J Geriatr Psychiatry. 2021;29(6):604−16.
2. Peiffer-Smadja N, Rawson T, Ahmad R, Buchard A, Georgiou P, Lescure F-X, Birgand G, Holmes A. Machine learning for clinical decision support in infectious diseases: A narrative review of current applications. Clin Microbiol Infect. 2020;26(5):584−95.
3. Asan O, Bayrak AE, Choudhury A. Artificial intelligence and human trust in healthcare: Focus on clinicians. J Med Internet Res. 2020;22(6):e15154.
4. Mintz Y, Brodie R. Introduction to artificial intelligence in medicine. Minim Invasive Ther Allied Technol. 2019;28(2):73−81.


5. Colling R, Pitman H, Oien K, Rajpoot N, Macklin P, CM-Path AI in Histopathology Working Group, Snead D, Sackville T, Verrill C. Artificial intelligence in digital pathology: A roadmap to routine use in clinical practice. J Pathol. 2019;249(2):143−50.
6. Akkus Z, Aly YH, Attia IZ, Lopez-Jimenez F, Arruda-Olson AM, Pellikka PA, Pislaru SV, Kane GC, Friedman PA, Oh JK. Artificial intelligence (AI)-empowered echocardiography interpretation: A state-of-the-art review. J Clin Med. 2021;10(7):1391.
7. Lin A, Kolossváry M, Motwani M, Išgum I, Maurovich-Horvat P, Slomka PJ, Dey D. Artificial intelligence in cardiovascular CT: Current status and future implications. J Cardiovasc Comput Tomogr. 2021.
8. Leiner T, Rueckert D, Suinesiaputra A, Baeßler B, Nezafat R, Išgum I, Young A. Machine learning in cardiovascular magnetic resonance: Basic concepts and applications. J Cardiovasc Magn Reson. 2019;21:12.
9. Gupta K, Reddy S. Heart, eye, and artificial intelligence: A review. Cardiol Res. 2021.
10. Ruijsink B, Puyol-Antón E, Oksuz I, Sinclair M, Bai W, Schnabel JA, Razavi R, King AP. Fully automated, quality-controlled cardiac analysis from CMR: Validation and large-scale application to characterize cardiac function. JACC Cardiovasc Imaging. 2020;13(3):684−95.
11. Yasaka K, Akai H, Kunimatsu A, Kiryu S, Abe O. Deep learning with convolutional neural network in radiology. Jpn J Radiol. 2018.
12. Greenspan H, van Ginneken B, Summers RM. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Trans Med Imaging. 2016;35(5):1153−9.
13. Woolley RJ, Ceelen D, Ouwerkerk W, Tromp J, Figarska SM, Anker SD, Dickstein K, Filippatos G, Zannad F, Metra M, Ng L, Samani N, van Veldhuisen DJ, Lang C, Lam CS, Voors AA. Machine learning based on biomarker profiles identifies distinct subgroups of heart failure with preserved ejection fraction. Eur J Heart Fail. 2021;23(6):983−91.
14. Ahmad T, Lund LH, Rao P, Ghosh R, Warier P, Vaccaro B, Dahlström U, O'Connor CM, Felker GM, Desai NR. Machine learning methods improve prognostication, identify clinically distinct phenotypes, and detect heterogeneity in response to therapy in a large cohort of heart failure patients. J Am Heart Assoc. 2018.
15. Chartrand G, Cheng PM, Vorontsov E, Drozdzal M, Turcotte S, Pal CJ, Kadoury S, Tang A. Deep learning: A primer for radiologists. RadioGraphics. 2017;37(7):2113−31. PMID: 29131760.
16. Miller DD, Brown EW. Artificial intelligence in medical practice: The question to the answer? Am J Med. 2018;131(2):129−33.
17. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436−44.
18. Topol EJ. High-performance medicine: The convergence of human and artificial intelligence. Nat Med. 2019;25(1):44−56.
19. Fauw JD, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, Glorot X, O'Donoghue B, Visentin D, van den Driessche G, Lakshminarayanan B, Meyer C, Mackinder F, Bouton S, Ayoub K, Chopra R, King D, Karthikesalingam A, Hughes CO, Raine R, Hughes J, Sim DA, Egan C, Tufail A, Montgomery H, Hassabis D, Rees G, Back T, Khaw PT, Suleyman M, Cornebise J, Keane PA, Ronneberger O. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24:1342−50.
20. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115−8.
21. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273−97.
22. Vapnik V. The nature of statistical learning theory. Springer; 2013.
23. Crammer K, Singer Y. On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res. 2001;2:265−92.
24. Criminisi A, Shotton J, Konukoglu E. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found Trends Comput Graph Vis. 2012;7(2−3):81−227.
25. Rokach L, Maimon O. Top-down induction of decision trees classifiers - a survey. IEEE Trans Syst Man Cybern Part C (Appl Rev). 2005;35(4):476−87.
26. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55:119−39.


27. Fukushima K, Miyake S. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition. 1982;15:455−69.
28. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278−324.
29. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735−80.
30. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504−7.
31. Kingma DP, Welling M. Auto-encoding variational bayes. In: International conference on learning representations (ICLR); 2014.
32. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville AC, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems (NIPS); 2014. p. 2672−80.
33. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: International conference on machine learning (ICML); 2017. p. 214−23.
34. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2015. p. 3431−40.
35. Sudre CH, Li W, Vercauteren T, Ourselin S, Cardoso MJ. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep learning in medical image analysis and multimodal learning for clinical decision support; 2017. p. 240−8.
36. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. Medical image computing and computer-assisted intervention - MICCAI 2015. Springer; 2015. p. 234−41.
37. Yi X, Walia E, Babyn P. Generative adversarial network in medical imaging: A review. Med Image Anal. 2019;58.
38. Isola P, Zhu J-Y, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2017. p. 1125−34.
39. Madani A, Moradi M, Karargyris A, Syeda-Mahmood T. Chest x-ray generation and data augmentation for cardiovascular abnormality classification. In: SPIE medical imaging: image processing, vol. 10574; 2018. p. 105741M.
40. Torfi A, Fox EA, Reddy CK. Differentially private synthetic medical data generation using convolutional GANs. Inf Sci. 2022;586:485−500.
41. Madani A, Moradi M, Karargyris A, Syeda-Mahmood T. Semi-supervised learning with generative adversarial networks for chest x-ray classification with ability of data domain adaptation. In: Proceedings of the IEEE international symposium on biomedical imaging (ISBI); 2018. p. 1038−42.
42. Leibig C, Allken V, Ayhan MS, Berens P, Wahl S. Leveraging uncertainty information from deep neural networks for disease detection. Sci Rep. 2017;7(1):1−14.
43. Abdar M, Pourpanah F, Hussain S, Rezazadegan D, Liu L, Ghavamzadeh M, Fieguth P, Cao X, Khosravi A, Acharya UR, Makarenkov V, Nahavandi S. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf Fusion. 2021;76:243−97.
44. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97−109.
45. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: International conference on machine learning (ICML); 2016. p. 1050−9.
46. Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in neural information processing systems (NIPS); 2017. p. 6402−13.
47. Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng P, Cetin I, Lekadir K, Camara O, Ballester MAG, Sanroma G, Napel S, Petersen SE, Tziritas G, Grinias E, Khened M, Varghese A, Krishnamurthi G, Rohé M, Pennec X, Sermesant M, Isensee F, Jaeger P, Maier-Hein KH, Full PM, Wolf I, Engelhardt S, Baumgartner CF, Koch LM, Wolterink JM, Isgum I, Jang Y, Hong Y, Patravali J, Jain S, Humbert O, Jodoin P. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans Med Imaging. 2018;37(11):2514−25.


48. Isensee F, Jaeger PF, Full PM, Wolf I, Engelhardt S, Maier-Hein KH. Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features. In: Statistical atlases and computational models of the heart; 2017. p. 120−9.
49. Khened M, Varghese A, Krishnamurthi G. Densely connected fully convolutional network for short-axis cardiac cine MR image segmentation and heart diagnosis using random forest. In: Statistical atlases and computational models of the heart; 2017. p. 140−51.
50. Cetin I, Sanroma G, Petersen SE, Napel S, Camara O, Ballester MAG, Lekadir K. A radiomics approach to computer-aided diagnosis with cardiac cine-MRI. In: Statistical atlases and computational models of the heart; 2017. p. 82−90.
51. Wolterink JM, Leiner T, Viergever MA, Isgum I. Automatic segmentation and disease classification using cardiac cine MR images. In: Statistical atlases and computational models of the heart; 2017. p. 101−10.
52. Biffi C, Cerrolaza JJ, Tarroni G, Bai W, de Marvao A, Oktay O, Ledig C, Folgoc LL, Kamnitsas K, Doumou G, Duan J, Prasad SK, Cook SA, O'Regan DP, Rueckert D. Explainable anatomical shape analysis through deep hierarchical generative models. IEEE Trans Med Imaging. 2020;39(6):2088−99.
53. Clough JR, Oksuz I, Puyol-Antón E, Ruijsink B, King AP, Schnabel JA. Global and local interpretability for cardiac MRI classification. In: Medical image computing and computer assisted intervention (MICCAI); 2019. p. 656−64.
54. Kim B, Wattenberg M, Gilmer J, Cai CJ, Wexler J, Viégas FB, Sayres R. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In: International conference on machine learning (ICML), vol. 80. PMLR; 2018. p. 2673−82.
55. Zheng Q, Delingette H, Ayache N. Explainable cardiac pathology classification on cine MRI with motion characterization by semi-supervised learning of apparent flow. Med Image Anal. 2019;56:80−95.
56. Puyol-Antón E, Ruijsink B, Gerber B, Amzulescu MS, Langet H, De Craene M, Schnabel JA, Piro P, King AP. Regional multi-view learning for cardiac motion analysis: Application to identification of dilated cardiomyopathy patients. IEEE Trans Biomed Eng. 2019;66(4):956−66.
57. Ouyang D, He B, Ghorbani A, Yuan N, Ebinger J, Langlotz CP, Heidenreich PA, Harrington RA, Liang DH, Ashley EA, Zou JY. Video-based AI for beat-to-beat assessment of cardiac function. Nature. 2020;580:252−6.
58. Alday EAP, Gu A, Shah AJ, Robichaux C, Wong AKI, Liu C, Liu F, Rad AB, Elola A, Seyedi S, Li Q, Sharma A, Clifford GD, Reyna MA. Classification of 12-lead ECGs: The PhysioNet/computing in cardiology challenge 2020. Physiol Meas. 2020;41(12):124003.
59. Aerts HJWL, Velazquez ER, Leijenaar RTH, Parmar C, Grossmann P, Carvalho S, Bussink J, Monshouwer R, Haibe-Kains B, Rietveld D, Hoebers F, Rietbergen MM, Leemans CR, Dekker A, Quackenbush J, Gillies RJ, Lambin P. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006.
60. Raisi-Estabragh Z, Izquierdo C, Campello VM, Martin-Isla C, Jaggi A, Harvey NC, Lekadir K, Petersen SE. Cardiac magnetic resonance radiomics: Basic principles and clinical perspectives. Eur Heart J Cardiovasc Imaging. 2020;21(4):349−56.
61. Bai W, Suzuki H, Huang J, Francis C, Wang S, Tarroni G, Guitton F, Aung N, Fung K, Petersen SE, Piechnik SK, Neubauer S, Evangelou E, Dehghan A, O'Regan DP, Wilkins MR, Guo Y, Matthews PM, Rueckert D. A population-based phenome-wide association study of cardiac and aortic structure and function. Nat Med. 2020;26:1654−62.
62. Petersen SE, Matthews PM, Francis JM, Robson MD, Zemrak F, Boubertakh R, Young AA, Hudson S, Weale P, Garratt S, Collins R, Piechnik S, Neubauer S. UK Biobank's cardiovascular magnetic resonance protocol. J Cardiovasc Magn Reson. 2016;18:8.
63. Tarroni G, Oktay O, Bai W, Schuh A, Suzuki H, Passerat-Palmbach J, de Marvao A, O'Regan DP, Cook S, Glocker B, Matthews PM, Rueckert D. Learning-based quality control for cardiac MR images. IEEE Trans Med Imaging. 2019;38(5):1127−38.
64. Tarroni G, Bai W, Oktay O, Schuh A, Suzuki H, Glocker B, Matthews PM, Rueckert D. Large-scale quality control of cardiac imaging in population studies: Application to UK Biobank. Sci Rep. 2020;10(1):1−11.


65. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, Liu B, Matthews P, Ong G, Pell J, Silman A, Young A, Sprosen T, Peakman T, Collins R. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Med. 2015;12(3):e1001779.
66. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet. Circulation. 2000;101(23):e215−20.
67. Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S, Bakas S, Galtier MN, Landman BA, Maier-Hein K, Ourselin S, Sheller M, Summers RM, Trask A, Xu D, Baust M, Cardoso MJ. The future of digital health with federated learning. NPJ Digit Med. 2020;3:119.
68. Kaissis GA, Makowski MR, Rueckert D, Braren RF. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell. 2020;2:305−11.
69. Puyol-Antón E, Ruijsink B, Harana JM, Piechnik SK, Neubauer S, Petersen SE, Razavi R, Chowienczyk P, King AP. Fairness in cardiac magnetic resonance imaging: Assessing sex and racial bias in deep learning-based segmentation. Front Cardiovasc Med. 2022;9:859310. https://doi.org/10.3389/fcvm.2022.859310
70. Noseworthy PA, Attia ZI, Brewer LC, Hayes SN, Yao X, Kapa S, Friedman PA, Lopez-Jimenez F. Assessing and mitigating bias in medical artificial intelligence: The effects of race and ethnicity on a deep learning model for ECG analysis. Circ Arrhythmia Electrophysiol. 2020;13(3).
71. ACDC challenge website. https://www.creatis.insa-lyon.fr/Challenge/acdc/
72. Dey D, Slomka PJ, Leeson P, Comaniciu D, Shrestha S, Sengupta PP, Marwick TH. Artificial intelligence in cardiovascular imaging: JACC state-of-the-art review. J Am Coll Cardiol. 2019;73(11):1317−35.
73. Fenech ME, Buston O. AI in cardiac imaging: A UK-based perspective on addressing the ethical, social, and political challenges. Front Cardiovasc Med. 2020;7:54.
74. Petersen SE, Abdulkareem M, Leiner T. Artificial intelligence will transform cardiac imaging—opportunities and challenges. Front Cardiovasc Med. 2019;6:133.
75. Seetharam K, Brito D, Farjo PD, Sengupta PP. The role of artificial intelligence in cardiovascular imaging: State of the art review. Front Cardiovasc Med. 2020;7:374.

6 Outcome Prediction
Buntheng Ly, Mihaela Pop, Hubert Cochet, Nicolas Duchateau, Declan O'Regan and Maxime Sermesant

Contents
6.1 Clinical Introduction
6.2 Overview
6.3 Current Clinical Methods to Predict Outcome
6.4 AI-Based Methods to Predict Outcome
6.5 Application: Prediction of Response Following Cardiac Resynchronization Therapy (CRT)
6.6 Application: AI Methods to Predict Atrial Fibrillation Outcome
6.7 Application: Risk Stratification in Ventricular Arrhythmia
6.8 Closing Remarks
6.9 Exercises
6.10 Tutorial—Outcome Prediction
6.11 Opinion
References

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-05071-8_6.
Authors' contribution: • Introduction, Opinion: DO. • Main chapter: BL, MP, HC, MS. • Tutorial: ND.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_6


Learning Objectives
At the end of this chapter you should be able to:
O6.A Compare and contrast traditional and AI-based methods for outcome prediction in cardiology
O6.B Explain some ways in which AI models can be used to predict response to cardiac resynchronization therapy, using either supervised or unsupervised formulations
O6.C Describe how AI can be used to predict outcomes of atrial fibrillation
O6.D Explain how AI can assist in risk stratification in ventricular arrhythmia

6.1 Clinical Introduction

Outcome prediction is a critical part of clinical decision making in cardiovascular disease. Accurate assessment of a patient's risk and the timing of future events informs the choice of evidence-based prevention and treatment [1]. Imaging plays a pivotal role in risk stratification by visualising disease status, assessing disease trajectory and evaluating response to therapy. An example is the use of coronary artery calcium scoring (see also Chap. 4, Sect. 4.5) as a semi-quantitative test for measuring calcified coronary artery plaque, which can be of value in risk-stratifying patients for future cardiovascular disease endpoints, including guiding decisions about statin therapy in selected groups [2]. Although calcium scoring is a simple and highly reproducible test, it does not account for prognostically important variations in regional distribution, intensity characteristics, or lesion-specific features [3]. A similar pattern of limitations emerges when using imaging to identify predisposing substrates and triggers associated with sudden cardiac death (SCD). Implantable cardioverter-defibrillators (ICDs) are the most effective approach to primary prevention of SCD, and current guidelines regarding device implantation are based on an imaging-derived LV EF ≤35% [4]. However, the majority of out-of-hospital cardiac arrests occur in patients with only mild to moderate dysfunction who might be denied an ICD under current best practice [5], and so the reliance on single-parameter thresholds fails to identify many of those who would benefit from the intervention [6]. Risk prediction guidelines draw insight from large-scale clinical studies through linear regression modelling of conventional explanatory variables, but this approach does not embrace the dynamic physiological complexity of heart disease [7]. Even objective quantification of heart function by conventional analysis of cardiac imaging relies on crude measures of global contraction that are only moderately reproducible and insensitive to the underlying disturbances of cardiovascular physiology [8]. In routine practice, observer-driven pattern recognition is also used to guide classification, introducing value from expertise but at the expense of objectivity and standardization [9]. Discretising severity into subjective categories may facilitate interpretability but incurs a loss of predictive power, especially when building risk models [10]. Even consensus guidelines on patient management, for instance the investigation of stable chest pain, may substantially diverge when different assumptions, biases and inferential models drive their design [11].

The growing abundance of digital medical imaging linked to electronic health records presents an opportunity to develop prediction models that fully exploit biologically rich and diverse datasets at scale. Systematic quantification and evaluation of novel prognostic features could be transformative in the ambition to deliver "personalized medicine" tailored to individual characteristics, including both phenotypic and genotypic profiles [12]. However, despite the exponential growth of machine learning approaches for prediction and classification tasks in healthcare [13], safe and timely translation into clinically validated and regulated systems has proved challenging [14]. Systematic reviews of machine learning-based cardiovascular risk prediction have revealed inconsistent reporting, study heterogeneity and poor methodology [15]. In medical imaging applications of machine learning there are relatively few prospective or randomized trials, and independent external validation is scarce, increasing the risk of reporting biased performance estimates [16, 17]. Coordinated national and international efforts to enhance health interactions through access to large-scale data and advanced analytics are accelerating the pace of prognostic algorithm development. Examples include community studies such as the 500,000 participants of the UK Biobank, of whom 20% are being recalled for CMR [18], and the German National Cohort of 200,000 individuals, including 30,000 with imaging [19]. Guidance is also emerging around the use of open data standards for healthcare informatics platforms to enable computable biomedical data to be discovered, analysed and evaluated in a trusted environment [20]. Here there is a growing role for federated learning architectures, where data are not exchanged, to address privacy concerns and provide access to heterogeneous samples [21]. The most pressing bottleneck to progress is developing high-quality harmonized medical image data resources that have a robust ground truth coupled with active linkages to health events [22]. While the focus of the first wave of radiology AI applications has been on lesion detection, it is machine learning to guide risk stratification, assess treatment responses and perform outcome prediction that will be at the forefront of delivering actionable insights into clinical care [23].

Meaningful risk stratification must inform evidence-based management. For instance, an attractive target for better outcome prediction is where prognostically rich data are not fully exploited by conventional analyses and any re-classification of risk group leads to a change in management [24]. Such individual-level modelling requires clinical studies that capture how disease and treatment responses vary over time [25]. An advantage over developing sophisticated new image biomarkers of disease is that outcome prediction is readily interpretable, but it remains crucial to inform clinicians what features were important in the classification and to be able to frame the results with a level of confidence. Where machine learning is brought closer to clinical decision making, it is also vital to fully understand the role of human factors in such an unfamiliar cognitive environment—both for medics and patients. While the majority of patients currently support doctors using AI in the cardiovascular healthcare sector, that confidence is easily lost, and far more needs to be done to include stakeholders in setting priorities, ensuring trustworthiness, and addressing health inequalities [26].

6.2 Overview

Following Chap. 5 on diagnosis, this chapter develops another classical problem in medical data analysis where AI has a strong role to play: outcome prediction. To help the reader appreciate the impact of AI in this field, the approaches taken by more traditional outcome prediction methods are first summarized. It is shown how methods for predicting outcome can be framed in different ways and can make use of a wide range of disparate data sources. In particular, outcome prediction is often presented as a problem that can be addressed using a supervised learning formulation, given that in most cases outcome labels are available (for example, the time to a negative event such as death or re-hospitalization, possibly encompassing primary and/or secondary endpoints). However, this chapter also presents how an unsupervised formulation could help in some exemplar applications. This point of view is illustrated further in the hands-on tutorial accompanying this chapter (see Sect. 6.10).

6.3 Current Clinical Methods to Predict Outcome

Outcome prediction models of a disease or its recurrence following treatment are extensively used in clinical practice, medical research and public health [27]. In this regard, the ability to predict continuous or binary outcomes in patients with cardiovascular disease (CVD) has the potential for accurate identification of risk factors, stratification, superior treatment planning, as well as informed decision making [28, 29]. Specifically, modelling the outcome of arrhythmia-related cardiac diseases (such as atrial fibrillation, ventricular arrhythmia and heart failure) requires not only the selection of precise variables to accurately identify the critical predictors, but also meticulous adjustment for time dependencies among treatments and responses [30]. Prior to the recent introduction of AI-based prediction methods, these outcomes were modelled using classical statistical approaches, which are briefly outlined below along with the associated terminology.

For cardiac arrhythmia-related conditions, survival data (i.e. the period from a specific time point to an event of interest [31]) refer to the time from an arrhythmia episode or heart failure diagnosis to death, or to any time-dependent phenomenon such as arrhythmia-free survival (i.e. the time until arrhythmia relapses). To understand arrhythmia-related survival data, one can generate a Kaplan-Meier (K-M) survival curve by representing time (days, months or years) on the horizontal axis and the calculated survival probability on the vertical axis. An example of a K-M curve is illustrated in Fig. 6.1 using virtual data and freely available source code [32]. Explicitly, for each time corresponding to an event, a new value for the K-M curve is calculated by dividing the number of events that have occurred by the number of patients remaining at risk at that time, and this new value is then used to calculate the survival probability [33] and its confidence interval. Censored data refers to incomplete data, such as when a patient drops out of the study during the follow-up time.

Fig. 6.1 Kaplan-Meier curve to estimate survival probability (using virtual data and freely available code in [32])

The so-called 'risk' is defined as the probability of an event happening over a period of time. Should this risk vary over time, one can estimate the risk at a particular time point by calculating a new parameter named the 'hazard'. Typically, regression models or Cox proportional hazards models are employed for comprehensive analysis of survival data. The Cox hazard model [34] relates the log hazard ratio to a linear predictor of one or multiple explanatory variables and is considered 'semi-parametric', meaning that there is no requirement to parameterise the underlying survival distribution. Cox regression models have been widely applied to predict the outcome of abnormal heart rhythm conditions, although most of them cannot indicate when dangerous arrhythmic events might occur (or recur after therapy) within the following 1 to 10 years. Novel risk prediction models can express results on a more specific time scale [35]. Although limited, multi-variable statistical models built from data correlate better with actual patient outcomes than the predictions given by clinical experts, especially given the inter-physician variability.

The development of robust tools for primary and secondary outcome prediction is of great importance for all cardiovascular applications. Let us consider for instance the case of atrial fibrillation (AF), which is the most prevalent arrhythmia and is associated with life-threatening complications (e.g. embolic stroke, co-existence with heart failure, dementia) and death [36]. These complications and potentially fatal events lead to considerable morbidity and mortality, posing a financial burden on the healthcare system. Notably, more than 30 million people worldwide suffer from AF, hence the considerable clinical interest in predicting: the outcomes prior to the intervention; incident or recurrent AF after ablation; and the progression from sudden/paroxysmal to persistent or permanent AF. Despite a relatively high acute success rate of radiofrequency (RF) catheter ablation therapy, predicting long-term AF recurrence during follow-up remains challenging. Using regression with multiple variables, various clinical scores can be calculated, such as the APPLE score (using: age, persistent AF, imPaired eGFR, left atrium (LA), ejection fraction) at baseline with rhythm outcomes documented using 1-week monitoring with a Holter device [37], or the MB-LATER score (using: male gender, bundle branch block, LA, AF type, early recurrences) 3 months after ablation, although the predictive ability of these scores may appear modest [38]. Other clinical prediction methods for AF outcome rely on the tedious classification of signals recorded by the common 12-lead electrocardiogram (ECG) [39], the amount of atrial fibrosis identified by CMR imaging [40], or CT imaging-defined atrial shape statistics [41]. However, the former predictor requires a substantial amount of dedicated time and resources to process a large number of ECG signals, whereas the latter predictors are limited by the relatively poor spatial resolution of the data acquired in clinics and by the fact that most image-based segmentation methods still lack thorough validation. Thus, a consequence of using more sophisticated prognostic and risk prediction methods for primary or secondary outcome prediction is that the number of input variables becomes significant. This leads to complex regression models, potential bias, and difficulties in assessing model performance via calibration and discrimination measures. However, the utility of calibration (i.e. overall performance and goodness of fit) and discrimination (i.e. predictive values, ROC curve) measures is uncertain, and cannot guarantee the robustness of the prediction model and its overall contribution to the net benefit and cost effectiveness of the study. Equally important, it should be underlined that current approaches to predict outcomes strongly depend on: the statistical assumptions of the model employed; data source and standardization; sample size in large cohorts of patients; misinterpretation of scores; cumbersome long-term survival analysis (including missing data at follow-up and/or unexpected mortality); as well as on the multiple variables in the model (clinical and therapy-related, taken at baseline), which altogether complicate the analysis [27].
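To make the survival-analysis terminology above concrete, the sketch below fits a Kaplan-Meier curve and a Cox proportional hazards model on toy follow-up data using the open-source lifelines package; the data, covariate and event definition are invented for illustration and do not come from any study cited in this section.

# pip install lifelines
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

# Toy survival data: follow-up time in months, event flag (1 = AF recurrence,
# 0 = censored, e.g. patient dropout) and one baseline covariate
df = pd.DataFrame({
    "months":     [3, 8, 12, 14, 20, 24, 24, 30, 36, 40],
    "recurrence": [1, 1,  0,  1,  0,  1,  0,  1,  0,  0],
    "la_volume":  [62, 71, 48, 80, 50, 75, 45, 83, 52, 47],
})

kmf = KaplanMeierFitter()
kmf.fit(df["months"], event_observed=df["recurrence"])
print(kmf.survival_function_)  # K-M estimate of arrhythmia-free survival

cph = CoxPHFitter()  # semi-parametric Cox proportional hazards model
cph.fit(df, duration_col="months", event_col="recurrence")
cph.print_summary()  # log hazard ratio for the covariate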

6.4 AI-Based Methods to Predict Outcome

To address the limitations of traditional methods employed for clinical outcome prediction in CVD patients, tools based on machine learning have recently been proposed, either agnostic approaches or empirically optimized data-driven models. These can partially overcome the issues associated with traditional regression-based prediction methods. However, some initial machine learning-based methods (e.g. SVM or random forests, see Chap. 5, Sect. 5.3) did not prove to be sufficiently superior, especially when using the ROC curve (more specifically, the AUC, see Chap. 2, Sect. 2.5) as a criterion for comparing the outcome predicted by these machine learning methods versus traditional regression methods [42]. Thus, better approaches are still needed; these should be able to handle multi-variable inputs as well as complex relationships between inputs and output predictions, while being customized for the data specific to a particular clinical study. In this context, several novel AI-based methods have been developed to accurately and robustly predict complex clinical outcomes, as illustrated in this chapter for arrhythmia and dyssynchrony.

Fig. 6.2 Example of generic pipeline for building AI-based models to predict survival rate or therapy outcome for cardiac applications

Outcome prediction models must be able to forecast a future event based on the patient's pre-recorded descriptors. As shown in Fig. 6.2, these descriptors can include: specific image-based biomarkers (e.g. amount of fibrotic scar or wall thickness, atrial/ventricular shape, indices like ejection fraction and strain); physiological ECG signals or blood pressure; as well as clinical baseline descriptors such as gender, race, phenotype, etc. Based on these features, the AI-based models are optimised to group the patients into specific outcome classes. Technically, the model can be defined in a similar way to diagnosis models (see Chap. 5), but the main difference is the delay between the descriptor registration and the desired endpoints, where the complexity of the ground truth outcome and the evolution of the clinical descriptors can be recorded and exploited through time. The clinical outcome can be complex (depending on the disease and/or therapy of interest) and can include: the acute success rate or response to a specific therapy; the mid-to-long-term survival rate following the therapy; or other events such as intervention-related complications, worsening of already existing comorbidities, or sudden cardiac death (SCD). The class output of the prediction model can be defined as binary, based on the patient's status at a specific time point, or can be diversified into more classes to include the status or evolution at each follow-up point, for instance the event/death occurrence in the 1st, 2nd or 3rd year, etc. Furthermore, while accurate AI-based outcome predictions based on pre-therapy/follow-up descriptors would be beneficial for clinical decision making and therapy planning, knowledge of the dynamic evolution of these descriptors post-therapy could provide valuable insights for modelling an optimized response. Integrating the descriptors at each follow-up into the AI outcome model has potential not only for accurately predicting the patient status at the next follow-up point, but also for a better understanding of the relationship between descriptor volatility and the eventual clinical outcome.

In the following subsections, we provide three applications of AI-based methods implemented for modelling outcome predictions for distinct pathological cases, namely: heart failure; atrial fibrillation; and ventricular arrhythmia. The scope of this chapter is not intended to be an exhaustive review of all the AI-based methods for outcome prediction; thus, the methods presented below are meant to provide representative examples from our field of expertise and to illustrate how supervised and unsupervised AI approaches (introduced in Chaps. 2 and 5) could be integrated into clinical outcome prediction pipelines.
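A minimal sketch of the kind of comparison discussed above (a machine learning model versus a regression baseline, judged by the AUC) is shown below, using scikit-learn with synthetic stand-in descriptors; none of the data or settings come from the cited studies.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for patient descriptors (image-based biomarkers, ECG indices,
# clinical baseline data) and a binary outcome label
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [
    ("logistic regression (baseline)", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, prob):.2f}")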

6.5 Application: Prediction of Response Following Cardiac Resynchronization Therapy (CRT)

Cardiac Resynchronization Therapy (CRT) involves the implantation of a biventricular pacing device in selected patients with mild to severe systolic heart failure (HF) to address the symptoms of HF and to reduce HF hospitalizations. The pacing restores synchronous beating of the right and left ventricles, improving the overall biomechanical function of the heart and, consequently, the ejection fraction (EF). According to the American Heart Association (AHA) and European Society of Cardiology (ESC) guidelines, the two official evidence-based guidelines for HF management, CRT provides a clear-cut benefit to patients with reduced LV ejection fraction (≤35%), prolonged QRS duration (≥150 ms), left bundle branch block (LBBB) morphology, and in sinus rhythm, who are still at risk of advanced HF progression despite receiving optimal medical treatment. Response to CRT corresponds to the degree of LV remodelling documented in the imaging, usually by quantifying the reduction in the LV end-systolic volume. However, the evidence for a positive CRT response becomes less clear when the QRS duration is between 130−150 ms, with non-LBBB morphology or in AF patients, since the recommendations start to deviate between the two guidelines [43]. In addition, depending on the current selection criteria, between 20 and 30% of patients who underwent CRT were reported as non-responders [44]. While strategies to improve CRT response might also involve the improvement of CRT technology and post-implant care, pacing optimization and patient selection both still play a major role in limiting unnecessary implants and correctly assigning patients to appropriate treatment. Moreover, while the current recommendations are based largely on LV ejection fraction, QRS duration and morphology, several clinical trials have demonstrated that patient response to CRT also depends on demographic and clinical characteristics as well as on the electrical and mechanical function of the heart [44]. Thus, the interest in CRT patient assessment has started to shift towards the inclusion of imaging data. This is where AI methods have made their way into CRT response prediction, thanks to their ability to integrate and interpret the diverse and heterogeneous data involved in treatment personalization for superior CRT outcome. CRT outcome prediction using AI-based models can be formulated as a supervised or unsupervised problem, as shown in Fig. 6.3. It should be noted that recent developments suggest that AI-based models outperform conventional clinical methods [45]. The following subsections will provide selected application examples of AI-based models built for CRT outcome prediction. As AI is an emerging technique, the reader is advised to seek contemporary reviews such as [45] for a more comprehensive review of AI methods used in CRT.

Fig. 6.3 Formulation of AI-based model prediction for CRT outcome

Supervised Prediction of CRT Response
Supervised outcome models are trained to predict the endpoint according to the input descriptors, which, in the context of CRT outcome, can provide a direct answer as to whether the patient would benefit from the therapy. To train such models, datasets comprising the ground truth endpoint are required. The primary endpoint of CRT clinical trials usually entails either death from any cause or non-fatal HF events. However, such datasets are usually not publicly available and their acquisition implies large clinical trials that span multiple years. In addition, the retrospective nature of these datasets means that the available clinical descriptors or imaging data are limited by the study protocol. This can limit researchers' ability to investigate novel biomarkers, since the required data may not have been recorded. Therefore, to facilitate these studies, classification tasks are usually simplified and focused on patient response to CRT. The labels "responder" or "non-responder" are assigned to the patients who showed significant LV remodelling as quantified by the reduction in the end-systolic volume (with thresholds typically between ≥10% and ≥15%) at the 6-month post-operation follow-up. This predictor was shown to be a strong indicator of lower long-term mortality and HF events [46]. Both machine learning and deep learning methods have been shown to provide additional predictive value over the metrics used in current clinical guidelines (LV ejection fraction, QRS duration and LBBB) [47−50]. In addition to these metrics, machine learning methods are able to exploit detailed cardiac motion data for outcome prediction (i.e. response to CRT) [47, 50]. For example, a random forest-based machine learning model was able to achieve a higher AUC score than a logistic regression model (0.74 compared to 0.67) [50]. Furthermore, owing to their ability to process large multidimensional input data (i.e. the imaging data), deep learning methods are capable of making accurate predictions from the LV and RV segmentation masks of CMR images [48]. Using such masks of heart motion through the cardiac cycle phases, a deep learning model can learn to predict patient CRT response without the need for feature extraction or additional clinical descriptors.

The binary evaluation, "responder" versus "non-responder", using a single cut-off value of the LV end-systolic volume, might not accommodate all the possible outcomes and the subtlety of every patient's response to CRT. Moreover, the categorization is even more heavily impacted by the poor reproducibility of the serial LV end-systolic volume measurements. To mitigate this issue, a "super-responder" class can be considered in the classification model, which provides information on the patients most likely to gain strong benefit from the therapy. Given appropriate data for supervised training, machine learning can be used to predict super-response as well as just response to CRT [51]. In addition, in studies based on long-term CRT clinical trials, machine learning methods can be used to provide more prediction detail than simply response or even super-response. Patient survival through the follow-up period could also be framed as the model output [49]. In this case, the output prediction may be split into different classes according to patient survival post-CRT therapy, which could provide better insight into the clinical evolution of the pathology, offering potential benefits for decision making and planning strategy. An AI classification model usually predicts the output as a value between 0 and 1 for each class, which is usually regarded as the probability that the inputs belong to the specific class. However, most incorrect predictions are still associated with a high probability. Enforcing the model training to be uncertainty-aware [52] could provide additional information when analysing the model output. The integration of uncertainty into model predictions also allows the variability of clinical data (image and segmentation quality, incomplete clinical variables, etc.) to be included in the prediction. This variability can be more prevalent in routine clinical care compared to data from clinical trials. Complementing the model output with uncertainty information could be extremely useful to ensure clinical adoption of the AI model as a decision support tool.

Unsupervised Prediction of CRT Response
While supervised models may be bounded by the output classes, unsupervised models do not require any output label for the learning phase. From the perspective of better characterizing the patient outcome, the main objective behind this family of methods is to fit the patients (as represented by the input characteristics) into different archetypal subgroups or phenogroups, according to the similarity of their characteristics and not based on already existing labels. These phenogroups can then be interpreted by analysing their differences in outcome and patient characteristics.


Being agnostic to the potential labels makes an unsupervised approach capable of identifying two phenogroups of patients based on differences in clinical characteristics and long-term prognosis [53]. Survival analysis of the unsupervised phenogroups highlights the distinction in survival rate between the two populations. The phenogroup analysis can also be related back to the input characteristics. For example, in [53] one phenogroup was found to have a higher proportion of CRT responders as well as certain input features (apical rocking, septal flash), whereas a second phenogroup featured signs of advanced HF (RV dysfunction, kidney failure and biventricular dilation). The number of phenogroups in the unsupervised model is not limited to two, and several profiles can be generated to account for the spectrum of possible outcomes following therapy. The optimum number of phenogroups can be determined independently of the endpoint by maximising the separation between phenogroups [53]. Alternatively, it can be set to maximise the statistical significance of the phenogroups with respect to a desired endpoint. Note that while the number of phenogroups may then be biased toward a ground-truth label, the optimization itself does not take the label into account, and the population is thus still grouped according to the correlation of the input characteristics. In practice, an initial number is chosen for the first training, and the model is retrained with a new number of phenogroups until the optimum condition is met. Up to four phenogroups can be defined based on the primary endpoint (death or non-fatal HF event), to account for different levels of prognosis: the best, the worst, and two in-between phenogroups [54]. Survival analysis of the population in each phenogroup demonstrates the ability of the unsupervised model to group the patients likely or unlikely to benefit from CRT. Without prior knowledge, the unsupervised model was able to separate the CRT "responder" and "non-responder" groups in a statistically significant manner, with better accuracy than any single clinical descriptor alone [54]. It is also interesting to note that, without supervision, the identified phenogroups actually correspond to specific mechanisms that can condition CRT (non-)response, previously described by clinicians based on their physiological knowledge [55]. Although the binary division of patients may appear too simplistic to accommodate all possible patient reactions to therapy and their long-term prognosis, the unsupervised model allows assigning a "risk profile" to the patient according to common clinical descriptors, which could prove useful in CRT patient selection and clinical decision making. An additional benefit of unsupervised learning is its flexibility with respect to the exploitable data. Although a significant number of known ground-truth outcomes is still required for accurate phenogroup analysis, or at least for better interpretation, unsupervised training can by definition be performed on databases with unknown or incomplete outcome labels. The lack of an explicit classification loss during optimization also limits overfitting on small datasets compared to supervised methods [56]. Beyond the specific context of CRT, unsupervised phenogrouping approaches may be useful in the overall context of HF to unravel novel disease entities, knowing, for instance, that dilated cardiomyopathies are currently poorly classified.
This would in turn allow more personalized patient management by selecting the drugs or interventions (e.g. CRT) most likely to be effective for the specific phenogroups.
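As a hedged illustration of the retraining loop described above, the sketch below groups synthetic patient descriptors with k-means and picks the number of phenogroups that maximises the silhouette score, one standard measure of cluster separation. The data, the choice of k-means, and the silhouette criterion are assumptions for illustration; the cited studies used their own clustering pipelines.

```python
# Hedged sketch: choose the number of phenogroups without using outcome labels.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for standardized clinical/imaging descriptors
X, _ = make_blobs(n_samples=200, centers=3, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):  # retrain with a new number of phenogroups each time
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # cluster separation, label-free

best_k = max(scores, key=scores.get)
print(f"optimum number of phenogroups: {best_k}")
```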

6.6 Application: AI Methods to Predict Atrial Fibrillation Outcome

Accurate prediction of the primary and secondary outcomes of arrhythmic events is a critical task for early prevention and for selecting the most effective treatment. Atrial fibrillation (AF) is the most prevalent arrhythmia and, along with its associated comorbidities [57], represents a burden to healthcare systems and an increased risk of stroke and mortality for the patient, particularly in the ageing European and North American populations. Among the most important comorbidities are coexisting HF, ischemic heart disease, hypertensive or valvular heart disease, and diabetes. With respect to AF management, the first line of therapy is anti-arrhythmic medication (e.g. beta blockers, calcium channel blockers) to control the heart rate, along with anti-coagulants that prevent blood clots and stroke. However, during prolonged treatment spanning years, antiarrhythmic drugs are often associated with side effects (e.g. shortness of breath, dizziness, tiredness, slow heart rate, low blood pressure) and can, over time, affect the normal function of several organs (e.g. liver, kidney, thyroid, lungs). Other therapy options are cardioversion (to reset aberrant heart rhythms) and catheter ablation (to eliminate the atrial foci generating abnormal electrical impulses). Notably, among this spectrum of therapies, only catheter ablation is potentially curative. This minimally-invasive procedure is performed under imaging guidance and consists of eliminating AF foci using thermal energy (e.g. radiofrequency or cryoablation), via an ablation catheter whose tip is precisely manoeuvred to destroy only the tiny tissue areas harbouring AF sources. Several predictors of AF risk, as well as of the outcomes prior to and following the therapy of choice, have been clinically identified. Among the key predictors are: ECG signals (recorded in ambulatory care clinics or by wearable devices such as Holter monitors and smart watches); biomarkers extracted from clinical data (age, race, sex, phenotypes, image-based parameters such as the amount of fibrosis, and atrial shape and size descriptors: surface area, anteroposterior diameter, biplane area-length volume); and information from patient electronic health records [58]. As described in more detail below, features of these descriptors can be extracted and used by AI-based models to predict AF risk and important outcomes such as: whether patients remain free of AF; whether AF recurs following ablation but with a reduced number of episodes; complete versus incomplete pulmonary vein (PV) isolation during ablation; worsening of comorbidities; embolic complications (e.g. stroke); and overall mortality (see . Fig. 6.4). The most important key descriptor of AF is the noninvasive ECG, a widely available measurement of cardiac electrical activity obtained by means of one or more surface electrodes. The ECG is an established clinical diagnostic biomarker of abnormal heart rhythm (too fast, too slow or with irregular beats), which can be easily digitized and transferred for interpretation. The typical components of recorded ECG signals are the P wave (corresponding to atrial depolarization), the QRS complex (ventricular depolarization), and the T wave (ventricular repolarization). The QRS complex is often converted to a Fourier spectrum in order to observe dominating events and potentially lethal AF in the 0−20 Hz frequency range [59].

. Fig. 6.4 Example of AI-based pipeline to predict AF risk and outcome

Unfortunately, detecting the signatures of the various AF morphologies needed for the classifiers that feed traditional regression models requires experts for interpretation, as well as dedicated resources to analyze large datasets, which are difficult to find. These limitations have prompted research into versatile methods built on deep neural networks that can cope with such large datasets. The early machine learning-based prediction models used PCA, SVM or random forest methods, and employed intensive preprocessing and noise-removal steps before extracting relevant morphological features from the ECG signals (e.g. slopes, peaks, amplitude timings) [60]. In contrast, modern convolutional neural network-based models (CNN, see . Chap. 3, Sect. 3.5) can use features extracted directly from raw ECGs for automated analysis. Whereas 2-D CNN models are suited to exploiting image structure, deep learning models using 1-D CNNs are able to segment and classify heart beats from ECGs. Each beat is labelled as normal or abnormal, enabling predictions of AF risk and of stroke complications as outcome [61]. Other models use 12-lead ECGs recorded in sinus rhythm to find patterns that predict incident AF [62], while recently developed CNN models can be trained with more than 1 million ECGs to accurately predict mortality as a primary AF outcome [29]. Complementary details on ECG analysis, not necessarily specific to AF, can be found in . Chap. 10, Sect. 10.3. Lastly, a notable recent breakthrough of AI-based methods for AF outcome prediction is in the area of remote monitoring technologies, where automatic AI algorithms have been applied to single-lead ECG traces obtained through mobile and smart watch-enabled recordings [63, 64]. It is envisioned that smart AI-based algorithms developed for consumer- or patient-facing applications (which are massively scalable) will soon exceed the capacity of human readers of ECGs. However, the utilization of these predictive models for AF risk and outcome is still hampered by the inconsistent quality of data collected in real time. This is mainly due to sporadically poor trace quality and to noisy data, which might introduce bias and error into the interpretation and the model output. To sum up, integrating complexity into AI-based prognostic models through multilayer deep learning models can result in rapid identification of ECG signal features and subtle patterns that are not typically recognizable by the human eye. Comprehensive and sizeable clinical datasets containing single-lead or multi-lead digital ECGs are also being linked to electronic health records (see . Chap. 10, Sect. 10.4), substantially contributing to the development and deployment of accurate AI models for AF risk and associated outcome prediction.
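As a rough illustration of the 1-D CNN idea discussed above, and not of any published architecture, the following PyTorch sketch classifies a single-lead ECG beat as normal or abnormal. The beat length, layer sizes and class count are illustrative assumptions.

```python
# Hedged sketch: a minimal 1-D CNN over the raw samples of one ECG beat.
import torch
import torch.nn as nn

class BeatCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),      # global pooling over time
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                 # x: (batch, 1, n_samples)
        z = self.features(x).squeeze(-1)  # (batch, 32)
        return self.classifier(z)         # logits for normal vs abnormal

model = BeatCNN()
beats = torch.randn(8, 1, 250)            # 8 beats of 250 raw samples each
print(model(beats).shape)                 # torch.Size([8, 2])
```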

6.7 Application: Risk Stratification in Ventricular Arrhythmia

Ventricular arrhythmia (VA) is the most frequent event leading up to SCD, which is among the major causes of death in developed countries. The spectrum of therapies includes the delivery of electrical shocks to the heart via implantable cardioverter defibrillators (ICDs) to prevent SCD, and catheter radiofrequency ablation as the potentially curative treatment. Both therapies involve invasive and risky interventions; thus, correct identification of patients at risk, as well as of the ablation targets (i.e. the discrete myocardial sites promoting arrhythmia), is crucial to prevent SCD and reduce procedural complications. The ICD is an implantable device used to deliver appropriate electrical therapies (antitachycardia pacing or shock) to terminate a VA episode. ICD implantation is performed preemptively in subjects identified as being at risk of developing potentially lethal VA. The objective of the therapy is to terminate the arrhythmic episode when it occurs, not to prevent its recurrence. The current recommendation for ICD patient selection for primary prevention relies largely on the LV ejection fraction, a key clinical index measuring the relative change of LV volume between end diastole and end systole [65]. Unfortunately, current clinical strategies based solely on the LV ejection fraction lead to numerous nonessential implants: up to three quarters of the selected patients do not receive any appropriate therapy within 5 years after implantation [66]. In addition, the current strategies miss more than 80% of SCD victims, whose LV ejection fraction is not severely altered. Radiofrequency ablation is an electrophysiology procedure that eliminates the VA source (known as the 'substrate') using an electrical current delivered by an intracardiac catheter whose tip is maneuvered onto the target. Ablation is proposed for patients in the advanced stage, who have experienced multiple VA episodes and who often received multiple, poorly tolerated ICD therapies. The objective is to modify the myocardial substrate on which arrhythmias occur, in order to prevent their recurrence. The main limitation of this treatment lies in the correct and exhaustive identification of the target. The current diagnosis strategy in the electrophysiology lab involves dangerous VA induction using programmed electrical stimulation [65], which is invasive and time consuming, and suffers from a limited ability to successfully induce arrhythmia and from the inaccessibility of some arrhythmogenic substrate locations. Therefore, accurate VA risk stratification is crucial for selecting the appropriate therapy for SCD prevention. Furthermore, the classification of VA patients should also be extended to the detection of specific arrhythmogenic areas for successful curative ablation interventions.


The Machine Learning Approach

The poor performance of the LV ejection fraction in patient selection for VA can be explained by the limited ability of a single descriptor to predict a phenomenon as complex as VA. This further highlights the limitations of classical statistical analyses, which usually focus on the impact of single descriptors. In contrast, machine learning provides models that can capture more complex statistical relationships and integrate more disparate and multidimensional data. There is, technically speaking, no limit to the number of variables that can be used as input to machine learning models. These could consist of demographics, medical history, medication therapy, laboratory results, and features extracted from ECG, imaging and clinical notes [45]. Ideally, since the model can learn through optimization to distinguish which inputs are useful, it is advisable to include all available descriptors as input to avoid feature selection bias. Machine learning-based models can also integrate the evolution of the input variables in a dynamic way [67]. For example, the RF-SLAM (Random Forest for Survival, Longitudinal and Multivariate) model allows the integration of baseline descriptors (pre-ICD implant) and dynamic descriptors (post-implant). The clinical descriptors updated at each follow-up, such as serial LV ejection fractions or the number of HF hospitalizations, are integrated into the prediction model to provide new information on the patient's biological response to the therapy and survival. Such a dynamic model provides a better understanding of the relation between the evolution of the variables and the outcome, allowing the flexibility needed in personalized medicine. Notably, the patient's HF status post-therapy plays an important role in predicting survival, while the serial LV ejection fractions may not contribute substantially [67]. However, more input variables lead to more complex models, which take longer to train and run and are more difficult to interpret. Moreover, external issues such as missing data or clinical practicality can be legitimate reasons to limit the number of input descriptors. Current statistical methods to reduce the number of input variables mostly rely on univariate and multivariate analyses of feature significance, where the variables with a significant p-value (< 0.05) are selected [67]. With machine learning models, feature importance analysis can also be used for feature selection: a primary model is first trained with all available input variables, and the top predictors are then extracted to be used as input to the main prediction model [68]. Feature importance analysis quantifies the degree to which each input variable contributes to the model decision. Feature importance algorithms can be model dependent, for example inspecting the weights or coefficients of the trained model, or model independent, for instance using the permutation importance algorithm (see https://scikit-learn.org/stable/modules/permutation_importance.html#id2), which looks at the score decrease when a feature is absent. Understanding the importance of each input variable provides transparency and interpretability, which are required to escape from "black-box machine learning models" (see also . Chap. 8, Sect. 8.3, and . Chap. 9, Sect. 9.7), and helps to increase the trust in and viability of the models in clinical practice.
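As a concrete illustration, permutation importance is directly available in scikit-learn. The sketch below uses synthetic data, and the feature names, which echo descriptors discussed in this section, are purely illustrative.

```python
# Hedged sketch: model-independent feature importance by permutation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
# Hypothetical descriptor names, for illustration only
names = ["LBBB", "serum_Mg", "antiarrhythmics", "scar_size", "gray_zone", "age"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Score decrease when each feature is shuffled (i.e. effectively "absent")
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{names[i]}: {result.importances_mean[i]:.3f}")
```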

Features such as the presence of left bundle branch block, serum magnesium, antiarrhythmic drugs, LV scar size, and LV gray zone have been reported to be among the most influential clinical descriptors of VA [67, 68]. The integration of multi-modality data allows the machine learning model to play a vital role in personalized medicine. Understanding the importance of imaging features in the prediction of outcome, such as LV scar or gray zone, is also a pivotal step in radiomics, an emerging field that explores a large variety of quantitative features derived from medical images.

Feature Extraction Before Learning
In VA risk prediction, the cardiac descriptors used as input for machine learning can be extracted from different imaging modalities, including echocardiography, CMR or CT imaging. Depending on the available imaging data, these descriptors can be static or dynamic. Static features are extracted from an image captured at a specific moment of the cardiac cycle, in general end diastole and/or end systole. These can include anatomical features of the heart such as myocardial scar, myocardial thickness, or LV volume. Dynamic features describe the movement of the heart and can be extracted from image sequences throughout the cardiac cycle; examples include myocardial displacement, strain, or strain rate along the main anatomical directions of the heart. Feature extraction usually starts with image segmentation and tracking, which is generally performed fully manually or semi-automatically. From there, the abovementioned features can be extracted by image processing. Deep learning methods can be used for robust automatic segmentation, and even tracking, in many cardiac imaging modalities (see . Chap. 4, Sect. 4.4). This allows fully-automated feature extraction, which is crucial to exploit large databases where manual extraction on all cases may not be feasible. Nevertheless, automatic segmentation models still need to be improved for some tasks on specific imaging modalities, such as LV myocardial scar delineation from late gadolinium enhancement CMR [69, 70]. In such cases, manual segmentation, or at least manual correction of an automatic delineation, is highly recommended. Extracting relevant features from images plays a crucial role in refining the raw image data, which may not be directly suitable for a given machine learning-based outcome prediction model (using classification or regression). This also serves to obtain features that are more understandable to humans, thus allowing better interpretability of the prediction, especially when combined with the feature importance analysis mentioned above. There are still some limitations to this approach. First, the extracted features are usually grouped into regional and global features (namely, by averaging local values across a given region or the whole myocardium). Regional features allow some flexibility with respect to regional heterogeneity compared to global features, but they do not allow the assessment of finer heterogeneities within the region. Second, using the extracted features as inputs to a machine learning-based outcome prediction model does not explicitly provide spatial information to the machine learning model.


The model would have to learn these spatial or temporal relations between the variables during its optimization, which represents an unnecessary extra step. In contrast, this spatiotemporal information is not lost for CNN models, for instance, which allow the direct use of imaging or ECG sequence data as input. Finally, the limited types of features usually extracted from imaging data can also lead to selection bias: only the same known features are studied, while other features (still present in the images) are ignored.

Going Further With Deep Learning Models

Deep learning models are capable of predicting outcome without having to extract specific features from the images, as exemplified by many classification problems in computer vision, which fostered the popularity of CNNs. In healthcare applications, direct diagnosis can be obtained in a large range of domains including (but not limited to) dermatology, cancer or lesion detection, and fracture detection. Nonetheless, direct VA classification using a deep learning model working on raw image data as input has not yet been reported. A potential reason is that it requires considering complex cardiac data, in 3-D or even 3-D+time, meaning additional complexity of the input data for a classification model. In the context of VA risk stratification, myocardial scar is considered a substrate leading to the VA mechanism. Electrophysiology assessment has linked myocardial scar with myocardial fibrosis, a pathological remodelling of the cardiac muscle [71]. The gold standard technique to visualise scar is late gadolinium enhanced (LGE) CMR. Features of LV myocardial scar extracted from LGE CMR imaging have been shown to be strong predictors of VA, both on their own [72] and combined with other descriptors in a machine learning prediction model [67]. Although scar delineation requires manual segmentation by an expert (until automatic methods reach acceptable performance on data from clinical routine), these works highlight the potential of deep learning models to use scar segmentation data for VA risk stratification. CT imaging has also proven to be relevant for myocardial scar imaging, in the form of visualizing wall thinning, which is known to have electrophysiological properties similar to the scar regions observed in LGE CMR images [73]. Thus, for CT imaging, a major objective would be to first quantify LV wall thinning using segmentation techniques, which would alleviate the extra task of segmenting scar within the myocardium, still challenging in CMR images. Moreover, compared to CMR images, CT images have better contrast and resolution, and image acquisition is better standardized across imaging centres and scanner manufacturers. These conditions make the use of deep learning models for automatic segmentation attractive. Once the LV myocardium is segmented, the 3-D structure can be further simplified by calculating myocardial thickness locally and projecting these values onto a 2-D Bull's eye representation of the whole LV. This flattening helps to reduce the dimensionality of the data and therefore the computations involved during learning. It also limits the effect of (zero) values outside the myocardium if the 2-D or 3-D data are considered as 2-D or 3-D images.


The Bull's eye flattening of the LV is inspired by the American Heart Association 17-segment model, which helps physicians to better understand the distribution of input values across the 3-D LV myocardium. These steps can be fully automated, meaning that outcome prediction studies can be performed on large datasets. Datasets of CT images are also easier to construct, due to the wider availability and inclusiveness of this imaging modality (e.g. it can be used on patients with metal implants) compared to CMR imaging, which enhances the feasibility of building large prospective databases in the near future. Through their optimization, deep learning models can learn the relationship between the extent, position, and heterogeneity of the wall thinning region and the patient's risk of arrhythmia. These models have to learn to distinguish pathological from physiological wall thinning, as observed at the base and apex of the LV, and from thickness heterogeneities caused by the papillary muscles and trabeculations. This approach has recently been shown to outperform predictions based on the LV ejection fraction [74].

Explainability With Deep Learning Models

With "standard" machine learning models, input features are generally designed beforehand by the user, and their relative importance can be studied to interpret the prediction. In contrast, deep learning models stand more as "black boxes" that transform the input data into a prediction, which limits human understanding of the model decision (see also . Chap. 8, Sect. 8.3, and . Chap. 9, Sect. 9.7). While their performance might be higher, the lack of explainability can clearly limit the trust of both physicians and patients, therefore making the integration of deep learning models into clinical practice harder to justify. To achieve some transparency with deep learning classification methods, visual explanations of the prediction can be estimated in the form of an attention map, answering the question why did the model predict what it predicted? To illustrate this, . Fig. 6.5 shows the attention map calculated with the GradCAM++ technique [75] (see Technical Note, below) from a positive VA prediction [74]. A trained classification model was used to classify the 2-D LV thickness map input, which provided two scores for VA+ and VA− (i.e. presence or absence of VA). Following the GradCAM++ method, the positive score was backpropagated to the "last", i.e. the deepest, convolutional layer to generate the classification attention map. We can observe that the map highlights the thinning region in the input, which allows a clearer understanding of the model's prediction and further confirms the initial hypothesis linking myocardial thinning with risk of VA.

. Fig. 6.5 The Ventricular Arrhythmia (VA) prediction output and corresponding attention map [74]. The classification model classified the input thickness map as VA+ (with a score of 0.97). From the VA+ score, the GradCAM++ method [75] used gradient back propagation to generate the attention map, which highlighted the regions most influential to the model prediction

Technical Note

The class activation mapping (CAM) method follows the fundamental assumption that $Y^c = \sum_k w_k^c \sum_i \sum_j A_{ij}^k$, where $Y^c$ is the classification score of class $c$, $w_k^c$ is the weight for each specific feature map, $A^k$ is the feature map of $i \times j$ resolution from the last convolutional layer of $k$ filters, and $i$ and $j$ stand for the row/column indices of each pixel. In other words, the classification score of class $c$ can be calculated as a linear combination of the global sums of the last convolutional feature maps $A_{ij}^k$, with the unknown weights $w_k^c$ for each feature map $k$. The class-specific attention map at spatial location $(i, j)$ can then be calculated as $L_{ij}^c = \sum_k w_k^c A_{ij}^k$. The weights $w_k^c$ can be obtained directly by applying global average pooling (GAP, see . Chap. 4, Sect. 4.5) on the feature maps $A^k$, although this imposes that the activation output (softmax or sigmoid function) is applied directly after the GAP layer, as suggested by the original CAM method [76]. However, this method requires changes to the model's architecture, which in turn requires retraining. To work around this limitation, gradient back propagation methods can be used to solve for $w_k^c$ [75, 77]. Back propagation does not require architecture modification or model re-training and is directly applicable to a pretrained network. The formulation based on the positive gradient, as proposed by [75] in the GradCAM++ method, showed higher attention accuracy compared to the earlier GradCAM method [77].
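To make these formulas concrete, the following hedged sketch applies the gradient-based CAM idea to a toy 2-D CNN in PyTorch, standing in for a thickness-map classifier. It uses plain GradCAM-style gradient averaging to obtain the weights $w_k^c$, omitting the positive-gradient weighting of GradCAM++, and the network is an assumption, not the model of [74].

```python
# Hedged sketch: gradient-based class activation map on a toy 2-D CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
head = nn.Linear(16, 2)                       # VA+ / VA- scores

x = torch.randn(1, 1, 64, 64)                 # stand-in 2-D LV thickness map
A = conv(x)                                   # feature maps A^k, last conv layer
A.retain_grad()                               # keep dY/dA after backward
y = head(A.mean(dim=(2, 3)))                  # GAP + linear head -> scores Y^c
y[0, 0].backward()                            # back-propagate the VA+ score

w = A.grad.mean(dim=(2, 3), keepdim=True)     # weights w_k^c from pooled gradients
cam = F.relu((w * A).sum(dim=1)).detach()     # attention map L^c (positive part)
print(cam.shape)                              # torch.Size([1, 64, 64])
```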

6.8 Closing Remarks

This section has provided an overview of some of the key issues in outcome prediction, as well as reviews of the state of the art in three exemplar areas. We emphasise that research into outcome prediction is not limited to these areas; indeed, some of the highest-profile work has come from other applications. Of particular note, [24] demonstrated how survival prediction could be performed in pulmonary hypertension patients using only motion estimated from cine CMR data. Such techniques, as well as those reviewed in this section, if streamlined and translated into clinical practice, could have a major impact on risk stratification and patient management in a wide range of applications. Next, after some self-assessment exercises, we proceed to a practical tutorial on outcome prediction, in which you will have the chance to develop machine learning models for predicting the outcome of subjects based on their cardiac shape.

6.9 Exercises

Exercise 1 What does a Kaplan-Meier curve illustrate? How could you use Kaplan-Meier curves to evaluate the effect of a treatment or intervention? Could this approach be applied to evaluate the prediction made by an AI model for outcome prediction?

Exercise 2 What types of data sources can typically be exploited in outcome prediction? How does your answer change when considering traditional and AI-based approaches?

Exercise 3 Explain the potential advantages/disadvantages of supervised and unsupervised analysis of data for outcome prediction.

Exercise 4 As well as the three exemplar applications presented in this book, what other applications have been studied in terms of the use of AI for outcome prediction in cardiology? You may wish to perform a brief literature review to help you answer.

6.10 Tutorial—Outcome Prediction

Tutorial 5 As for the other notebooks, the contents of this notebook are accessible as Electronic Supplementary Material.

Overview

In this hands-on tutorial, we aim to predict the outcome of subjects based on their cardiac shape (here, mimicking the 2-D LV myocardial contour extracted from 4-chamber echocardiographic views). We designed synthetic data (both 2-D shapes and outcome labels) specifically tailored for the purpose of this tutorial. You will focus on two strategies: first, (supervised) regression with the Partial Least-Squares method, which also performs dimensionality reduction, which we will exploit for interpretation purposes; then, an unsupervised approach that chains two standard algorithms for dimensionality reduction and clustering (as shown in the figure below), which you will compare to the supervised approach.

Objectives
• Conduct a simple outcome prediction problem using high-dimensional data as input, with the help of the scikit-learn tools.
• Understand the differences between supervised and unsupervised ways of handling this problem.

Computing Requirements
As for the other hands-on tutorials, this notebook starts with a brief "System setting" section, which imports the necessary packages, installs the potentially missing ones, and imports our own modules.
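As a preview of the two strategies, the sketch below runs both on synthetic stand-in data with scikit-learn. The array sizes, and the choice of PCA followed by k-means for the unsupervised chain, are assumptions for illustration; the notebook defines its own data and pipeline.

```python
# Hedged sketch: supervised PLS regression vs an unsupervised
# dimensionality-reduction + clustering chain.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))     # flattened 2-D contour coordinates (stand-in)
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=100)  # synthetic outcome

# Supervised: PLS performs dimensionality reduction while regressing the outcome
pls = PLSRegression(n_components=2).fit(X, y)
print("PLS R^2:", round(pls.score(X, y), 2))

# Unsupervised: reduce dimensionality with PCA, then cluster with k-means
Z = PCA(n_components=2).fit_transform(X)
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print("cluster sizes:", np.bincount(groups))
```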

6.11 Opinion

In the wider health domain, deep learning has achieved successes in forecasting survival from high-dimensional inputs such as cancer genomic profiles and gene expression data [78, 79], and in formulating personalized treatment recommendations [80]. Integrative approaches to risk classification have used unsupervised clustering of broad clinical variables to identify heart failure patients with distinct risk profiles [81, 82], while supervised machine learning algorithms can diagnose, risk stratify and predict adverse events from health record and registry data [83−85].


However, in an era of machine learning and AI, it is increasingly desirable that we extract quantitative biomarkers from medical images that inform on disease detection, characterization, monitoring and assessment of response to treatment. Quantitation has the potential to provide objective decision support tools in the management pathway of patients. Despite this, the quantitative potential of imaging remains under-exploited because of variability of the measurements, lack of harmonized systems for data acquisition and analysis, and, crucially, a paucity of evidence on how such quantitation potentially affects clinical decision making and patient outcome. Machine learning will not have an impact on primary or secondary care until a consensus is reached on how algorithmic approaches shape guideline-driven management in specific conditions and settings. Common pitfalls that can undermine machine learning-based applications include issues of transparency, reproducibility, ethics, and effectiveness [86], and there is a pressing need for strategies to address the risk of bias when reporting performance [87]. A key challenge remains access to high-quality data at scale that reflect temporal disease dynamics, heterogeneity across diverse populations, and response to interventions. Trusted research environments (TREs) facilitate accredited large-scale access to health data held to common data standards, enabling a "National Grid" of federated learning resources for researchers [88]. Such initiatives are already showing agility in providing a population-wide resource to support research on COVID-19 and cardiovascular disease [89], heralding a future where national or trans-national person-level data are discoverable and accessible to researchers through a single gateway, providing a transformative substrate for outcome analysis. The nature of what we consider health data is also being reframed, enabling inferences on cardiovascular disease to be made from diverse sources such as facial imaging [90], social media activity [91] and smart wearable devices [92]. There is currently a lack of device standardization and validity testing, but such approaches could offer minimally intrusive means for continuous monitoring of population-level trends and individual-level events. This also invites us to re-evaluate the choice of outcomes we use for risk stratification and study endpoints. Relatively few studies have comprehensively examined how lifestyle interventions may improve life expectancy free from major diseases such as diabetes, cardiovascular disease, and cancer [93]. While the focus of current machine learning research is on disease classification or mortality prediction, a key contribution to real-world practice may be predicting how interventions improve "health-span" to avoid or delay the onset of multimorbidity. Machine learning itself also offers an alternative to the challenge of personalization in the context of interventional trials. More flexible data-driven alternatives to classic randomized controlled trials may learn the relationships between actions, context, and outcomes, allowing an estimation of causal effects from the probability of receiving a treatment conditional on patient characteristics [94].

However, the goal of digitally-enabled "personalized" medical care faces serious challenges, many of which cannot be addressed through algorithmic complexity alone [95]. To learn a causal effect, we need to estimate not just the most likely outcome, as in a classical prediction task, but what would have happened if things had been different: a counterfactual prediction [96]. Endeavours in causal inference and causal discovery remain largely unexplored, especially for medical imaging data. In this context, they could lead to the discovery of new applications for personalized counterfactual predictions, such as: what would cardiovascular function have looked like if the patient had not been exposed to a specific risk factor [97]? While conventional machine learning approaches identify risk factors associated with a future endpoint, reframing this as a counterfactual inference task improves performance where there are multiple possible causes for an outcome [98]. How might these advances in AI reshape the delivery of healthcare? Firstly, they could disrupt the conventional linear pathway of self-referral to primary care, specialist referral, and investigations eventually leading to a therapeutic intervention. Care could become more pro-active and anticipatory, integrating data from multiple sources in the community to guide lifestyle interventions and primary prevention strategies. While traditional investigations are performed at specialist centres, AI could democratize this workflow by providing expert-level diagnostics at the point of care by physicians, or even through direct-to-consumer technology. The integration of diverse data sources with innovative risk modelling could realise the 'Digital Twin' ambition of an individual-level causal framework for precision cardiology [99]. Finally, conventional diagnostic labels could become an irrelevance as we better understand the high-dimensional space that characterizes dynamic disease processes, their associated risks and the effect of time-dependent interventions. This foresees healthcare providers trading discrete diagnostic classifications for improved patient-valued outcomes.

Acknowledgements This work was supported by the French Government, through the National Research Agency (ANR) Investments in the Future with 3IA Côte d'Azur (ANR-19-P3IA-0002) and IHU Liryc (ANR-10-IAHU-04), and through Université Côte d'Azur STIC Doctoral School. ND was supported by the French ANR (LABEX PRIMES of Univ. Lyon [ANR-11-LABX-0063] within the program "Investissements d'Avenir" [ANR-11-IDEX-0007], and the JCJC project "MIC-MAC" [ANR-19-CE45-0005]).

References
1. Pate A, Emsley R, Ashcroft DM, Brown B, van Staa T. The uncertainty with using risk prediction models for individual decision making: an exemplar cohort study examining the prediction of cardiovascular disease in English primary care. BMC Med. 2019;17(134).
2. Grundy SM, Stone NJ, Bailey AL, Beam C, Birtcher KK, Blumenthal RS, Braun LT, de Ferranti S, Faiella-Tommasino J, Forman DE, Goldberg R, Heidenreich PA, Hlatky MA, Jones DW, Lloyd-Jones D, Lopez-Pajares N, Ndumele CE, Orringer CE, Peralta CA, Saseen JJ, Smith SC, Sperling L, Virani SS, Yeboah J. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: executive summary: a report of the American College of Cardiology/American Heart Association task force on clinical practice guidelines. J Am Coll Cardiol. 2019;73(24):3168−209.
3. Blaha MJ, Mortensen MB, Kianoush S, Tota-Maharaj R, Cainzos-Achirica M. Coronary artery calcium scoring: is it time for a change in methodology? JACC Cardiovasc Imaging. 2017;10(8):923−37.
4. Priori SG, Blomström-Lundqvist C, Mazzanti A, Blom N, Borggrefe M, Camm J, Elliott PM, Fitzsimons D, Hatala R, Hindricks G, Kirchhof P, Kjeldsen K, Kuck K-H, Hernandez-Madrid A, Nikolaou N, Norekvål TM, Spaulding C, Van Veldhuisen DJ, ESC Scientific Document Group. 2015 ESC guidelines for the management of patients with ventricular arrhythmias and the prevention of sudden cardiac death: The Task Force for the Management of Patients with Ventricular Arrhythmias and the Prevention of Sudden Cardiac Death of the European Society of Cardiology (ESC). Endorsed by: Association for European Paediatric and Congenital Cardiology (AEPC). Eur Heart J. 2015;36(41):2793−867.
5. Stecker EC, Vickers C, Waltz J, Socoteanu C, John BT, Mariani R, McAnulty JH, Gunson K, Jui J, Chugh SS. Population-based analysis of sudden cardiac death with and without left ventricular systolic dysfunction: two-year findings from the Oregon Sudden Unexpected Death Study. J Am Coll Cardiol. 2006;47(6):1161−6.
6. van der Bijl P, Delgado V, Bax JJ. Imaging for sudden cardiac death risk stratification: current perspective and future directions. Prog Cardiovasc Dis. 2019;62(3):205−11.
7. Johnson KW, Shameer K, Glicksberg BS, Readhead B, Sengupta PP, Björkegren JL, Kovacic JC, Dudley JT. Enabling precision cardiology through multiscale biology and systems medicine. JACC Basic Transl Sci. 2017;2(3):311−27.
8. Cikes M, Solomon SD. Beyond ejection fraction: an integrative approach for assessment of cardiac structure and function in heart failure. Eur Heart J. 2015;37(21):1642−50.
9. deSouza NM, Achten E, Alberich-Bayarri A, Bamberg F, Boellaard R, Clément O, Fournier L, Gallagher F, Golay X, Heussel CP, Jackson EF, Manniesing R, Mayerhofer ME, Neri E, O'Connor J, Oguz KK, Persson A, Smits M, van Beek EJR, Zech CJ, European Society of Radiology. Validated imaging biomarkers as decision-making tools in clinical trials and routine practice: current status and recommendations from the EIBALL* subcommittee of the European Society of Radiology (ESR). Insights Imaging. 2019;10(87).
10. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080.
11. Adamson PD, Newby DE, Hill CL, Coles A, Douglas PS, Fordyce CB. Comparison of international guidelines for assessment of suspected stable angina: insights from the PROMISE and SCOT-HEART. JACC Cardiovasc Imaging. 2018;11(9):1301−10.
12. Tada H, Fujino N, Nomura A, Nakanishi C, Hayashi K, Takamura M, Kawashiri MA. Personalized medicine for cardiovascular diseases. J Hum Genet. 2020.
13. Mesko B, Görög M. A short guide for medical professionals in the era of artificial intelligence. NPJ Digit Med. 2020;3:09.
14. Kelly C, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17.
15. Krittanawong C, Virk HUH, Bangalore S, Wang Z, Johnson KW, Pinotti R, Zhang H, Kaplin S, Narasimhan B, Kitai T, Baber U, Halperin JL, Tang WHW. Machine learning prediction in cardiovascular diseases: a meta-analysis. Sci Rep. 2020;10.
16. Nagendran M, Chen Y, Lovejoy C, Gordon A, Komorowski M, Harvey H, Topol E, Ioannidis J, Collins G, Maruthappu M. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.
17. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, Mahendiran T, Moraes G, Shamdas M, Kern C, Ledsam JR, Schmid MK, Balaskas K, Topol EJ, Bachmann LM, Keane PA, Denniston AK. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1:e271−97.
18. Littlejohns TJ, Holliday J, Gibson LM, Garratt S, Oesingmann N, Alfaro-Almagro F, Bell JD, Boultwood C, Collins R, Conroy MC, Crabtree N, Doherty N, Frangi AF, Harvey NC, Leeson P, Miller KL, Neubauer S, Petersen SE, Sellors J, Sheard S, Smith SM, Sudlow CLM, Matthews PM, Allen NE. The UK Biobank imaging enhancement of 100,000 participants: rationale, data collection, management and future directions. Nat Commun. 2020;11.
Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368: m689. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, Mahendiran T, Moraes G, Shamdas M, Kern C, Ledsam JR, Schmid MK, Balaskas K, Topol EJ, Bachmann LM, Keane PA, Denniston AK. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1:e271-97. Littlejohns TJ, Holliday J, Gibson LM, Garratt S, Oesingmann N, Alfaro-Almagro F, Bell JD, Boultwood C, Collins R, Conroy MC, Crabtree N, Doherty N, Frangi AF, Harvey NC, Leeson P, Miller KL, Neubauer S, Petersen SE, Sellors J, Sheard S, Smith SM, Sudlow CLM, Matthews

19. Ahrens WA. The German National Cohort: aims, study design and current status. Eur J Public Health. 2019;29.
20. Sebire NJ, Cake C, Morris AD. HDR UK supporting mobilising computable biomedical knowledge in the UK. BMJ Health Care Inform. 2020;27(2).
21. Rieke N, Hancox J, Li W, Milletarí F, Roth HR, Albarqouni S, Bakas S, Galtier MN, Landman BA, Maier-Hein K, Ourselin S, Sheller M, Summers RM, Trask A, Xu D, Baust M, Cardoso MJ. The future of digital health with federated learning. NPJ Digit Med. 2020;3:119.
22. Harvey H, Glocker B. A standardised approach for preparing imaging data for machine learning tasks in radiology. Springer International Publishing; 2019. p. 61−72.
23. O'Regan D. Putting machine learning into motion: applications in cardiovascular imaging. Clin Radiol. 2020;75(1):33−7.
24. Bello GA, Dawes TJW, Duan J, Biffi C, de Marvao A, Howard LSGE, Gibbs JSR, Wilkins MR, Cook SA, Rueckert D, O'Regan DP. Deep-learning cardiac motion analysis for human survival prediction. Nat Mach Intell. 2019;1:95−104.
25. Senn S. Mastering variation: variance components and personalised medicine. Stat Med. 2016;35(7):966−77.
26. British Heart Foundation. Putting patients at the heart of artificial intelligence: all party parliamentary group on heart and circulatory diseases. 2019.
27. Chen L. Overview of clinical prediction models. Ann Transl Med. 2020;8(4):71.
28. Cui J. Overview of risk prediction models in cardiovascular disease research. Ann Epidemiol. 2009;19(10):711−7.
29. Raghunath S, Ulloa Cerna AE, Jing L, VanMaanen DP, Stough J, Hartzel DN, Leader JB, Kirchner HL, Stumpe MC, Hafez A, Nemani A, Carbonati T, Johnson KW, Young K, Good CW, Pfeifer JM, Patel AA, Delisle BP, Alsaid A, Beer D, Haggerty CM, Fornwalt BK. Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network. Nat Med. 2020;26(6):886−91.
30. Desai N, Giugliano R. Can we predict outcomes in atrial fibrillation? Clin Cardiol. 2012;35(S1):10−4.
31. Bland J, Altman D. Survival probabilities (the Kaplan-Meier method). BMJ. 1998;317(7172):1572.
32. Cardillo G. KMplot. GitHub, January 12, 2022. https://github.com/dnafinder/kmplot.
33. Goel M, Khanna P, Kishore J. Understanding survival analysis: Kaplan-Meier estimate. Int J Ayurveda Res. 2010;1(4):274−8.
34. Jia X, Baig M, Mirza F, Hosseini H. A Cox-based risk prediction model for early detection of cardiovascular disease: identification of key risk factors for the development of a 10-year CVD risk prediction. Adv Prev Med. 2019;8392348.
35. Staerk L, Preis S, Lin H, Casas J, Lunetta K, Weng L, Anderson C, Ellinor P, Lubitz S, Benjamin E, Trinquart L. Novel risk modeling approach of atrial fibrillation with restricted mean survival times: application in the Framingham Heart Study community-based cohort. Circ Cardiovasc Qual Outcomes. 2020;13(4):e005918.
36. Yeung-Lai-Wah J, Qi A, Uzun O, Humphries K, Kerr C. Long-term survival following radiofrequency catheter ablation of atrioventricular junction for atrial fibrillation: clinical and ablation determinants of mortality. J Interv Card Electrophysiol. 2002;6(1):17−23.
37. Kornej J, Hindricks G, Shoemaker M, Husser D, Arya A, Sommer P, Rolf S, Saavedra P, Kanagasundram A, Patrick Whalen S, Montgomery J, Ellis C, Darbar D, Bollmann A. The APPLE score: a novel and simple score for the prediction of rhythm outcomes after catheter ablation of atrial fibrillation. Clin Res Cardiol. 2015;104(10):871−6.
38. Potpara T, Mujovic N, Sivasambu B, Shantsila A, Marinkovic M, Calkins H, Spragg D, Lip G. Validation of the MB-LATER score for prediction of late recurrence after catheter-ablation of atrial fibrillation. Int J Cardiol. 2019;276:130−5.
39. Lankveld T, Zeemering S, Scherr D, Kuklik P, Hoffmann B, Willems S, Pieske B, Haïssaguerre M, Jaïs P, Crijns H, Schotten U. Atrial fibrillation complexity parameters derived from surface ECGs predict procedural outcome and long-term follow-up of stepwise catheter ablation for atrial fibrillation. Circ Arrhythm Electrophysiol. 2016;9(2):e003354.

40. Chelu M, King J, Kholmovski E, Ma J, Gal P, Marashly Q, AlJuaid M, Kaur G, Silver M, Johnson K, Suksaranjit P, Wilson B, Han F, Elvan A, Marrouche N. Atrial fibrosis by late gadolinium enhancement magnetic resonance imaging and catheter ablation of atrial fibrillation: 5-year follow-up data. J Am Heart Assoc. 2018;7(23):e006313.
41. Jia S, Nivet H, Harrison J, Pennec X, Camaioni C, Jaïs P, Cochet H, Sermesant M. Left atrial shape is independent predictor of arrhythmia recurrence after catheter ablation for atrial fibrillation: a shape statistics study. Heart Rhythm O2. 2021;2(6):622−32.
42. Christodoulou E, Ma J, Collins G, Steyerberg E, Verbakel J, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12−22.
43. van der Meer P, Gaggin HK, Dec GW. ACC/AHA versus ESC guidelines on heart failure: JACC guideline comparison. J Am Coll Cardiol. 2019;73(21):2756−68.
44. Prinzen FW, Vernooy K, Auricchio A. Cardiac resynchronization therapy: state-of-the-art of current applications, guidelines, ongoing trials, and areas of controversy. Circulation. 2013;128(22):2407−18.
45. Bazoukis G, Stavrakis S, Zhou J, Bollepalli SC, Tse G, Zhang Q, Singh JP, Armoundas AA. Machine learning versus conventional clinical methods in guiding management of heart failure patients: a systematic review. Heart Fail Rev. 2021;26(1):23−34.
46. Yu CM, Bleeker GB, Fung JWH, Schalij MJ, Zhang Q, Van Der Wall EE, Chan YS, Kong SL, Bax JJ. Left ventricular reverse remodeling but not clinical improvement predicts long-term survival after cardiac resynchronization therapy. Circulation. 2005;112(11):1580−6.
47. Peressutti D, Sinclair M, Bai W, Jackson T, Ruijsink J, Nordsletten D, Asner L, Hadjicharalambous M, Rinaldi CA, Rueckert D, King AP. A framework for combining a motion atlas with non-motion information to learn clinically useful biomarkers: application to cardiac resynchronisation therapy response prediction. Med Image Anal. 2017;35:669−84.
48. Puyol-Antón E, Chen C, Clough JR, Ruijsink B, Sidhu BS, Gould J, Porter B, Elliott M, Mehta V, Rueckert D, Rinaldi CA, King AP. Interpretable deep models for cardiac resynchronisation therapy response prediction. Lecture Notes in Computer Science. 2020;12261:284−93.
49. Tokodi M, Schwertner WR, Kovács A, Tosér Z, Staub L, Sárkány A, Lakatos BK, Behon A, Boros AM, Perge P, Kutyifa V, Széplaki G, Gellér L, Merkely B, Kosztin A. Machine learning-based mortality prediction of patients undergoing cardiac resynchronization therapy: the SEMMELWEIS-CRT score. Eur Heart J. 2020;41(18):1747−56.
50. Kalscheur MM, Kipp RT, Tattersall MC, Mei C, Buhr KA, DeMets DL, Field ME, Eckhardt LL, Page CD. Machine learning algorithm predicts cardiac resynchronization therapy outcomes: lessons from the COMPANION trial. Circ Arrhythm Electrophysiol. 2018;11(1):1−11.
51. Peressutti D, Bai W, Jackson T, Sohal M, Rinaldi A, Rueckert D, King A. Prospective identification of CRT super responders using a motion atlas and random projection ensemble learning. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing; 2015. p. 493−500.
52. Dawood T, Chen C, Andlauer R, Sidhu BS, Ruijsink B, Gould J, Porter B, Elliott M, Mehta V, Rinaldi CA, Puyol-Antón E, Razavi R, King AP. Uncertainty-aware training for cardiac resynchronisation therapy response prediction. In: Puyol Antón E, Pop M, Martín-Isla C, Sermesant M, Suinesiaputra A, Camara O, Lekadir K, Young A, editors. Statistical Atlases and Computational Models of the Heart. Multi-Disease, Multi-View, and Multi-Center Right Ventricular Segmentation in Cardiac MRI Challenge. Cham: Springer International Publishing; 2022. p. 189−98.
53. Galli E, Le Rolle V, Smiseth OA, Duchenne J, Aalen JM, Larsen CK, Sade EA, Hubert A, Anilkumar S, Penicka M, Linde C, Leclercq C, Hernandez A, Voigt JU, Donal E. Importance of systematic right ventricular assessment in cardiac resynchronization therapy candidates: a machine learning approach. J Am Soc Echocardiogr. 2021;34(5):494−502.
54. Cikes M, Sanchez-Martinez S, Claggett B, Duchateau N, Piella G, Butakoff C, Pouleur AC, Knappe D, Biering-Sørensen T, Kutyifa V, Moss A, Stein K, Solomon SD, Bijnens B. Machine learning-based phenogrouping in heart failure to identify responders to cardiac resynchronization therapy. Eur J Heart Fail. 2019;21(1):74−85.

55. Parsai C, Bijnens B, Sutherland G, Baltabaeva A, Claus P, Marciniak M, Paul V, Scheffer M, Donal E, Derumeaux G, Anderson L. Toward understanding response to cardiac resynchronization therapy: left ventricular dyssynchrony is only one of multiple mechanisms. Eur Heart J. 2009;30(8):940−9.
56. Zhuang J, Wang J, Hoi SC, Lan X. Unsupervised multiple kernel learning. J Mach Learn Res. 2011;20:129−45.
57. Lamori JC, Mody SH, Patel AA, Schein JR, Gross HJ, Dacosta Dibonaventura M, Nelson WW. Burden of comorbidities among patients with atrial fibrillation. Ther Adv Cardiovasc Dis. 2013;7(2):53−62.
58. Vlachos K, Letsas KP, Korantzopoulos P, Liu T, Georgopoulos S, Bakalakos A, Karamichalakis N, Xydonas S, Efremidis M, Sideris A. Prediction of atrial fibrillation development and progression: current perspectives. World J Cardiol. 2016;8(3):267.
59. Traykov VB, Pap R, Saghy L. Frequency domain mapping of atrial fibrillation: methodology, experimental data and clinical implications. Curr Cardiol Rev. 2012;8(3):231−8.
60. Lyon A, Mincholé A, Martínez J, Laguna P, Rodriguez B. Computational techniques for ECG analysis and interpretation in light of their contribution to medical advances. J R Soc Interface. 2018;15:20170821.
61. Xiong Z, Nash MP, Cheng E, Fedorov VV, Stiles MK, Zhao J. ECG signal classification for the detection of cardiac arrhythmias using a convolutional recurrent neural network. Physiol Meas. 2019;39(9).
62. Melzi P, Tolosana R, Cecconi A, Sanz-Garcia A, Ortega GJ, Jimenez-Borreguero LJ, Vera-Rodriguez R. Analyzing artificial intelligence systems for the prediction of atrial fibrillation from sinus-rhythm ECGs including demographics and feature visualization. Sci Rep. 2021;11(1):1−10.
63. Santala OE, Halonen J, Martikainen S, Jäntti H, Rissanen TT, Tarvainen MP, Laitinen TP, Laitinen TM, Väliaho ES, Hartikainen JE, Martikainen TJ, Lipponen JA. Automatic mobile health arrhythmia monitoring for the detection of atrial fibrillation: prospective feasibility, accuracy, and user experience study. JMIR Mhealth Uhealth. 2021;9(10):1−12.
64. Duncker D, Ding WY, Etheridge S, Noseworthy PA, Veltmann C, Yao X, Jared Bunch T, Gupta D. Smart wearables for cardiac monitoring: real-world use beyond atrial fibrillation. Sensors. 2021;21(7):1−25.
65. Al-Khatib SM, Stevenson WG, Ackerman MJ, Bryant WJ, Callans DJ, Curtis AB, Deal BJ, Dickfeld T, Field ME, Fonarow GC, Gillis AM, Granger CB, Hammill SC, Hlatky MA, Joglar JA, Kay GN, Matlock DD, Myerburg RJ, Page RL. 2017 AHA/ACC/HRS guideline for management of patients with ventricular arrhythmias and the prevention of sudden cardiac death. J Am Coll Cardiol. 2018;72(14):e91−e220.
66. Saxon LA, Hayes DL, Gilliam FR, Heidenreich PA, Day J, Seth M, Meyer TE, Jones PW, Boehmer JP. Long-term outcome after ICD and CRT implantation and influence of remote device follow-up: the ALTITUDE survival study. Circulation. 2010;122(23):2359−67.
67. Wu KC, Wongvibulsin S, Tao S, Ashikaga H, Stillabower M, Dickfeld TM, Marine JE, Weiss RG, Tomaselli GF, Zeger SL. Baseline and dynamic risk predictors of appropriate implantable cardioverter defibrillator therapy. J Am Heart Assoc. 2020;9(20).
68. Wang Q, Li B, Chen K, Yu F, Su H, Hu K, Liu Z, Wu G, Yan J, Su G. Machine learning-based risk prediction of malignant arrhythmia in hospitalized patients with heart failure. ESC Heart Fail. 2021.
69. Karim R, Bhagirath P, Claus P, James Housden R, Chen Z, Karimaghaloo Z, Sohn H-M, Lara Rodríguez L, Vera S, Albà X, Hennemuth A, Peitgen H-O, Arbel T, Gonzàlez Ballester MA, Frangi AF, Götte M, Razavi R, Schaeffter T, Rhode K. Evaluation of state-of-the-art segmentation algorithms for left ventricle infarct from late gadolinium enhancement MR images. Med Image Anal. 2016;30:95−107.
70. Zhuang X, Xu J, Luo X, Chen C, Ouyang C, Rueckert D, Campello VM, Lekadir K, Vesal S, RaviKumar N, Liu Y, Luo G, Chen J, Li H, Ly B, Sermesant M, Roth H, Zhu W, Wang J, Ding X, Wang X, Yang S, Li L. Cardiac segmentation on late gadolinium enhancement MRI: a benchmark study from multi-sequence cardiac MR segmentation challenge. 2021.

71. Oduneye SO, Pop M, Biswas L, Ghate S, Flor R, Ramanan V, Barry J, Celik H, Crystal E, Wright GA. Postinfarction ventricular tachycardia substrate characterization: a comparison between late enhancement magnetic resonance imaging and voltage mapping using an MR-guided electrophysiology system. IEEE Trans Biomed Eng. 2013;60(9):2442−9.
72. Klem I, Weinsaft JW, Bahnson TD, Hegland D, Kim HW, Hayes B, Parker MA, Judd RM, Kim RJ. Assessment of myocardial scarring improves risk stratification in patients evaluated for cardiac defibrillator implantation. J Am Coll Cardiol. 2012;60:408−20.
73. Komatsu Y, Cochet H, Jadidi A, Sacher F, Shah A, Derval N, Scherr D, Pascale P, Roten L, Denis A, Ramoul K, Miyazaki S, Daly M, Riffaud M, Sermesant M, Relan J, Ayache N, Kim S, Montaudon M, Laurent F, Hocini M, Haïssaguerre M, Jaïs P. Regional myocardial wall thinning at multidetector computed tomography correlates to arrhythmogenic substrate in postinfarction ventricular tachycardia: assessment of structural and electrical substrate. Circ Arrhythm Electrophysiol. 2013;6(2):342−50.
74. Ly B, Finsterbach S, Nuñez-Garcia M, Cochet H, Sermesant M. Scar-related ventricular arrhythmia prediction from imaging using explainable deep learning. In: FIMH 2021 – International Conference on Functional Imaging and Modeling of the Heart. Lecture Notes in Computer Science, vol. 12738. Stanford, United States: Springer International Publishing; June 2021. p. 461−70.
75. Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In: Proceedings – 2018 IEEE Winter Conference on Applications of Computer Vision (WACV); 2018. p. 839−47.
76. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016. p. 2921−29.
77. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: why did you say that? Visual explanations from deep networks via gradient-based localization. 2016. http://arxiv.org/abs/1610.02391.
78. Yousefi S, Amrollahi F, Amgad M, Dong C, Lewis JE, Song C, Gutman DA, Halani SH, Velazquez Vega JE, Brat DJ, Cooper LAD. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. Sci Rep. 2017;7(1):11707.
79. Ching T, Zhu X, Garmire LX. Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data. PLOS Comput Biol. 2018;14(4):1−18.
80. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol. 2018;18(24).
81. Ahmad T, Pencina MJ, Schulte PJ, O'Brien E, Whellan DJ, Piña IL, Kitzman DW, Lee KL, O'Connor CM, Felker GM. Clinical implications of chronic heart failure phenotypes defined by cluster analysis. J Am Coll Cardiol. 2014;64(17):1765−74.
82. Shah SJ, Katz DH, Selvaraj S, Burke MA, Yancy CW, Gheorghiade M, Bonow RO, Huang C-C, Deo RC. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation. 2015;131(3):269−79.
83. Awan S, Sohel F, Sanfilippo F, Bennamoun M, Dwivedi G. Machine learning in heart failure: ready for prime time. Curr Opin Cardiol. 2018;33:190−5.
84. Tripoliti EE, Papadopoulos TG, Karanasiou GS, Naka KK, Fotiadis DI. Heart failure: diagnosis, severity estimation and prediction of adverse events through machine learning techniques. Comput Struct Biotechnol J. 2017;15:26−47.
85. Ambale-Venkatesh B, Yang X, Wu CO, Liu K, Hundley WG, McClelland R, Gomes AS, Folsom AR, Shea S, Guallar E, Bluemke DA, Lima JA. Cardiovascular event prediction by machine learning. Circ Res. 2017;121(9):1092−101.
86. Vollmer S, Mateen BA, Bohner G, Király FJ, Ghani R, Jonsson P, Cumbers S, Jonas A, McAllister KSL, Myles P, Grainger D, Birse M, Branson R, Moons KGM, Collins GS, Ioannidis JPA, Holmes C, Hemingway H. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ. 2020;368.
87. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26:1364−74.

Outcome Prediction

133

6

88. Health Data Research UK, Trusted research environments and data management − past, present and future. 2021. 89. Wood A, Denholm R, Hollings S, Cooper J, Ip S, Walker V, Denaxas S, Akbari A, Banerjee A, Whiteley W, Lai A, Sterne J, Sudlow C. Linked electronic health records for research on a nationwide cohort of more than 54 million people in England: data resource. BMJ. 2021;373. 90. Lin S, Li Z, Fu B, Chen S, Li X, Wang Y, Wang X, Lv B, Xu B, Song X, Zhang Y-J, Cheng X, Huang W, Pu J, Zhang Q, Xia Y, Du B, Ji X, Zheng Z. Feasibility of using deep learning to detect coronary artery disease based on facial photo. Europ Heart J. 2020;41(46):4400−11. 91. Sinnenberg L, DiSilvestro CL, Mancheno C, Dailey K, Tufts C, Buttenheim AM, Barg F, Ungar L, Schwartz H, Brown D, Asch DA, Merchant RM. Twitter as a potential data source for cardiovascular disease research. JAMA Cardiol. 2016;1(9):1032−6. 92. Bayoumy K, Gaber M, Elshafeey A, Mhaimeed O, Dineen FA, Marvel EH, Martin SS, Muse ED, Turakhia MP, Tarakji KG, Elshazly MB. Smart wearable devices in cardiovascular care: where we are and how to move forward. Nat Rev Cardiol. 2021;1−19. 93. Li Y, Schoufour J, Wang DD, Dhana K, Pan A, Liu X, Song M, Liu G, Shin HJ, Sun Q, Al-Shaar L, Wang M, Rimm EB, Hertzmark E, Stampfer MJ, Willett WC, Franco OH, Hu FB. Healthy lifestyle and life expectancy free of cancer, cardiovascular disease, and type 2 diabetes: prospective cohort study. BMJ. 2020;368. 94. Kaptein M. Computational personalization: data science methods for personalized health. Tilburg University. 2018. 95. Wilkinson J, Arnold K, Murray E, van Smeden M, Carr K, Sippy R, de Kamps M, Beam A, Konigorski S, Lippert C, et al. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health. 2020;2(12):e677-80. 96. Hernán MA, Hsu J, Healy B. A second chance to get causal inference right: a classification of data science tasks. Chance. 2019;32(1):42−9. 97. Castro DC, Walker I, Glocker B. Causality matters in medical imaging. Nat Commun. 2020;11(3673). 98. Richens J, Lee C, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun. 2020;11(3923). 99. Corral-Acero J, Margara F, Marciniak M, Rodero C, Loncaric F, Feng Y, Gilbert A, Fernandes JF, Bukhari HA, Wajdan A, Martinez MV, Santos MS, Shamohammdi M, Luo H, Westphal P, Leeson P, DiAchille P, Gurev V, Mayr M, Geris L, Pathmanathan P, Morrison T, Cornelussen R, Prinzen F, Delhaas T, Doltra A, Sitges M, Vigmond EJ, Zacur E, Grau V, Rodriguez B, Remme EW, Niederer S, Mortier P, McLeod K, Potse M, Pueyo E, Bueno-Orovio A, Lamata P. The ‘Digital Twin’ to enable the vision of precision cardiology. Eur Heart J. 2020;41(48):4556−64.

135

Quality Control

Ilkay Oksuz, Alain Lalande and Esther Puyol-Antón

Contents
7.1 Clinical Introduction
7.2 Overview
7.3 Motion Artefact Detection
7.4 Poor Planning Detection and Automatic View Planning
7.5 Missing Slice Detection
7.6 Segmentation Failure Detection
7.7 Closing Remarks
7.8 Exercises
7.9 Tutorial—Quality Control
7.10 Opinion
References

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-05071-8_7.

Authors' contribution:
• Introduction, Opinion: AL.
• Main chapter: IO, EPA.
• Tutorial: IO.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_7


Learning Objectives
At the end of this chapter you should be able to:
O7.A Explain the need for automated quality control techniques in clinical imaging and population studies
O7.B Describe some machine learning techniques for quality control of echocardiography images
O7.C Describe the types of artefacts/problems that can occur in CMR imaging and outline some machine learning approaches for identifying or controlling them
O7.D Explain the different approaches that have been taken to achieve automated quality control of segmentations

7.1 Clinical Introduction

Cardiac imaging is a widely-used tool in current clinical practice to assess patients' clinical status, ranging from the emergency department to the follow-up of chronic disease. Quality control contributes to the development and acceptance of cardiac imaging in clinical practice, both for single exams and for large population studies. Today, CMR stands as the technique of choice to evaluate the heart after injury. However, other techniques based on ultrasound (echocardiography), X-rays (in particular CT) or radioactivity (cardiac scintigraphy) have shown their usefulness depending on the disease, the emergency status, and the availability of the technique. The ongoing improvements of these imaging techniques (in terms of quality, but also considering multi-modality acquisitions such as the different types of sequences for CMR) and the potentially dynamic nature of the acquisitions (as for echocardiography, cine CMR, dynamic CT or gated myocardial perfusion SPECT, among others) mean that most of these techniques produce a large number of images. Typically, tomographic approaches such as CMR or CT provide several hundred images, and sometimes more than one thousand.

Regardless of which imaging technique is used, visual assessment of an imaging exam still underpins many aspects of the clinical report. Nevertheless, an increasingly important part of the outcome of a cardiac imaging exam is based on objective parameters measured from the images. Among such parameters are the cavity volumes at end diastole and end systole, the EF, and the size of diseased areas showing abnormal uptake of a contrast agent. Considering the number of images in one patient exam and the level of accuracy required in these objective parameters, they are typically estimated using software with automatic or semi-automatic processing, increasingly based upon deep learning (in particular, CNNs) for segmenting the areas of interest, such as the myocardium or the cardiac cavity. Reliable quantification of these parameters is crucial, as the diagnosis obtained from the imaging exam is often defined according to the (ab)normal values these parameters take. Thus, noise, artefacts or technical pitfalls during the data acquisition are problematic. Indeed, their impact on the parameter quantification could hamper the production of the report and, at worst, potentially lead to misinterpretations. We can take as an example the estimation of RV EF from a series of CMR images in the short axis orientation. Variability among experts can be high due to the difficulties of segmenting the cardiac cavity at the basal level in end diastole and end systole, and can result in questionable conclusions. This issue is even more critical if the most basal slice is noisy or, at worst, missing. When the practitioner starts doubting the quality or the consistency of the data, the ideal solution would be to repeat the acquisition of the images. However, this decision must be taken immediately (before the end of the exam), and is sometimes not possible at all, for example when it requires injecting a contrast agent or for modalities involving ionizing radiation. Ideally, the downstream analysis of the images should be able to handle any acquired data, whatever their quality and consistency, which means that quality control is essential to prevent any confusion in the final diagnosis.

Quality control can be performed at three levels of the process: during the data acquisition, during the processing of the data (such as the segmentation of the regions of interest) and during the quantification of the clinical parameters (EF, cavity volume, percentage of diseased area, strain, etc.).

Regarding data acquisition, quality issues can be due to the patient, to the imaging technique or to the exam planning. Among the limitations due to the patient, we can cite difficulties with the patient holding their breath or the presence of a metallic object that impedes some acquisitions. These issues are difficult to address and must be considered during the rest of the imaging protocol. Another issue specific to cardiovascular imaging is the synchronization with the ECG signal. Indeed, some techniques are not real-time (such as CMR or CT), and triggering of the data acquisition with the ECG is mandatory. This triggering is generally done through the detection of the R-wave. Problems with acquisition of the ECG signal, for example due to arrhythmia or extra systoles, can lead to noisy or difficult-to-interpret images. In CMR, incorrect R-wave detection can also adversely affect the evaluation of cardiac function, because merging different parts of the cardiac cycle to create one image can lead to blurry cardiac borders, and the EF could be underestimated due to an improper diastolic or systolic volume evaluation [1]. Additional technical issues can appear depending on the technique, such as banding artefacts with SSFP-like sequences when studying cardiac function with CMR, beam-hardening artefacts for CT, or subdiaphragmatic activity for scintigraphy. For CMR, these issues can sometimes be solved by changing some settings during the exam, but the technician or the physician must react instantaneously to reach the correct setting or repeat the acquisition. These artefacts must also be detected after the fact, because they can be confused with disease-related abnormalities. For example, in CMR the dark rim artefact that can appear during imaging of the first pass of a contrast agent could be confused with an ischemic area (which corresponds to a delay in the arrival of the contrast agent). Another example is the activity in subdiaphragmatic organs in cardiac scintigraphy, which can interfere with the evaluation of myocardial perfusion.
When there is doubt in the interpretation of a cardiac scintigraphy exam, sometimes no conclusion can be reached and additional exams (often CMR) are required to confirm the diagnosis. Other types of issues can come from the exam planning, such as a wrong plane orientation, missing slices during a tomographic image acquisition (e.g. incomplete coverage of the LV or RV in short axis CMR), or missing parts of the heart in the image (e.g. part of the LV out of the imaging window during an echocardiographic acquisition in a 4-chamber view [2]). To evaluate cardiac function from CMR, complete coverage of the base of each ventricle is mandatory, because this level represents an important part of the volume of the cavity. This issue is less crucial for the apical level of the heart due to the lower volumes of this part of the cavity. In tomographic imaging, and in particular in CMR due to its rather high slice thickness (between 5 mm and 10 mm), an incorrect orientation of the plane can also be troublesome as it can lead to high partial volume effects, and therefore to an overestimation of some parameters (such as the cavity volume or the myocardial mass).

In general, the acquired raw data require post-processing, which mainly consists of the segmentation of the relevant areas. As manual processing is tedious and time-consuming, in many cases physicians examine the output of automated segmentation algorithms and correct it if necessary. Although current automatic methods (and in particular those based on CNNs) are more effective, necessitating less (or no) manual correction, there is still a regulatory requirement to review each segmentation. Currently, automatic diagnosis is not a priority area, as diagnosis is performed by the specialized physician using the imaging exam but also the clinical context. Errors in automated segmentation, as sometimes visible in the output of CNN-based approaches, must be detected and corrected (or discarded), because even if they have no significant impact on the estimated parameters, they can harm clinician trust in the automated tools. Generally speaking, in clinical practice some minor segmentation errors can be tolerated in the production of clinical reports. On the other hand, an error considered acceptable in the segmentation could still lead to misdiagnosis. In particular, the spatial resolution is a characteristic that should be considered. For example, the size of a pixel in CMR or in scintigraphy is large relative to the size of the organ. Typically, in cine CMR the in-plane resolution is around 1.5 mm, and the normal myocardial thickness at end diastole is about 10 mm. Thus, a segmentation error of around one pixel in one direction (i.e. about 15% of the wall thickness) could drastically impact the final result. This motivates the need for real-time quality control tools. A quality control process that can warn the clinician about questionable segmentations could highlight the possibility of poor quality leading to downstream problems, and hence should be a part of the automated processing of the data. In addition, quality control must also be done on the estimated clinical parameters, to prevent inconsistencies between the images and the quantified parameters. This is crucial for physicians to trust these technologies and to accelerate their adoption into clinical practice.

7.2 Overview

In this chapter we review the technical state-of-the-art in quality control (QC) in cardiac image analysis. The review focuses mainly on QC of CMR imaging, since this is a more mature field in the research literature. QC of echocardiography imaging is also extremely valuable, especially with the recent availability of large-scale public databases of annotated echocardiography images such as the EchoNet Dynamic dataset (which contains more than 10,000 annotated apical 4-chamber videos: https://echonet.github.io/dynamic/), as well as AI models trained on such datasets [3]. Although authors such as [3] have assessed the robustness of their algorithms to variable image quality, the development of models to automatically assess echocardiography image quality has been less explored. Some notable works in this area include [4], who proposed a CNN-based quality score prediction model for 4-chamber apical echocardiography. Similarly, [5] used CNNs in an automated pipeline for echocardiography video analysis (view classification, segmentation and calculation of functional metrics) and investigated the maximum probability of view assignment as a metric of video quality. Finally, [6] also proposed a CNN model for assessment of image quality in 4-chamber apical echocardiography, including different quality scores for LV foreshortening, poor contrast and off-axis images. In the following sections we review the more extensive literature on automated QC in CMR imaging, focusing on a number of different sources of poor quality in the images and/or the subsequent automatic processing: motion artefacts, poor planning, missing slices and segmentation failure.

7.3 Motion Artefact Detection

Despite the well-known advantages of CMR, numerous categories of artefacts are frequently encountered [7]. These artefacts may be related to the scanner hardware or software functionalities, environmental factors or the human body itself. For example, high magnetic field strength as well as particular sequences may exacerbate or even give rise to specific artefacts, which the reader should be aware of to avoid misinterpreting the image. These artefacts include zipper artefacts, B0-related artefacts, aliasing, metal-induced artefacts and truncation artefacts [8].

In the chest and abdomen, one of the most common types of artefact is motion-related artefacts, which can be categorized into three types: (1) mistriggering artefacts, (2) artefacts caused by physiological problems such as arrhythmia, and (3) artefacts caused by breathing motion. In CMR imaging, the effects of cardiac cycle motion are generally dealt with by triggering and gating the acquisition using an electrocardiogram (ECG) signal. However, when the ECG signal is corrupted, either due to arrhythmia or instrumental issues (i.e. mistriggering), the resulting images can exhibit low quality because of incorrect binning of k-space profiles. Similarly, breathing motion causes profiles acquired at different breathing states to be combined. Outside of these three classes, voluntary motion artefacts can also be encountered; these can be limited by clear instructions to patients or by patient sedation. Examples of breathing, arrhythmia and mistriggering artefact images are given in Fig. 7.1.

Fig. 7.1 Examples of (a) a good quality cine CMR image, (b) an image with breathing motion artefacts, (c) an image of a patient with arrhythmia and (d) an image with a mistriggering artefact. Red arrows indicate the artefact regions within the cavity

CMR images are often acquired from patients who already have cardiac diseases, and who are therefore more likely to have arrhythmias or difficulties with breath-holding or remaining still during acquisition. The images can therefore contain a range of image artefacts [9]. Misleading conclusions can be drawn when the original data are of poor quality. Traditionally, images are visually inspected by one or more experts, and those showing an insufficient level of quality are excluded from further analysis. However, such visual assessment is time-consuming and subject to inter-rater and intra-rater variability. Automatic detection of artefact images is therefore of great value for enabling subsequent processing to be successful.

Several methods have been proposed to detect motion artefacts automatically. One approach is to generate paired data (i.e. artefact and non-artefact) using synthetic artefact generation techniques and then train a machine learning model to detect the artefact images (as illustrated in Fig. 7.2). An example of generating synthetic artefacts is provided in the hands-on tutorial at the end of this chapter. In a pioneering work, [10] investigated synthetic motion artefacts and used histogram, box, line and texture features to train a random forest algorithm to detect different artefact levels. However, their algorithm was tested only on artificially corrupted synthetic data and aimed only at detecting breathing artefacts. More recently, deep CNN architectures have become a common solution for the artefact detection task.

Fig. 7.2 Overview of training a low-quality image detection model. Starting from a good-quality image dataset, a set of low-quality images is first generated following the acquisition principles of CMR. A machine learning model is then trained with the paired low- and good-quality data. The trained model can be used at test time to detect artefact images


[11] proposed to use a k-space data augmentation strategy to increase the amount and diversity of low-quality data for training a 3-D CNN (data augmentation is a technique commonly used in deep learning to boost the amount of training data by supplementing the training set with transformed versions of the existing data; common augmentations of imaging data involve translations, rotations, flipping and elastic deformations, and data augmentation has been shown to improve generalization performance). Due to the temporal nature of the artefacts, these architectures subsequently moved towards recurrent neural networks (RNNs, see Chap. 4, Sect. 4.3). For example, [12] proposed to use a long short-term memory (LSTM) model on top of features derived from 2-D CNNs to classify low-quality images.

One major shortcoming of the current state-of-the-art is the lack of techniques that can classify each artefact type separately. There is potential to correct artefacts after the scan if the artefact sub-type can be correctly classified. Most of the available algorithms output a binary label (artefact/good quality), which is used as a rejection filter at the beginning of downstream tasks (e.g. segmentation, registration, volumetric analysis) for large-scale image phenotyping. Soft predictions instead of binary filtering could be utilized as a proxy measure to increase the quality of automatic deep learning-based large cohort studies. In addition, end-to-end architectures, which can detect artefacts on the spot and analyze their influence on segmentation, could improve the quality of population studies [13]. The influence of QC on such downstream analysis is discussed further later in this chapter.
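
As a rough illustration of the synthetic-corruption strategy described above (a simplified sketch, not the exact pipeline of [11] or of the tutorial code), the following NumPy snippet simulates a mistriggering-like artefact by replacing a random subset of k-space lines of one cine frame with the corresponding lines from a temporally adjacent frame:

```python
import numpy as np

def corrupt_kspace(frame_a, frame_b, fraction=0.3, seed=None):
    """Simulate a mistriggering-like motion artefact on a 2-D cine frame.

    A random subset of k-space lines of `frame_a` is replaced with the
    corresponding lines of the temporally adjacent `frame_b`, mimicking
    the incorrect binning of k-space profiles caused by mistriggering.
    """
    rng = np.random.default_rng(seed)
    k_a = np.fft.fftshift(np.fft.fft2(frame_a))
    k_b = np.fft.fftshift(np.fft.fft2(frame_b))
    n_lines = k_a.shape[0]
    swap = rng.choice(n_lines, size=int(fraction * n_lines), replace=False)
    k_a[swap, :] = k_b[swap, :]  # lines acquired in the 'wrong' cardiac phase
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k_a)))
```

Pairs of original and corrupted frames produced in this way can then serve as good-quality/low-quality training examples for a classifier of the kind illustrated in Fig. 7.2.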

7.4 Poor Planning Detection and Automatic View Planning

CMR planning is important to ensure high quality image data and to enable accurate quantification of cardiac function. Currently, CMR acquisition requires manual intervention for setting the acquisition planes. The radiographer selects the cardiac planes at which the imaging data will be acquired, which is a skilled but subjective operation. Automatically identifying these view planes would be of great significance for fully automating CMR acquisition, and the first step in automating this process is to detect poorly planned images. The planning can also be posed as an automatic landmark-detection problem, i.e. once accurate anatomical landmarks have been identified the planning can be optimized accordingly.

One group of methods focuses on landmark localization for automatic view planning. Early works on landmark detection for view planning utilized the SVM method (see Chap. 5, Sect. 5.3) for localization of landmarks [14]. [15] proposed to utilize probabilistic boosting trees for LV segmentation and to align a mean shape (an average model of all annotations) with the test data to get an estimate of the object shape. They then deformed this pre-defined LV model to extract the landmark information for anatomical landmark detection. [16] proposed to use a 3-D CNN for detecting the landmark points (LV and RV apex, and the aortic, mitral, pulmonary, and tricuspid valves) in 4-D cardiac flow MR images. They first found a tight bounding box and then trained a 3-D CNN to detect the six landmark points automatically for planning. They validated their method in a blinded test against cardiac radiologists. For a similar task, [17] proposed a reinforcement learning (RL, see Chap. 2, Sect. 2.3) setup, in which a deep Q-learning neural network was trained to improve the performance of view planning. Their reward was defined in terms of the plane prediction: moving towards the correct plane was rewarded. The algorithm terminated when an oscillation was reached and output the predicted plane as the target plane.

One result of inaccurate planning is an 'off-axis' orientation of the 4-chamber view, often recognized by the presence of the LV outflow tract (LVOT). This can lead to difficulties in the assessment of atrial volumes and septal wall motion, whether manually by experts or by automated image analysis algorithms. For large datasets such as the UK Biobank, manual labelling is tedious, and automated analysis pipelines including automatic image quality assessment need to be developed. Oksuz et al. [18] proposed to use a shallow neural network architecture to classify the presence of the LVOT in 4-chamber images. The authors visualized the attention maps of the network to determine the focus of the proposed neural network architecture.

Detecting the view planes automatically remains a challenging task, and evaluation of the final performance is the key to achieving high precision in view planning. With the availability of accurate loss functions informed by subjective evaluation metrics, it is likely that current machine learning techniques (e.g. spatio-temporal neural networks, RL) can be trained to fully automate the view planning process in a robust way. Spatio-temporal models can incorporate cardiac motion information into the view planning process and can overcome the challenges caused by cardiac motion. RL models have the capability to learn policies for view planning given the definition of appropriate reward functions. The successful clinical translation of these methods will enable semi- (or fully-)automatic view planning for image acquisition.
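
As a toy illustration of the kind of image-level classifier used for off-axis detection (the layer sizes and training settings below are illustrative assumptions, not those of the model in [18]), a shallow CNN for LVOT-presence classification could be sketched in Keras as follows:

```python
import tensorflow as tf

def build_lvot_classifier(input_shape=(128, 128, 1)):
    """Shallow CNN that labels a 4-chamber frame as off-axis (LVOT visible)
    or correctly planned. All hyperparameters are illustrative."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(LVOT present)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```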

7.5 Missing Slice Detection

One of the most common problems in CMR databases is incomplete coverage of the heart region, and several automatic methods have been proposed for detecting missing slices. The literature has mostly focused on missing apical and basal slice detection [19]. Missing slices adversely affect the accurate calculation of the LV volume and hence the derivation of cardiac metrics such as EF. Another study [20] used Generative Adversarial Networks (GANs, see Chaps. 4, Sect. 4.3 and 5, Sect. 5.4) in a semi-supervised setting to improve the performance of missing slice detection. [21] proposed to use a decision forest approach (see Chap. 5, Sect. 5.3) for heart coverage estimation, inter-slice motion detection and image contrast estimation in the cardiac region. The authors applied their methodology to 19,265 short-axis (SA) cine stacks from the UK Biobank dataset and showed that up to 14.2% of the analysed SA stacks had suboptimal coverage. They also reported that up to 16% of the stacks were affected by noticeable inter-slice motion. [22] proposed to use Fisher discriminative and dataset-invariant 3-D CNNs to detect missing slices independently of image acquisition parameters, such as imaging device, magnetic field strength and variations in protocol execution.

To address the issue of missing slices, several methods have been proposed for imputing them. Recently, Zhang et al. proposed a missing slice imputation GAN to learn key features of cardiac SA slices across different positions and to use them as conditional variables to effectively infer missing slices in the query volumes [23]. The method maps the slices to latent vectors with position features through a regression net; the latent vector corresponding to the desired position is then projected onto the slice manifold, conditional on slice intensity, through a generator net. The results illustrate acceptable image quality in the synthesized apical and basal slices, and EF values calculated using the synthetic slices were close to those obtained from the ground truth images. The demonstrated potential of GAN-type models to generate realistic synthetic images in computer vision could be instrumental in generating missing medical images, but anatomy and disease characteristics should be modelled carefully in such machine learning frameworks to avoid misleading clinical conclusions.
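
The published detectors above are learning-based, but the underlying coverage criterion can be illustrated with a simple rule-based sketch; the thresholds and label conventions below are illustrative assumptions only:

```python
import numpy as np

def flag_incomplete_coverage(seg_stack, lv_label=1, min_slices=8):
    """Heuristic check for incomplete LV coverage in a short-axis
    segmentation stack of shape (n_slices, height, width).

    Flags a stack if the LV appears in fewer than `min_slices` slices,
    or if the LV is segmented in the outermost slice of the stack
    (suggesting the ventricle may extend beyond the imaged volume).
    """
    has_lv = np.array([(s == lv_label).any() for s in seg_stack])
    too_few = int(has_lv.sum()) < min_slices
    touches_edge = bool(has_lv[0] or has_lv[-1])
    return too_few or touches_edge
```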

7.6 Segmentation Failure Detection

Medical image segmentation is an active area of research and plays an important role in clinical practice, as we have already seen in Chap. 4. It is a necessary step for tasks such as estimation of clinical parameters, disease diagnosis (see Chap. 5) and treatment planning and guidance (see Chap. 6). QC of image segmentation is essential since segmentation quality impacts the decisions that clinicians or other downstream algorithms can make about the patient and their disease management. In the case of an automated pipeline used as part of a clinical workflow, QC of image segmentation can be used to flag to the user that a poor segmentation was obtained and requires manual review. Alternatively, several candidate segmentations could be automatically generated using different hyperparameter settings and/or algorithms, and the segmentation with the best evaluation score could be selected.

QC of image segmentation is also performed during development of the algorithms, but in this case it serves a very different purpose from the evaluation of segmentation quality after algorithm development (i.e. during deployment). During development, segmentation quality is used to compare different approaches or to optimize hyperparameter settings. The standard approach is to create ground truth (manually segmented) structures and to compare those structures with algorithm-generated segmentations in terms of overlap or boundary differences. As already reviewed in Chap. 2 (Model Validation), the two most common metrics used for evaluating segmentation quality are:
1. The Dice coefficient, which measures the degree of overlap between the two segmentations. Its value ranges between 0 and 1, with 0 denoting no overlap and 1 denoting perfect agreement.
2. The mean contour distance and/or Hausdorff distance, which evaluate the mean and the maximum distance, respectively, between two segmentation contours (a sketch of computing both metrics is given below).

Although such measures can easily be employed when developing and comparing algorithms, evaluating segmentation quality after algorithm deployment is more challenging since no ground truth is available. Several approaches have been proposed in the literature for automated QC of segmentation in the absence of a ground truth, and these can be divided into two main categories: (1) direct segmentation QC, where the generated segmentation is directly used to produce an evaluation metric, and (2) indirect segmentation QC, where some features/parameters are extracted from the generated segmentation for performing QC. We discuss each of these approaches in more detail below.
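
As a minimal sketch (assuming 2-D binary masks; the pixel spacing argument is an illustrative default), both families of metrics can be computed as follows:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(a, b):
    """Dice coefficient between two binary masks (1 = perfect overlap)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def contour_distances(a, b, spacing=(1.0, 1.0)):
    """Mean contour distance and Hausdorff (maximum) distance between
    the boundaries of two non-empty binary masks, in `spacing` units."""
    edge_a = a.astype(bool) & ~binary_erosion(a.astype(bool))
    edge_b = b.astype(bool) & ~binary_erosion(b.astype(bool))
    # Distance maps to each contour: distance_transform_edt measures the
    # distance from every pixel to the nearest contour pixel.
    dt_a = distance_transform_edt(~edge_a, sampling=spacing)
    dt_b = distance_transform_edt(~edge_b, sampling=spacing)
    d = np.concatenate([dt_b[edge_a], dt_a[edge_b]])  # symmetric distances
    return d.mean(), d.max()
```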


Direct Segmentation Quality Control

Approaches for direct evaluation of segmentation performance in the absence of manual annotations can be further subdivided into the following categories:

Learning-based quality control: This family of algorithms aims to use the outputs of AI models as confidence measures. For example, [24] used an unsupervised approach to compare the predicted segmentation to a probabilistic generative segmentation model, which produces a smooth segmentation aligned with visible contours in the image. They showed good correlation between the real and predicted Dice scores for brain tumors (r = 0.69) and the LV myocardium (r = 0.78) and demonstrated visual interpretation of the results, which could be useful for manual correction of poor cases. [25] developed an automatic segmentation refinement algorithm that detected incorrect segmentations of T1 mapping MR images. [26] proposed a deep learning-based framework that utilized multiple neural networks to integrate segmentation and quality scoring on a per-case basis, based on accurate Dice score predictions. From the multiple segmentations estimated by the neural networks, the one with the highest predicted Dice score was selected as the final segmentation. They used the proposed method to automatically segment aortic cine MR sequences from the UK Biobank imaging study and reported a mean Dice score prediction error of 0.011/0.015 and a mean absolute error in estimating lumen area of 17.6/10.5 mm2 for the ascending aorta and proximal descending aorta respectively. An example illustration (based on the method proposed by [26]) of a learning-based QC algorithm is shown in Fig. 7.3a.

Fig. 7.3 Examples of direct segmentation QC approaches. (a) Example of learning-based QC based on [26]. The original image is shown on the left, in the middle the predicted segmentations (yellow masks) from the multiple CNNs, and on the right the final optimal segmentation with the predicted Dice score. (b) Example of an optimization-based segmentation algorithm. From left to right: original image with the initial snake position in red, output segmentation also shown in red, and the evolution of the energy function over a number of iterations. (c) Example of uncertainty-based QC based on [31]. From left to right: the original image with the ground truth segmentation, the uncertainty segmentation network, the output of the network (segmentation and voxelwise uncertainty map), the QC network and finally the quality output (accepted/rejected)

Optimization-based segmentation algorithms: The aim of this type of algorithm is to explicitly optimize an objective function (or cost function) to produce the desired segmentation. The objective function is generally called an energy function, and it takes low values for good segmentations and higher values for bad segmentations. A classical example of optimization-based segmentation algorithms is active contour models [27], also known as snakes. The basic idea is to start with an initial boundary represented in the form of a closed curve or contour, and to iteratively update the representation based on internal (e.g. boundary smoothness) and external (constraints imposed by image intensities/gradients) energies. As an illustration, Fig. 7.3b shows an example of an active contour model used to segment the LV blood pool in the middle slice of a short axis CMR image. In theory, the energy of the output solution can be used as an evaluation metric for segmentation quality. However, in practice, most algorithms are designed to compare the relative energies of different segmentations and not to measure an absolute energy difference between a solution and the ground truth. One example that used optimization-based segmentation for QC was proposed by [28], who used a hybrid approach between optimization-based segmentation and learning-based QC for automated quality scoring of segmentations. They derived 42 hand-crafted features from multiple optimization-based segmentations, and then trained a regression algorithm to predict the conventional segmentation error with respect to a known ground truth. They achieved an accuracy of 85% in detecting segmentation failures in lung CT and 54% in liver CT.
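
As a minimal sketch of an optimization-based approach, using scikit-image's implementation of snakes (the circular initialization and energy weights are illustrative assumptions):

```python
import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def snake_segment(image, center, radius, n_points=200):
    """Fit an active contour around a roughly circular structure, e.g.
    the LV blood pool in a mid-ventricular short-axis slice.

    `center` is (row, col) and `radius` is in pixels; the snake starts
    as a circle and is iteratively deformed under internal (smoothness)
    and external (image gradient) energies.
    """
    s = np.linspace(0, 2 * np.pi, n_points)
    init = np.column_stack([center[0] + radius * np.sin(s),
                            center[1] + radius * np.cos(s)])
    # Smoothing widens the capture range of the external (edge) energy
    smoothed = gaussian(image, sigma=3, preserve_range=True)
    return active_contour(smoothed, init, alpha=0.015, beta=10, gamma=0.001)
```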


Uncertainty-based quality control: The main aim of these approaches is not only to estimate a segmentation but also to generate a voxelwise uncertainty map for the predicted segmentation. Voxelwise uncertainty maps typically reflect the confidence level of the predicted class label per voxel. In that sense, uncertainty estimates provide additional information on a method's prediction and might be employed in various ways, e.g. as visual feedback, to guide corrections via segmentation error localization, or for segmentation failure detection at the patient level. [29] proposed to use a fully convolutional neural network (see Chaps. 4, Sect. 4.3 and 5, Sect. 5.4) for automated whole-brain segmentation that, in addition to the segmentation prediction, also estimated a voxelwise model uncertainty map. From the uncertainty map they estimated four structure-wise uncertainty values, which were highly correlated with the Dice score and therefore could be used to predict segmentation accuracy in the absence of ground truth manual annotations. [30] proposed a deep learning-based framework to segment the LV and RV in CMR images that also produced voxelwise uncertainty maps, which can be used for visual assessment of the quality of the predicted segmentations. Similarly, [31] proposed to use the PhiSeg network [32] to segment the main cavities of the heart in native ShMOLLI T1 mapping images and to obtain voxelwise uncertainty maps. The key contribution of this work was to incorporate two QC steps to automatically detect inaccurate cases. The first step used the ELBO (Evidence Lower Bound) output of the network, which quantifies how likely it is that the segmentation is correct, and can detect very unlikely cases. The second step was a classification network that classified each pair of predicted segmentation and voxelwise uncertainty map as accurate or inaccurate. Figure 7.3c shows an illustration of this pipeline.
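
As a toy sketch of how voxelwise and structure-wise uncertainty measures can be derived from (e.g. Monte Carlo-averaged) softmax outputs; this illustrates the general idea rather than any of the specific published methods:

```python
import numpy as np

def voxelwise_entropy(probs, eps=1e-8):
    """Predictive entropy per voxel from class probabilities of shape
    (n_classes, height, width), e.g. softmax outputs averaged over
    several Monte Carlo dropout passes."""
    return -np.sum(probs * np.log(probs + eps), axis=0)

def structure_uncertainty(probs, label):
    """Mean entropy inside the predicted mask of one structure; summary
    values of this kind have been reported to correlate with Dice."""
    mask = probs.argmax(axis=0) == label
    return float(voxelwise_entropy(probs)[mask].mean()) if mask.any() else 0.0
```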

Indirect Segmentation Quality Control

Indirect approaches for segmentation QC can be further subdivided into the following categories:

Registration-based quality control: These methods perform image registration between the test image and a set of pre-selected template images with known segmentations. A quality metric can then be calculated by referring to the segmentations of these template images transformed using the registration results. In other words, the segmentation of the image with unknown ground truth is compared to that of multi-atlas segmentations, and a smaller difference between these segmentations is assumed to reflect higher segmentation quality. (Atlas-based segmentation was a popular approach before the rise to prominence of deep learning-based segmentation models. The 'atlas' consists of an example image with an associated ground truth segmentation; a new image is segmented by registering it to the atlas image, then transforming the atlas segmentation to the new image using the resulting displacement field. Multi-atlas segmentation extends this approach to multiple image/segmentation pairs.) In one of the earliest such works, the common agreement strategy (STAPLE) was used for comparing the relative performance of seven different models for the task of segmenting brain scans into white matter, grey matter and cerebrospinal fluid [33]. In this case, the different segmentation results were treated as plausible references, and were evaluated using STAPLE and the concept of common agreement. In a later work, [34] proposed the concept of reverse classification accuracy to predict segmentation quality and achieved good performance on a large-scale CMR dataset. More recently, [35] proposed to use a 3-D U-net-like architecture to segment the complete ventricular system in the fluid-attenuated inversion recovery (FLAIR) sequence. The ventricle segmentation was then used to assess registration quality by comparing it to the ventricles of the atlas propagated to the target image space. An advantage of this method compared to previous methods is that it can be used not only to flag or discard erroneous registrations, but also to select the best registration. In general, registration-based QC methods can be computationally expensive due to the cost of multiple image registrations, although this could potentially be reduced by using GPU acceleration and learning-based registration tools. Figure 7.4a shows an example of using reverse classification accuracy [34] to estimate the predicted segmentation accuracy on cine short axis CMR images. A toy sketch of the agreement assumption underlying these methods is given below.
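
The sketch below illustrates only the agreement assumption (it is not the full reverse classification accuracy method of [34]); the propagated atlas segmentations are assumed to have been produced by a separate registration step:

```python
import numpy as np

def agreement_score(test_seg, propagated_atlas_segs):
    """Mean Dice overlap between the segmentation under review and a set
    of atlas segmentations already registered to the test image space.
    Higher agreement is taken as a proxy for higher segmentation quality."""
    t = np.asarray(test_seg, dtype=bool)
    scores = []
    for seg in propagated_atlas_segs:
        s = np.asarray(seg, dtype=bool)
        denom = t.sum() + s.sum()
        scores.append(2.0 * np.logical_and(t, s).sum() / denom if denom else 1.0)
    return float(np.mean(scores))
```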

Fig. 7.4 Examples of indirect segmentation quality control approaches. (a) Example of registration-based QC using reverse classification accuracy (RCA) based on [34]. The red dotted box shows the original image and the predicted segmentation generated by the segmentation network. The green box shows the reference database with the reference segmentations (yellow masks) and the predicted segmentations (red masks). The mask with the best segmentation score is selected as the final segmentation and its Dice score with the original predicted mask is used as an estimate of the segmentation accuracy. (b) Example of biomarker-based QC based on [39]. From left to right: original image with ground truth segmentation, deep learning network that rejects images with low image quality, segmentation network, output of the segmentation network and estimate of the volume traces, deep learning-based classifier to detect abnormal traces, and decision of acceptance/rejection of the predicted segmentation

Biomarker-based quality control: Instead of using the predicted segmentation directly as input to a machine learning model, this family of methods extracts multiple image-derived features and uses these as input to a QC classifier. For example, [36] proposed a machine learning approach to automatically identify problematic images based on a wide range of imaging-derived metrics. To detect incorrect cases they used three classifiers, whose outputs were combined through a voting system based on the a posteriori probabilities of the different classifiers using the 'Minimum Probability' combination rule. [21, 37] proposed a pipeline that was able to estimate heart coverage, inter-slice motion and cardiac image contrast for short axis cine CMR stacks in the UK Biobank database. [38] proposed a framework for producing cardiac image segmentation maps that are guaranteed to respect pre-defined anatomical criteria. To this end, they defined a set of short and long axis anatomical criteria to identify any segmentation with anatomically implausible results, and warped these results toward the closest anatomically valid cardiac shape. They tested the proposed framework on short-axis CMR as well as apical 2- and 4-chamber view echocardiography images. The cardiac shape warping acts as a post-processing operator which is independent of the segmentation network, and could be applied to any segmentation method that might generate anatomically erroneous segmentation maps. Therefore, this framework not only detects incorrect segmentations but also attempts to correct them by converting the segmentation output into the closest anatomically plausible cardiac shape.

A more complete QC pipeline was proposed by [39]. This work described a deep learning segmentation-based pipeline for quantification of cardiac function from short-axis and 2- and 4-chamber long-axis cine CMR stacks. Several QC measurements were used to assess the quality of the segmentations.

First, they evaluated the orientation of the images, the presence of missing slices, and the coverage of the segmentations over the heart using an automatic algorithm. Specifically, they assessed whether the long axis images intersected the mitral valve and apex in the short axis, whether the short axis stack covered the full length of the long axis segmentation, and whether the long axis segmentation reached a similar level to the short axis segmentation and vice versa. Next, they used an SVM classifier (see Chap. 5, Sect. 5.3) to detect abnormalities in the obtained volume curves and strain curves. Finally, they introduced a set of rules based on clinical knowledge for the detection of unrealistic output parameters. This work was significant in proposing an automated cine CMR analysis tool that included comprehensive QC designed to detect erroneous results for clinician review, allowing fully autonomous processing of CMR exams. Figure 7.4b shows a schematic of the proposed pipeline, which includes two QC steps that reject images with insufficient quality or erroneous outputs. A toy sketch of the final rule-based plausibility check is given below.
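
The parameter names and ranges in the sketch below are illustrative assumptions only, not the validated clinical rules of [39]:

```python
def check_plausibility(params):
    """Flag estimated biomarkers that fall outside broad plausible ranges.

    `params` maps parameter names to values, e.g. {"lv_ef": 58.0}.
    Returns a list of human-readable warnings for clinician review."""
    plausible = {
        "lv_ef": (10.0, 85.0),      # ejection fraction, %
        "lv_edv": (50.0, 500.0),    # end-diastolic volume, ml
        "myo_mass": (30.0, 400.0),  # myocardial mass, g
    }
    return [f"{name}={value:.1f} outside plausible range {plausible[name]}"
            for name, value in params.items()
            if name in plausible
            and not plausible[name][0] <= value <= plausible[name][1]]
```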

7.7 Closing Remarks

We have seen that much work has been carried out on QC in cardiac imaging. We have also seen that poor quality can relate to the images themselves or to failures in subsequent automated processing, such as segmentation. Poor quality can be caused by different problems in the acquisition and processing of the images, and it is likely that separate methods will be required to assess different aspects of quality. But the presence of adequate methods for QC will likely be an essential prerequisite for clinical translation of automated AI-based pipelines in cardiology.

Of particular note here is the fact that most work to date has been on a posteriori QC, i.e. assessing the quality of images after they have been acquired. This has been by far the most common approach investigated to date, but it should be remembered that a more ambitious, and potentially transformative, aim would be to integrate such QC steps into the acquisition itself. Some promising preliminary work has already been performed with this aim in mind, e.g. to create an "active" acquisition process for CMR in which acquisition proceeds only until there are sufficient k-space data to reconstruct a good quality image [40]. If successful, such techniques could serve not only to ensure the quality of images and downstream analyses, but also to speed up the acquisition process.

Next, we present a few exercise questions to help readers to self-assess their learning from this chapter. Following this, a tutorial is presented to enable interested readers to gain practical experience in the development and evaluation of a simple model for QC in CMR.

7.8 Exercises

Exercise 1
What factors can cause artefacts or poor quality in CMR images, and how could this impact subsequent use of these images, for example in segmentation and/or biomarker estimation?

Exercise 2
How can machine learning help to identify problems with view planning, or even automate this process?

Exercise 3
Why can it be important to identify failures in automated segmentation? Describe two different types of approach for achieving segmentation quality control.

Exercise 4
A research team is developing a tool for automated calculation of cardiac functional parameters, with the objective of characterizing the ageing process of the heart. The pipeline consists of a deep learning-based segmentation model followed by automated analysis of the resulting time series of segmentations. The tool is intended for use in analysing large-scale databases of CMR images. What advice would you give the team about the need for quality control in this pipeline?

Exercise 5
A colleague argues that automated quality control is not important for the use of AI in clinical cardiology imaging (i.e. as opposed to large-scale population studies). Do you agree?

7.9 Tutorial—Quality Control

Tutorial 6
As for the other notebooks, the contents of this notebook are accessible as Electronic Supplementary Material.

Overview
In this hands-on tutorial, you will study automatic CMR motion artefact detection with deep neural networks. More specifically, you will apply two state-of-the-art neural network architectures (ResNet50 and DenseNet121) to classify low-quality CMR images. The material is based on a recently published study [12] and a GitHub repository (https://github.com/canerozer/binary_quality_classification), as well as CMR images from the open-access ACDC dataset [41].

The low-quality cases are generated synthetically using k-space corruption that follows the original CMR acquisition process. The tutorial focuses on the testing phase of the (pre-trained) networks, and formulates quality control as a binary classification problem (i.e. low- vs. high-quality images). The figure below shows one low-quality sample and one high-quality sample, and the confusion matrix estimated for one of the two networks that you will test.

Objectives
• Gain some practice in automatic image quality assessment based on the material from this chapter.
• Use a pre-trained model based on the classical ResNet architecture to classify low-quality CMR images.
• Analyze the results of automatic image classification and, in particular, quantify model performance.

Computing Requirements
As for the other hands-on tutorials, this notebook starts with a brief "System setting" section, which imports the necessary packages, installs any that are missing, and imports our own modules. Specific to this tutorial, you will need the tensorflow library, tailored for numerical computing with deep learning models.
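
As an assumed outline of the testing phase (not the exact notebook code; the weights file and the preprocessing it expects are hypothetical), the evaluation could look like this:

```python
import numpy as np
import tensorflow as tf

def evaluate_quality_classifier(weights_path, images, labels):
    """Load a ResNet50-based binary quality classifier and compute its
    confusion matrix on a test set.

    `images` has shape (n, height, width, 3) and `labels` are 0 (low
    quality) or 1 (high quality); `weights_path` is a hypothetical file
    of pre-trained weights matching this architecture."""
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=None,
        input_shape=images.shape[1:], pooling="avg")
    model = tf.keras.Sequential(
        [backbone, tf.keras.layers.Dense(1, activation="sigmoid")])
    model.load_weights(weights_path)
    preds = (model.predict(images).ravel() > 0.5).astype(int)
    cm = np.zeros((2, 2), dtype=int)  # rows: true label, cols: prediction
    for t, p in zip(labels, preds):
        cm[t, p] += 1
    return cm
```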

7.10 Opinion

Carefully considering the quality of the acquired data for a cardiac imaging exam is not a new issue, but it is becoming increasingly important in the context of automated processing, with requirements on the accuracy of results and the need to process large amounts of data. AI techniques will inevitably improve clinical practice, but ensuring good quality is important for these new methods to be accepted by end users. The evaluation of image quality and the detection of artefacts are crucial aspects of quality control. Improving image quality with deep learning is not limited to academic research, as manufacturers now incorporate such techniques in their commercial solutions; nowadays this is particularly the case for CT [42] and for CMR. Moreover, the detection of artefacts is crucial when they can impact downstream processes or are not easily visible. We can imagine in the near future the development of techniques for real-time detection of artefacts that could automatically correct the acquisition settings. Such upcoming developments could partially solve the problem of noise and artefacts in the images. Similar processes could be conceived to handle lower-quality ECGs (e.g. noisy signals typically observed in a 3T magnet environment or in the presence of extra systoles).

Other issues that can appear during the data acquisition could be managed by AI. Consider the definition of the imaging plane. Some imaging techniques such as CMR require expertise to localize this plane. This is also the case for echocardiography, as the sonographer directly controls the plane localization (which is why this exam is considered more operator-dependent). Automating the acquisition in echocardiography depending on the patient is a huge challenge that could be tackled by deep learning in the longer term. Plane localization is also important during CMR, as no correction is possible after the exam (contrary to CT). Technicians with high expertise in these types of exam are essential, otherwise crucial errors such as missing parts of the heart or wrong plane orientation can appear. However, specialized technicians are not present in all clinical centres, which could hamper the use of CMR in a broad spectrum of situations. Deep learning-based planning of the exam could be a solution. More specifically, most diseases require a standard protocol that could be applied directly without any modification. However, the definition of the acquired planes remains complex, requires advanced knowledge of the anatomy and the technique, and should be specific to each patient. The underlying goal would be to automatically define the acquisition planes for all the sequences used in the exam. The main difficulty is to accurately define the principal orientation planes of the heart from the scout images. In this sense, AI can contribute greatly to the democratization of the management of CMR.

Missing data is another issue that could be partially addressed. This issue could consist of missing slices in tomographic images, or part of the heart being outside the image (in case of a wrong definition of the region of interest). Detecting this missing information during the quality control procedure would warn the physician about potentially incorrect values of the estimated parameters. Furthermore, AI methods could automatically generate the missing data in the exam. For example, generative adversarial networks (GANs) could be used to create synthetic realistic slices, although this option may be questionable as these algorithms can be seen as black-box probabilistic models whose behavior is not fully understood, and may not be acceptable to clinicians [43]. In addition, this brings ethical issues to the table: should synthetic data generated by GANs, even if they are trained with real patient data, be considered part of the patient's health information on the same level as raw data?

Objective clinical parameters are increasingly considered in the diagnosis, and recording them in the report is becoming more and more prevalent. Therefore, automatic segmentation and quantification of these parameters is playing an increasingly important role globally (see Chap. 4). Today, deep learning approaches outperform other state-of-the-art methods for segmentation and classification tasks. Their results are generally impressive, and it is clear that these approaches will soon be integrated into most of the available commercial software.


However, incorrect segmentation can still occur despite carefully designed neural networks. Sometimes, even if the segmentation is not perfect, results can be acceptable and recorded in a report. Nonetheless, validation of this process by a clinician is mandatory, for both practical and ethical reasons. During validation, aberrant segmentations must be discarded or corrected, meaning that quality control is inherent to this process. Generic quality measures such as the Dice index are relevant for assessing the performance of an automatic segmentation method from an image processing point of view. However, some of them may not be relevant in a clinical setting where no ground truth exists. In addition, these measures may not always be exhaustive for a given cardiac imaging exam, and may need to be complemented by additional indices specific to the imaging technique considered.

Finally, quality control at the level of the estimated biomarkers is also possible, and consists first of checking whether the output values are plausible. For conventional exams, in the absence of complex disease, specialized physicians generally have an idea of the range of measurements to be expected. Automating the verification of the results needs to mimic this process, by first detecting abnormal results and then warning the physician if results do not correspond to the studied pathology. Deep learning approaches are not mandatory if one only considers the estimated values, as such a verification is very simple. However, if one needs to relate the pathology to the estimated measurements, deep learning approaches may enable more efficient quality control. Considering a set of metrics and managing the interactions between them can become complex, and deep learning approaches could assist the physician in this task. Moreover, as specialized physicians are not always available in all clinical centres, deep learning could help include these objective metrics in the diagnostic pipeline.

Taking a holistic view, although data processing is becoming more and more automatic, data management and the checking of results remain among the physician's missions. However, if the exam acquisition becomes faster (in particular for tomography) while the time taken to validate the results remains similar, the cardiologist will drastically reduce the time spent with each patient. Thus, automatic quality control of the exam, with alerts where necessary, will reduce the cumbersome time each cardiologist spends checking exams, time which can in turn be used in a more useful manner, such as interacting with the patient. Deep learning in cardiac imaging, and more specifically during quality control, should not aim to replace the physician, but to make their lives easier.

Acknowledgements
IO was supported by the 2232 International Fellowship for Outstanding Researchers Program of TUBITAK (Project No: 118C353). However, the entire responsibility of the work belongs to the authors. EPA was supported by the EPSRC (EP/R005516/1) and by core funding from the Wellcome/EPSRC Centre for Medical Engineering (WT 203148/Z/16/Z).


References


1. Lalande A, Salvé N, Comte A, Jaulent M, Legrand L, Walker P, Cottin Y, Wolf J, Brunotte F. Left ventricular ejection fraction calculation from automatically selected and processed diastolic and systolic frames in short-axis cine-MRI. J Cardiovasc Magn Reson. 2004;6:817−27.
2. Leclerc S, Smistad E, Pedrosa J, Østvik A, Cervenansky F, Espinosa F, Espeland T, Berg EAR, Jodoin PM, Grenier T, Lartizien C, D'hooge J, Lovstakken L, Bernard O. Deep learning for segmentation using an open large-scale dataset in 2D echocardiography. IEEE Trans Med Imaging. 2019;38(9):2198−210.
3. Ouyang D, He B, Ghorbani A, Yuan N, Ebinger J, Langlotz CP, Heidenreich PA, Harrington RA, Liang DH, Ashley EA, Zou JY. Video-based AI for beat-to-beat assessment of cardiac function. Nature. 2020;580:252−6.
4. Abdi AH, Luong C, Tsang T, Allan G, Nouranian S, Jue J, Hawley D, Fleming S, Gin K, Swift J, Rohling R, Abolmaesumi P. Automatic quality assessment of echocardiograms using convolutional neural networks: feasibility on the apical four-chamber view. IEEE Trans Med Imaging. 2017;36(6):1221−30.
5. Zhang J, Gajjala S, Agrawal P, Tison GH, Hallock LA, Beussink-Nelson L, Lassen MH, Fan E, Aras MA, Jordan C, Fleischmann KE, Melisko M, Qasim A, Shah SJ, Bajcsy R, Deo RC. Fully automated echocardiogram interpretation in clinical practice. Circulation. 2018;138(16):1623−35.
6. Labs R, Vrettos A, Azarmehr N, Howard J, Shun-shin M, Cole G, Francis D, Zolgharni M. Automated assessment of image quality in 2D echocardiography using deep learning. Intelligent Medicine. 2022. In press.
7. Klinke V, Muzzarelli S, Lauriers N, Locca D, Vincenti G, Monney P, Lu C, Nothnagel D, Pilz G, Lombardi M, et al. Quality assessment of cardiovascular magnetic resonance in the setting of the European CMR registry: description and validation of standardized criteria. J Cardiovasc Magn Reson. 2013;15(1):55.
8. Alfudhili K, Masci PG, Delacoste J, Ledoux J-B, Berchier G, Dunet V, Qanadli SD, Schwitter J, Beigelman-Aubry C. Current artefacts in cardiac and chest magnetic resonance imaging: tips and tricks. Br J Radiol. 2016;89(1062):20150987.
9. Ferreira PF, Gatehouse PD, Mohiaddin RH, Firmin DN. Cardiovascular magnetic resonance artefacts. J Cardiovasc Magn Reson. 2013;15(1):41.
10. Lorch B, Vaillant G, Baumgartner C, Bai W, Rueckert D, Maier A. Automated detection of motion artefacts in MR imaging using decision forests. J Med Eng. 2017;2017.
11. Oksuz I, Ruijsink B, Puyol-Antón E, Bustin A, Cruz G, Prieto C, Rueckert D, Schnabel JA, King AP. Deep learning using k-space based data augmentation for automated cardiac MR motion artefact detection. In: International conference on medical image computing and computer-assisted intervention. Springer; 2018. p. 250−58.
12. Oksuz I, Ruijsink B, Puyol-Antón E, Clough JR, Cruz G, Bustin A, Prieto C, Botnar R, Rueckert D, Schnabel JA, et al. Automatic CNN-based detection of cardiac MR motion artefacts using k-space data augmentation and curriculum learning. Med Image Anal. 2019;55:136−47.
13. Oksuz I, Clough JR, Ruijsink B, Anton EP, Bustin A, Cruz G, Prieto C, King AP, Schnabel JA. Deep learning-based detection and correction of cardiac MR motion artefacts during reconstruction for high-quality segmentation. IEEE Trans Med Imaging. 2020;39(12):4001−10.
14. Sundararajan R, Patel H, Shanbhag D, Vaidya V. An SVM based approach for cardiac view planning. 2014. arXiv:1407.3026.
15. Lu X, Jolly M-P, Georgescu B, Hayes C, Speier P, Schmidt M, Bi X, Kroeker R, Comaniciu D, Kellman P, et al. Automatic view planning for cardiac MRI acquisition. In: International conference on medical image computing and computer-assisted intervention. Springer; 2011. p. 479−86.
16. Lê M, Lieman-Sifry J, Lau F, Sall S, Hsiao A, Golden D. Computationally efficient cardiac views projection using 3D convolutional neural networks. In: Deep learning in medical image analysis and multimodal learning for clinical decision support (DLMIA 2017 and ML-CDS 2017, held in conjunction with MICCAI 2017). Lecture notes in computer science, vol. 10553. Springer; 2017. p. 109−16.
17. Alansary A, Folgoc LL, Vaillant G, Oktay O, Li Y, Bai W, Passerat-Palmbach J, Guerrero R, Kamnitsas K, Hou B, McDonagh SG, Glocker B, Kainz B, Rueckert D. Automatic view planning with multi-scale deep reinforcement learning agents. In: Medical image computing and computer assisted intervention − MICCAI 2018. Lecture notes in computer science, vol. 11070. Springer; 2018. p. 277−85.
18. Oksuz I, Ruijsink B, Puyol-Antón E, Sinclair M, Rueckert D, Schnabel JA, King AP. Automatic left ventricular outflow tract classification for accurate cardiac MR planning. In: IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE; 2018. p. 462−65.
19. Zhang L, Gooya A, Dong B, Hua R, Petersen SE, Medrano-Gracia P, Frangi AF. Automated quality assessment of cardiac MR images using convolutional neural networks. In: Simulation and synthesis in medical imaging (SASHIMI 2016, held in conjunction with MICCAI 2016). Lecture notes in computer science, vol. 9968; 2016. p. 138−45.
20. Zhang L, Gooya A, Frangi AF. Semi-supervised assessment of incomplete LV coverage in cardiac MRI using generative adversarial nets. In: Simulation and synthesis in medical imaging (SASHIMI 2017, held in conjunction with MICCAI 2017). Lecture notes in computer science, vol. 10557. Springer; 2017. p. 61−8.
21. Tarroni G, Oktay O, Bai W, Schuh A, Suzuki H, Passerat-Palmbach J, de Marvao A, O'Regan DP, Cook S, Glocker B, Matthews PM, Rueckert D. Learning-based quality control for cardiac MR images. IEEE Trans Med Imaging. 2019;38(5):1127−38.
22. Zhang L, Pereañez M, Piechnik SK, Neubauer S, Petersen SE, Frangi AF. Image quality assessment for population cardiac magnetic resonance imaging. In: Deep learning and convolutional neural networks for medical imaging and clinical informatics. Springer; 2019. p. 299−321.
23. Zhang L, Pereañez M, Bowles C, Piechnik S, Neubauer S, Petersen S, Frangi A. Missing slice imputation in population CMR imaging via conditional generative adversarial nets. In: International conference on medical image computing and computer-assisted intervention. Springer; 2019. p. 651−59.
24. Audelan B, Delingette H. Unsupervised quality control of image segmentation based on Bayesian learning. In: International conference on medical image computing and computer-assisted intervention. Springer; 2019. p. 21−9.
25. Fahmy A, El-Rewaidy H, Nezafat M, Nakamori S, Nezafat R. Automated analysis of cardiovascular magnetic resonance myocardial native T1 mapping images using fully convolutional neural networks. J Cardiovasc Magn Reson. 2019;21(1):7.
26. Hann E, Biasiolli L, Zhang Q, Popescu IA, Werys K, Lukaschuk E, Carapella V, Paiva JM, Aung N, Rayner JJ, et al. Quality control-driven image segmentation towards reliable automatic image analysis in large-scale cardiovascular magnetic resonance aortic cine imaging. In: International conference on medical image computing and computer-assisted intervention. Springer; 2019. p. 750−58.
27. Kass M, Witkin A, Terzopoulos D.
Snakes: active contour models. Int J Comput Vis. 1988; 1(4):321− 31. Kohlberger T, Singh V, Alvino C, Bahlmann C, Grady L. Evaluating segmentation error without ground truth. In: International conference on medical image computing and computer-assisted intervention. Springer; 2012. p. 528−36. Roy AG, Conjeti S, Navab N, Wachinger C, Initiative ADN, et al. Bayesian QuickNAT: model uncertainty in deep whole-brain segmentation for structure-wise quality control. NeuroImage. 2019; 195:11−22. Sander J, de Vos BD, Wolterink JM, Iš gum I. Towards increased trustworthiness of deep learning segmentation methods on cardiac MRI. In: Medical imaging 2019: image processing, vol. 10949. International society for optics and photonics, 2019. p. 1094919. Puyol-Antón E, Ruijsink B, Baumgartner CF, Sinclair M, Konukoglu E, Razavi R, King AP. Automated quantification of myocardial tissue characteristics from native T1 mapping using neural

156

32.

33. 34.

35.

36.

7

37.

38.

39.

40.

41. 42. 43.

I. Oksuz et al.

networks with Bayesian inference for uncertainty-based quality-control. J Cardiovasc Magn Reson. 2020; 22:60. Baumgartner CF, Tezcan KC, Chaitanya K, Hötker AM, Muehlematter UJ, Schawkat K, Becker AS, Donati O, Konukoglu E. PHiSeg: capturing uncertainty in medical image segmentation. In: Shen D, Liu T, Peters TM, Staib LH, Essert C, Zhou S, Yap P-T, Khan A, editors. Medical image computing and computer assisted intervention - MICCAI 2019. Cham: Springer International Publishing; 2019. p. 119−27. Bouix S, Martin-Fernandez M, Ungar L, Nakamura M, Koo M-S, McCarley RW, Shenton ME. On evaluating brain tissue classifiers without a ground truth. Neuroimage. 2007; 36(4):1207−24. Robinson R, Valindria VV, Bai W, Oktay O, Kainz B, Suzuki H, Sanghvi MM, Aung N, Paiva JM, Zemrak F, et al. Automated quality control in image segmentation: application to the UK Biobank cardiovascular magnetic resonance imaging study. J Cardiovasc Magn Reson. 2019; 21(1):18. Dubost F, de Bruijne M, Nardin M, Dalca AV, Donahue KL, Giese A-K, Etherton MR, Wu O, de Groot M, Niessen W, et al. Multi-atlas image registration of clinical data with automated quality assessment using ventricle segmentation. Medical image analysis, 2020. p. 101698. Alfaro-Almagro F, Jenkinson M, Bangerter NK, Andersson JL, Griffanti L, Douaud G, Sotiropoulos SN, Jbabdi S, Hernandez-Fernandez M, Vallee E, et al. Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. Neuroimage. 2018; 166:400−24. Tarroni G, Bai W, Oktay O, Schuh A, Suzuki H, Glocker B, Matthews PM, Rueckert D. Largescale quality control of cardiac imaging in population studies: application to UK Biobank. Sci Rep. 2020; 10(1):1−11. N. Painchaud, Y. Skandarani, T. Judge, O. Bernard, A. Lalande, and P. Jodoin, Cardiac segmentation with strong anatomical guarantees, IEEE Transactions on Medical Imaging, pp. 1, 2020;39:3703−13. Ruijsink B, Puyol-Antón E, Oksuz I, Sinclair M, Bai W, Schnabel JA, Razavi R, King AP. Fully automated, quality-controlled cardiac analysis from CMR: validation and large-scale application to characterize cardiac function. JACC: Cardiovasc Imaging. 2020; 13(3):684−95. Machado I, Puyol-Anton E, Hammernik K, Cruz G, Ugurlu D, Ruijsink B, Castelo-Branco M, Young A, Prieto C, Schnabel JA, King AP. Quality-aware cine cardiac MRI reconstruction and analysis from undersampled K-space data. In: Proceedings of the workshop on statistical atlases and computational modelling of the heart (STACOM), 2021. ACDC challenge website. 7 https://www.creatis.insa-lyon.fr/Challenge/acdc/. Singh R, Weiwen W, Wang G. Artificial intelligence in image reconstruction: the change is here. Phys Med. 2020; 79:113−25. Skandarani Y, Lalande A, Afilalo J, Jodoin P. Generative adversarial networks in cardiology. Can J Cardiol. 2022; 38:196−203.

157

AI and Decision Support

Mariana Nogueira and Bart Bijnens

Contents
8.1 Introduction
8.2 What Does AI Bring to the Table to Support the Clinician?
8.3 Current Challenges and the Importance of Interpretability
8.4 Addressing Challenges With Interpretable AI—The Potential of Representation Learning
8.5 Closing Remarks
References

Authors' contribution: • Main chapter: MN, BB

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_8


Learning Objectives
At the end of this chapter you should be able to:
O8.A Explain why it is important to understand routine medical reasoning to successfully introduce and integrate AI into clinical practice
O8.B Explain how AI can assist clinical decision making in the context of evidence- and eminence-based knowledge
O8.C Explain the importance of interpretability in AI-based decision support
O8.D Justify the relevance of representation learning and unsupervised learning as an alternative or a complement to supervised learning
O8.E Explain how multiview machine learning can enable multiple sources of data to be considered in phenogrouping patients

8.1 Introduction

Routine cardiovascular care involves the acquisition, processing and interpretation of large amounts of data (clinical, lab, imaging, signal, etc.) for continuous decision making. None of these tasks is trivial: on the one hand, good-quality information acquisition and processing (e.g. segmentation/delineation, estimating measurements) require highly-trained staff and can be cumbersome, time-consuming processes. On the other hand, the integration and interpretation of all available information can be extremely challenging for clinicians, especially less experienced ones. The appeal of AI in cardiology resides in its potential to positively impact all of these processes, be it by easing, optimising or automating cumbersome manual tasks, or by supporting interpretation and decision making. Although classical machine learning techniques are still represented, deep learning, in all its variants, has taken over the recent literature, having proven to deliver unprecedented performance for many tasks. However, proposed approaches are often perceived by clinicians as hardly-interpretable black boxes, offering limited insight into the "reasoning" that maps a certain input to a specific output. While a lack of easy interpretability is not necessarily a serious obstacle for clinical use when supporting/replacing manual tasks, it can definitely be a deal-breaker in interpretation and decision support tasks. This chapter focuses on real-life clinical decision support, providing an overview of considerations relevant for AI: what does AI bring to the table, what are some of the main current challenges, and how can some of these challenges be addressed?

8.2 What Does AI Bring to the Table to Support the Clinician?

Before reflecting on how AI can contribute to the field of clinical decision support, let us discuss the intrinsic complexity of clinical decision making, and the role of evidence-based knowledge in the process.


Fig. 8.1 Clinical decisions result from a complex weighing of eminence- and evidence-based knowledge. In both approaches, the basis for decision making comes down to comparing and positioning the patient with regard to a previous, known population. In the context of decision support, evidence-based knowledge is commonly explored in the form of guidelines, building upon evidence found in clinical studies/trials

A clinical decision is typically the result of a complex weighing of evidence- and eminence-based knowledge (see Fig. 8.1). Eminence-based knowledge refers to the clinician's own experience: in order to make the best decision for a specific patient, the clinician takes into account knowledge of the evolution and outcomes of previous similar cases, learned from studying, interactions with peers or personal experience. An obvious issue with a purely eminence-based approach to decision making is that it is highly subjective and limited by the clinician's experience and training. Having objective, evidence-based knowledge available to support decision making is thus of major importance, especially for less experienced clinicians. In the context of decision support, evidence-based knowledge is commonly provided in the form of practice guidelines, building upon evidence found in clinical studies/trials. When resorting to such guidelines, the clinician is also, in a way, positioning the patient with regard to a previous population/cohort. So, in both eminence- and evidence-based approaches, the basis for decision making comes down to comparing and positioning the patient with regard to a previously known population, and comparing the available information for diagnosis, prognosis and treatment planning.

Despite the proven value of conventional guidelines and recommendations, they have important limitations which must be taken into account. In their formulation, a thorough exploration of the original rich data space (e.g. imaging and signal data) is often simplified and replaced by the exploration of a handful of easy, interpretable reference measurements. This type of approach is unlikely to capture complex patterns that clinicians identify, for example, with visual inspection of raw imaging or signal data. Guidelines are also commonly based on hard thresholds on the selected measurements, which often amounts to an inadequate "binarization" of complex transitions from healthy to pathological states. Furthermore, they are conventionally extracted from clinical-trial-like scenarios, where every aspect from inclusion criteria, data acquisition, treatment and follow-up is strictly protocolized and standardized. This often limits their generalizability to routine clinical practice, where all the aforementioned steps can be highly non-standardized. Given these limitations, the clinician needs to critically assess the validity of a certain guideline in a context-dependent manner and thus, rightfully, uses them as a "guide" rather than a rule. In summary, evidence-based knowledge helps bridge gaps in eminence-based knowledge, and vice versa, towards making the best decision for the patient.


Fig. 8.2 Potential advantages of machine learning (ML) over conventional clinical guidelines: (1) taking better advantage of all available information, by accommodating raw data in the learning framework; (2) extending the scope of evidence-based decision support to real-world clinical settings. The latter step is especially challenging, requiring special caution upon implementation

Figure 8.2 illustrates how AI (or, more specifically, machine learning) can impact evidence-based decision support. On the one hand, because AI allows complex patterns to be captured directly, and jointly, from heterogeneous sources of data (e.g. images, signals), it represents an opportunity to develop evidence-based decision support systems that take better advantage of all available information, compared to conventional clinical guidelines. Indeed, it can even help uncover complex patterns and associations in the data that clinicians are not able to identify (or have not been trained to recognize) by simple visual inspection. On the other hand, to extend the scope of evidence-based decision support to real-world clinical scenarios, AI can be applied to data closer to clinical reality. However, there are important challenges to consider when it comes to the use of AI for decision support, especially when based on data acquired in routine clinical practice (i.e. typically non-standardized). Every approach should, as described above, balance a "hard" evidence-based side with a "soft" eminence-based point of view, taking into account the limitations of the data/knowledge used. As such, the selection of an AI-based approach should always be cautiously considered in light of the problem at hand and its associated challenges. The next section briefly elaborates on common challenges of AI in decision support contexts.

8.3 Current Challenges and the Importance of Interpretability

First of all, let us address the concept of interpretability in the context of AI. Its definition has been widely debated, and there is currently no unique (or specific enough to enable formalization) definition for it [1, 2]. It is out of the scope of this chapter to elaborate on this matter; however, for the sake of clarity, let us define it as the ability to understand the logical reasoning of an AI system behind a suggested clinical decision, in light of the input clinical features.

As previously noted, the emergence of deep learning has brought unprecedented performance in many predictive tasks, and it has quickly become a dominant framework in AI-based biomedical research. However, with a mindset centred on maximising performance, some important clinical aspects are often overlooked: interpretability and the ability for individual patient use. Indeed, the performance of deep learning models, especially when a large amount of training data is available, is possible in great part due to their complexity, which, as previously mentioned, often turns them into hardly-interpretable black boxes in the eyes of a clinician. This is not much of a problem in scenarios where they are used to automate (manual) image processing tasks, such as segmentation. Even if they are always biased, to some extent, towards the data they were trained with, they still offer the best state-of-the-art "initial guess", which can easily be evaluated and adjusted, if needed, by the clinician. As such, their integration into clinical workflows can improve the efficiency and standardization of these cumbersome and observer-dependent tasks. In a decision support scenario, on the other hand, the output is typically a single diagnosis/outcome label, whose error clinicians cannot intuitively evaluate and correct, as they can for a poorly predicted segmentation. Because these models' aim is to directly influence the course of patient management, prediction errors bear serious risks. Herein, we briefly elaborate on some scenarios where lack of interpretability of an AI prediction model could become hazardous for an individual patient in a real-world clinical decision support environment.

Imbalance & overfitting: Often, the available data for learning are not representative of the true clinical variability of the problem at hand, especially when based on controlled clinical trials. As a result, the models are biased towards their training sets, and generalize poorly when presented with previously unseen samples [3], e.g. new patient phenotypes, outcomes or new scanners/data acquisition systems. Finding a representative sample of a disease or outcome of interest is often particularly challenging (leading to the well-known issue of class imbalance). Given the crucial importance of recognising the outlier patient (with a rare disease/complication/presentation), which often involves high-risk decisions, it is important that any AI-based decision support system has some incorporated way of assessing/communicating the representativeness of the individual at hand with regard to the training data. We return to the issue of bias and, in particular, data provenance in Chap. 9.

Dichotomization: Another relevant issue that limits generalizability to real-world clinical settings is the tendency to dichotomize disease for prediction purposes [4], when in many real-world scenarios it would more appropriately be represented as a spectrum. Again, this is partly related to the use of controlled clinical trial data, where extremes are often compared rather than the full spectrum of a routine clinical population, as well as to the (legally inspired) use of clinical guidelines with hard decision thresholds when labelling the training data.


Reliability of clinical data: The accuracy and reliability of clinical data are much lower than often assumed, in contrast to many other real-world data sources. On the one hand, a rule of thumb for the uncertainty on a clinically measured variable is ±5% under ideal circumstances, and this can increase in routine practice. On the other hand, and this goes together with the above comments on dichotomization, many outcome labels used for training are just a temporal snapshot of the status of the patient; even for hard outcomes such as death, unless the full lifespan of the patient is incorporated (which is almost never the case) there is no way of knowing the status the day after the assessment. Similarly, except in very controlled cases, clinicians work with a differential diagnosis, taking into account comorbidities, which also makes diagnostic labels in routine clinical practice less of a "given fact" than is often assumed in AI approaches.

Adversarial attacks: Adversarial attacks consist of perturbations to the inputs of an AI system, explicitly calculated to "fool" it into misclassifying them. It is often the case that tiny, visually imperceptible changes at the individual pixel level can induce an AI system into (confidently) making contradictory decisions. In [5], (adversarial) noise was added to make visually imperceptible changes to the image of a benign mole (originally classified as benign with >99% confidence), fooling the model into classifying it as malignant with 100% confidence. While the added noise has near-zero probability of occurring by chance, the authors also show that simpler perturbations, such as a precise rotation, are sometimes enough to induce the AI system into switching diagnoses (a minimal sketch of one such perturbation method follows this list of scenarios). Adversarial attacks expose, on the one hand, generalizability issues, and, on the other hand, vulnerability to manipulation, with potentially important implications in a multitude of scenarios, including insurance fraud, biasing trial outcomes in one's favour, and others [5].

Biased practice: Real-world clinical practice is highly non-standardized and location-/environment-dependent. A biased practice can itself lead to the learning of dangerous associations, which are harder to detect the more complex the model at hand. For example, Ambrosino et al. [6] learned a rule-based model for risk assessment and prognosis in pneumonia. Because the asthmatics in the training dataset had been treated more aggressively upon presentation than non-asthmatics, which lowered their risk of a poor outcome, the model learned the rule that asthma lowers risk in pneumonia. The interpretability of this type of model allows dangerous associations of this nature to be more easily detected and corrected. If a complex, hardly-interpretable model had been used instead, this type of association could be much more difficult to detect, and said model would become a hazard if transferred to a clinical workflow.

Causality vs. association: In [7], Zech et al. used a deep learning framework to detect pneumonia in chest radiographs. The authors point out how the model could learn to assign a higher probability of pneumonia to radiographs extracted from a specific scanner if said scanner had been used to examine higher-risk populations, or how it could learn to detect obvious chest tubes over a subtler pneumothorax. This illustrates the relevance of the causality vs. association issue, where highly predictive variables are not necessarily causative of the outcome of interest [8].
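To make such perturbations concrete, the minimal sketch below implements the fast gradient sign method (FGSM), one simple and well-known way of generating adversarial inputs. It is not the attack used in [5], and the PyTorch model and tensor shapes are illustrative stand-ins rather than a clinical system.

```python
import torch
import torch.nn as nn

# Stand-in classifier (illustrative only): a 64x64 image flattened into
# a single linear layer with two output classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))
model.eval()

def fgsm_attack(image, true_label, epsilon=0.01):
    """One signed-gradient step that increases the loss for the true class."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), true_label)
    loss.backward()
    # Tiny per-pixel step in the direction that most increases the loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

x = torch.rand(1, 1, 64, 64)   # hypothetical input image
y = torch.tensor([0])          # its correct label (e.g. "benign")
x_adv = fgsm_attack(x, y)
print((x_adv - x).abs().max())                     # perturbation at most epsilon
print(model(x).argmax(1), model(x_adv).argmax(1))  # prediction may flip
```

Even with a perturbation bounded by a small epsilon, and therefore invisible to the eye, the predicted class can change; the same logic scales to large image classifiers.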

Fig. 8.3 Challenges and directions for AI in different clinical tasks

In the end, be it due to generalizability or bias concerns, decision support systems based on hardly-interpretable predictions have limited value in a real-world clinical scenario, even if test performance is excellent: when presented with a suggestion, clinicians need to understand it in order to trust it. We note here that the explainability of a model's decisions is not necessarily the same as interpretability in a clinical context. It is not enough to provide information on which (part of the) data and which values are used for the decision; a clinician also needs to know how this relates to the pathophysiological processes driving the change in the data, just as is done in clinical guidelines and decision trees. The realization that hardly-interpretable decision support models have a limited chance of being deployed, and thus of actually impacting a real-world clinical environment, is motivating a more interpretability-conscious exploration of AI for decision support. Knowing that interpretability often comes at the expense of performance, and vice versa, the challenge is in finding a good compromise. Figure 8.3 summarizes the main takeaways of this section.

8.4 Addressing Challenges With Interpretable AI—The Potential of Representation Learning

In the pursuit of more interpretable AI approaches, many have turned to representation learning: obtaining simplified, latent, (typically) lower-dimensional representations of the data that are interpretable in light of the original data space. These representations can be explored in both supervised and unsupervised contexts.1 In supervised contexts, they can be obtained via (supervised) dimensionality reduction techniques (e.g. linear discriminant analysis (LDA) [9]), but they can also be retrieved from (and thus help to address interpretability issues in) prediction models, e.g. deep learning models contain such representations in their hidden layers. In supervised contexts, the obtained representation is constrained by labelling information, and optimized for the specific prediction problem at hand. By definition, this makes it more sensitive to labelling inaccuracies, such as the previously discussed improper "dichotomization". On the other hand, unsupervised representation learning (or dimensionality reduction) methods (e.g. principal component analysis (PCA) [10]; manifold learning techniques [11]) obtain latent representations that capture data patterns in a process that is "blind" to labelling information, grouping individuals based on similarity. In this section, we briefly elaborate on how latent representations can be the basis for a more interpretable approach to AI-based clinical decision support. For illustration, we focus on a generic unsupervised representation learning scenario.

1 Supervised and unsupervised learning approaches were first introduced in Chap. 2.

Fig. 8.4 Obtaining a simplified representation of complex data via unsupervised dimensionality reduction. Close (distant) positioning of individuals indicates a similar (dissimilar) presentation

Exploring (Unsupervised) Representation Learning for Decision Support

Unsupervised representation learning allows the formation of simplified representations of complex data, where individuals are organized according to their similarity in the original data. Take the illustration in Fig. 8.4. The same set of features is extracted for a set of patients. These features can be of any nature and dimensionality, e.g. images, signals, single-value clinical variables and so forth. Unsupervised dimensionality reduction techniques will first identify the most relevant factors of variation in the data, according to some method-dependent criterion, and discard those considered irrelevant (e.g. noise, redundancies). A simplified representation of the original data is obtained, in which position is parameterized by the main factors of variation only. Intuitively, a close (distant) positioning of individuals indicates a similar (dissimilar) expression of the measured features. So we know that in small neighbourhoods of the simplified representation we will find patients who present similarly across their available conglomerate of information; we can then link different regions with specific patient presentations (phenogrouping), based on the regional expression of the input features (see Fig. 8.5, left).
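As a minimal sketch of this embedding-and-phenogrouping idea, the code below builds a two-dimensional simplified representation with PCA and delineates candidate phenogroups with k-means. The feature matrix is a random stand-in, and PCA/k-means are assumptions; any other dimensionality reduction or grouping method could be substituted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))   # toy data: 200 patients x 30 features

# Standardize, then keep only the main factors of variation.
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Candidate phenogroups = neighbourhoods of the simplified representation.
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)

# Characterize each phenogroup by the predominant values of the input features.
for g in range(3):
    print(f"phenogroup {g}: n={np.sum(groups == g)}, "
          f"first feature means={X[groups == g].mean(axis=0)[:3].round(2)}")
```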

Fig. 8.5 Phenogrouping and identifying phenotype-specific risk assessment and treatment planning, based on the simplified representation

This step makes the simplified representation more interpretable and is conceptually similar to what clinicians currently do in daily practice: identifying comparable individuals in different populations of healthy and diseased. In the example, for the sake of simplicity, features are considered to be single-valued, and the predominant values of the different features within each phenogroup are graphically illustrated in bar charts; however, the same type of reasoning would apply to higher-dimensional features—if features were, for example, curves, the predominant shapes in each phenogroup would be displayed instead.

If knowledge of outcomes of interest is available, each region (phenogroup) can also be linked to a specific risk of experiencing said outcome, based on regional incidence (see Fig. 8.5, middle). Because outcome labels do not take part in the learning of the simplified representation, this type of approach is more robust to the "dichotomization" issue of the previous section, allowing the identification of "grey zones" with intermediate phenotypes and outcome incidences between the "healthy" and "disease" extremes, and therefore modelling disease as a spectrum. If knowledge of treatment response is available, different regions (phenogroups) can also be linked with the treatments which provide the best chances of success (see Fig. 8.5, right).

After the simplified representation is well characterized, it can be used to support interpretation and decision making when handling new patients: the projection model used to map the original (learning) patient set to the simplified representation space can also be used to project data of new patients; new patients will be projected close to the projections of similarly presenting patients of the original patient set. With this, the new patients are positioned with regard to the existing phenogroups, and the previous characterization of the space can assist the clinician in personalized risk assessment and treatment planning (see Fig. 8.6).

Fig. 8.6 Personalized risk assessment and treatment selection for new patients based on regional positioning in the simplified representation

This type of framework thus allows clinicians to access important insight as to "why" a certain decision is suggested (i.e. tracing back to "why" a given patient is positioned in a specific phenogroup). Furthermore, the fact that it, in a way, mimics the experienced clinician's reasoning—positioning new patients with regard to previously seen cases and using knowledge of interventions and resulting outcomes to support decisions—adds to its potential for a harmonious integration in a clinical environment, using evidence to learn from, but providing eminence-like use of the support information.
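Continuing the earlier toy setup, the sketch below illustrates how a new patient could be projected into an already-learned simplified representation and assigned a regional risk estimate from the outcomes of the nearest previous patients. The data, outcome labels and choice of 15 neighbours are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))           # previous patients (toy data)
outcomes = rng.integers(0, 2, size=200)  # their known outcomes (hypothetical)

# Outcomes play no part in learning the representation itself.
scaler = StandardScaler().fit(X)
pca = PCA(n_components=2).fit(scaler.transform(X))
Z = pca.transform(scaler.transform(X))

# Project a new patient into the existing simplified representation...
x_new = rng.normal(size=(1, 30))
z_new = pca.transform(scaler.transform(x_new))

# ...and read off a regional risk estimate from the nearest previous patients.
idx = NearestNeighbors(n_neighbors=15).fit(Z).kneighbors(
    z_new, return_distance=False)
print("estimated regional risk:", outcomes[idx[0]].mean())
```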

In the field of cardiovascular disease, for example, this type of approach has recently been used in the phenogrouping of heart failure [12−14] and in the identification of patients benefiting from cardiac resynchronization therapy [14], based on myocardial velocity traces and other clinical data.

Representation Learning from Multiview Data

One already mentioned challenge in cardiovascular care is the integration and joint interpretation of very heterogeneous sources of information (e.g. image, signal). Using conventional machine learning methods, such as the previously mentioned LDA, PCA or classical manifold learning techniques, joint learning from heterogeneous data requires that they are somehow combined (e.g. through concatenation [15]) into a single feature vector. In the process, potentially important differences regarding functional and structural properties are ignored. Multiview learning is a more recent learning paradigm that allows differential handling of the different "views" of the data, taking into account their specific characteristics and aiming at improved learning performance. Unsupervised multiview representation learning techniques can be divided into two main categories: alignment-based and fusion-based [16]. Alignment-based techniques find simplified representations for each of the different views that maximize certain alignment metrics (e.g. correlation, similarity). Well-known examples are canonical correlation analysis (CCA) [10] or partial least squares [17], and derived techniques (e.g. kernel CCA [18], deep CCA [19]). Fusion-based techniques, on the other hand, merge information from the different views before learning a single, compact representation of the data. Well-known examples include unsupervised multiple kernel learning (MKL) [12, 20] and multiview generalizations of neural network models (e.g. the multi-modal deep autoencoder [21]).
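The alignment-based flavour can be sketched with scikit-learn's CCA applied to two toy "views" of the same patients; the random matrices below are stand-ins for, say, imaging-derived and clinical features.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
view_imaging = rng.normal(size=(n, 40))   # e.g. image-derived features
view_clinical = rng.normal(size=(n, 10))  # e.g. lab values, demographics

# Find paired low-dimensional representations of the two views whose
# components are maximally correlated (alignment-based multiview learning).
cca = CCA(n_components=2)
Zi, Zc = cca.fit_transform(view_imaging, view_clinical)

for k in range(2):
    r = np.corrcoef(Zi[:, k], Zc[:, k])[0, 1]
    print(f"canonical component {k}: correlation {r:.2f}")
```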

In the previously referred heart failure studies [12−14], for example, unsupervised MKL was used to merge the information contained in the multiple velocity traces and other clinical data. When it comes to supervised representation learning, many conventional methods have also seen their formulations generalized to multiview scenarios. Examples are multiview formulations of linear discriminant analysis [22−24], of support vector machines (SVMs)—e.g. the multiple kernel formulation (MKL-SVM) [25] or the multiview Laplacian formulation [26]—or multiview deep learning frameworks. In the cardiovascular field, for example, multiview LDA and multiview Laplacian SVMs were used to identify dilated cardiomyopathy by combining cardiac motion information from echocardiography and magnetic resonance images [26], and MKL-SVM was used to combine cardiac motion information with other clinical parameters for cardiac resynchronization therapy response prediction [27]. For more complete overviews of multiview learning techniques, we refer the reader to [15, 16, 28].

Latent Trajectories—Representation Learning From Dynamic Data

In many real-world scenarios, not only do data come in heterogeneous formats, but they are also dynamically changing, at different timescales depending on the problem at hand. The described framework can also accommodate dynamic data: dynamic changes in patient data will translate into dynamic changes in their positioning in the simplified representation; thus, in this case, each patient will be linked not to a single point, but to a trajectory in the simplified representation space. Patients can then be compared in the trajectory space, for diagnosis or monitoring purposes: the trajectory of a "new patient" can be positioned with regard to those of previous patients with known outcomes. In any case, given that initial presentation can be very heterogeneous, and that the characteristics of "healthy" and pathological trajectories can highly depend on it, it makes sense, once again, to restrict comparisons to patients with similar initial presentations. After this starting-point alignment (in some scenarios, an alignment based on additional temporal landmarks, e.g. through registration, might make sense), trajectories can, in principle, be quantitatively compared in their (aligned) raw form (although some type of interpolation might be needed to compensate for different temporal samplings), or via other time series comparison techniques (e.g. feature-based or model-based [29]).

Figure 8.7 illustrates a hypothetical real-time monitoring scenario that extends the example of Figs. 8.5−8.6: the trajectory of the "new patient" after receiving "treatment B" is monitored from initial presentation (t0) to time point t, and compared to that expected based on the "healthy" trajectories of patients with similar initial presentations subjected to the same treatment. In this case, we see a significant deviation between the expected and the observed trajectory, which should alert the healthcare provider to the possibility that something is wrong. At this point, every aspect from risk estimates to treatment strategies should be re-evaluated, and the reference "healthy" path re-estimated, based on the new conditions.

Fig. 8.7 Hypothetical scenario for the dynamic evolution of the new patient of Fig. 8.6. Dynamic data are seen as trajectories in the simplified representation space. A personalized estimate of a reference healthy path can be obtained based on the healthy trajectories of previous similar patients. In this scenario, the new patient's trajectory is deviating from the expected one. Risk assessment, treatment planning and the expected healthy path should be continuously updated

The previously described framework is thus compatible with the analysis of dynamic data, with possible applications in real-time monitoring and decision support situations. In the field of cardiovascular disease, "latent trajectories" have been explored, for example: to position individuals in the heart failure spectrum, based on their trajectories during stress tests, using unsupervised MKL on myocardial velocity waveforms [30]; (in single-view contexts) for the detection of electrocardiogram anomalies, using a manifold learning algorithm [31]; for the unveiling of abnormal LV (shape) dynamics in hypertrophic cardiomyopathy, using PCA [32]; and for extracting latent heartbeat trajectories from echocardiography video data, using an autoencoder-based framework, demonstrating their clinical relevance in various predictive tasks, such as EF prediction [33].
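A minimal sketch of the trajectory comparison step: two toy latent trajectories with different temporal samplings, both assumed to be already aligned at initial presentation, are interpolated onto a common time grid and compared pointwise. The data and the simple Euclidean deviation are illustrative choices; feature-based or model-based comparisons [29] could be substituted.

```python
import numpy as np

def resample(times, traj, grid):
    """Linearly interpolate each latent dimension onto a common time grid."""
    return np.stack([np.interp(grid, times, traj[:, d])
                     for d in range(traj.shape[1])], axis=1)

rng = np.random.default_rng(0)

# Toy 2D latent trajectories with different temporal samplings, both
# aligned to start at initial presentation (t = 0).
t_ref = np.linspace(0, 1, 20)
ref = np.stack([np.sin(t_ref), np.cos(t_ref)], axis=1)        # expected path
t_new = np.sort(rng.uniform(0, 1, 15))
new = np.stack([np.sin(t_new) + 0.3, np.cos(t_new)], axis=1)  # observed path

grid = np.linspace(0, 1, 50)
dev = np.linalg.norm(resample(t_ref, ref, grid)
                     - resample(t_new, new, grid), axis=1)
print(f"mean deviation: {dev.mean():.2f}, max: {dev.max():.2f}")
# A persistent, large deviation from the expected path could trigger
# re-evaluation of risk estimates and treatment strategy.
```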

8.5 Closing Remarks

Throughout this section, we have focused on representation learning as one possible way to achieve more interpretable decision support solutions. It is important to remember that this type of approach is not free of challenges; indeed, small training samples, class imbalance and practice biases still represent challenges for it. However, the approach intrinsically mimics and supports the classical clinical thought and decision processes, and it explicitly explores the variability in the training populations, which can easily be linked to pathophysiological processes. The clinician is thus given essential insight into the individual's data to better discern whether a suggested decision has a pathophysiological basis or derives from methodological susceptibilities; in the latter scenario, they can still explore the representation for understanding and integrating all available data, and therefore for supporting decisions. Ultimately, this makes a substantial difference regarding compatibility, as well as natural integration, with real-world clinical workflows, when compared to purely black-box approaches.

Acknowledgements This research was partly funded by the Fundació La Marató de TV3—Ref 202016-30-31.

References
1. Došilović FK, Brčić M, Hlupić N. Explainable artificial intelligence: a survey. In: 41st international convention on information and communication technology, electronics and microelectronics (MIPRO). 2018. p. 0210−5.
2. Gilpin L, Bau D, Yuan B, Bajwa A, Specter M, Kagal L. Explaining explanations: an overview of interpretability of machine learning. In: Proceedings of the IEEE 5th international conference on data science and advanced analytics (DSAA). 2018. p. 80−9.
3. Chen C, Qin C, Qiu H, Tarroni G, Duan J, Bai W, Rueckert D. Deep learning for cardiac image segmentation: a review. Front Cardiovasc Med. 2020;7:25.
4. Johnson KW, Soto JT, Glicksberg BS, Shameer K, Miotto R, Ali M, Ashley E, Dudley JT. Artificial intelligence in cardiology. J Am Coll Cardiol. 2018;71(23):2668−79.
5. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science. 2019;363(6433):1287−9.
6. Ambrosino R, Buchanan B, Cooper G, Fine M. The use of misclassification costs to learn rule-based decision support models for cost-effective hospital admission strategies. In: Proceedings of the annual symposium on computer application in medical care. 1995. p. 304−8.
7. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018;15(11):e1002683.
8. Shameer K, Johnson KW, Glicksberg BS, Dudley JT, Sengupta PP. Machine learning in cardiovascular medicine: are we there yet? Heart. 2018;104(14):1156−64.
9. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugenics. 1936;7(2):179−88.
10. Hotelling H. Relations between two sets of variates. Biometrika. 1936;28(3−4):321−77.
11. Cayton L. Algorithms for manifold learning. UCSD Tech Rep CS2008−0923; 2005.
12. Sanchez-Martinez S, et al. Characterization of myocardial motion patterns by unsupervised multiple kernel learning. Med Image Anal. 2017;35:70−82.
13. Sanchez-Martinez S, Duchateau N, Erdei T, Kunszt G, Aakhus S, Degiovanni A, Marino P, Carluccio E, Piella G, Fraser AG, Bijnens BH. Machine learning analysis of left ventricular function to characterize heart failure with preserved ejection fraction. Circ Cardiovasc Imaging. 2018;11(4).
14. Cikes M, Sanchez-Martinez S, Claggett B, Duchateau N, Piella G, Butakoff C, Pouleur AC, Knappe D, Biering-Sørensen T, Kutyifa V, Moss A, Stein K, Solomon SD, Bijnens B. Machine learning-based phenogrouping in heart failure to identify responders to cardiac resynchronization therapy. Eur J Heart Fail. 2019;21(1):74−85.
15. Xu C, Tao D, Xu C. A survey on multi-view learning. arXiv; 2013.
16. Li Y, Yang M, Zhang Z. A survey of multi-view representation learning. IEEE Trans Knowl Data Eng. 2018.
17. Wold H. Partial least squares. John Wiley & Sons, Inc.; 1985.
18. Akaho S. A kernel method for canonical correlation analysis. arXiv; 2006.
19. Andrew G, Arora R, Bilmes J, Livescu K. Deep canonical correlation analysis. In: Dasgupta S, McAllester D, editors. Proceedings of the 30th international conference on machine learning, Proceedings of machine learning research, vol. 28. Atlanta: PMLR; 2013. p. 1247−55.
20. Lin Y, Liu T, Fuh C. Multiple kernel learning for dimensionality reduction. IEEE Trans Pattern Anal Mach Intell. 2011;33(6):1147−60.
21. Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY. Multimodal deep learning. In: ICML; 2011. p. 689−96.
22. Diethe T, Hardoon D, Shawe-Taylor J. Multiview Fisher discriminant analysis. In: NIPS workshop on learning from multiple sources; 2008.
23. Kan M, Shan S, Zhang H, Lao S, Chen X. Multi-view discriminant analysis. IEEE Trans Pattern Anal Mach Intell. 2016;38(1):188−94.
24. Sun S, Xie X, Yang M. Multiview uncorrelated discriminant analysis. IEEE Trans Cybern. 2016;46(12):3272−84.
25. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. Large scale multiple kernel learning. J Mach Learn Res. 2006;7:1531−65.
26. Puyol-Antón E, Ruijsink B, Gerber B, Amzulescu MS, Langet H, De Craene M, Schnabel JA, Piro P, King AP. Regional multi-view learning for cardiac motion analysis: application to identification of dilated cardiomyopathy patients. IEEE Trans Biomed Eng. 2019;66(4):956−66.
27. Peressutti D, Sinclair M, Bai W, Jackson T, Ruijsink J, Nordsletten D, Asner L, Hadjicharalambous M, Rinaldi CA, Rueckert D, King AP. A framework for combining a motion atlas with non-motion information to learn clinically useful biomarkers: application to cardiac resynchronisation therapy response prediction. Med Image Anal. 2017;35:669−84.
28. Wang W, Arora R, Livescu K, Bilmes J. On deep multi-view representation learning. In: Proceedings of the 32nd international conference on machine learning, ICML'15. JMLR.org; 2015. p. 1083−92.
29. Warren Liao T. Clustering of time series data—a survey. Pattern Recogn. 2005;38(11):1857−74.
30. Nogueira M, De Craene M, Sanchez-Martinez S, Chowdhury D, Bijnens B, Piella G. Analysis of nonstandardized stress echocardiography sequences using multiview dimensionality reduction. Med Image Anal. 2020;60:101594.
31. Li Z, Xu W, Huang A, Sarrafzadeh M. Dimensionality reduction for anomaly detection in electrocardiography: a manifold approach. In: Ninth international conference on wearable and implantable body sensor networks. 2012. p. 161−5.
32. Madeo A, Piras P, Re F, Gabriele S, Nardinocchi P, Teresi L, Torromeo C, Chialastri C, Schiariti M, Giura G, Evangelista A, Dominici T, Varano V, Zachara E, Puddu PE. A new 4D trajectory-based approach unveils abnormal LV revolution dynamics in hypertrophic cardiomyopathy. PLoS One. 2015;10(4):1−33.
33. Laumer F, Fringeli G, Dubatovka A, Manduchi L, Buhmann JM. DeepHeartBeat: latent trajectory learning of cardiac cycles using cardiac ultrasounds. In: Proceedings of the machine learning for health NeurIPS workshop, Proceedings of machine learning research, vol. 136. PMLR; 2020. p. 194−212.


AI in the Real World

Alistair A. Young, Steffen E. Petersen and Pablo Lamata

Contents
9.1 Introduction
9.2 Asking the Right Question
9.3 Provenance of Data
9.4 Structural Risk
9.5 Shallow Learning
9.6 Does My Model Look Good in This?
9.7 Mechanistic Models for AI Interpretability
9.8 Utility of Community-Led Challenges
9.9 Closing Remarks
References

Authors’ contribution: • Main chapter: AY, SP, PL.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_9


Learning Objectives
At the end of this chapter you should be able to:
O9.A Describe the importance of considering how AI models will fit into clinical workflows, and the validation of their impact on patients in these workflows
O9.B Explain how data provenance issues can lead to biases in AI model performance
O9.C Explain the risks and sources of overfitting in AI methods for cardiology and some strategies for mitigating these risks
O9.D Summarize efforts that have been made to improve reporting of AI methods in healthcare with a view to ensuring patient benefit
O9.E Explain the importance of interpretability in AI for cardiology applications, and what is meant by a "digital twin"

“Real motive problem, with an AI. Not human, see?”—Dixie Flatline, in Neuromancer by William Gibson (1984).

9.1 Introduction

AI methods have shown considerable promise in assisting clinical workflows. However, despite initial successes and high-profile studies, naïve applications of AI methods to cardiology risk increasing harm to the patient. Deep learning methods have the potential to provide workable solutions to previously untenable problems. However, several common pitfalls need to be considered when applying these methods in practice, since the temptation is to short-circuit the usual design and testing processes common in traditional engineering. Since AI methods have naturally evolved from classical data science methods, they inherit and amplify many of the associated known pitfalls. Recently, many of these issues were highlighted in the COVID-19 pandemic, in which AI-based algorithms for risk prediction failed to replicate in other centres [1]. Unless steps are taken to mitigate these problems, deep learning methods tend to provide shallow, brittle solutions which exploit feature creep or implicit bias in the design. This chapter seeks to identify these common problems and misconceptions and offer methods for their mitigation.

Firstly, we need to consider the purpose of the algorithm and whether the right question is being addressed. The problem of overfitting (see Chap. 2, Sect. 2.5), ubiquitous in deep learning methods, is also discussed from the point of view of clinical applications. Dependencies on the provenance of the data and issues with training algorithms based on human reader annotations are highlighted. We then review recommendations for AI applications in cardiovascular imaging and risk prediction. Issues of causality and interpretability are considered in the context of a "digital twin" in cardiology, and we discuss how mechanistic models of physiological and biophysical processes can aid the interpretation and application of AI methods. Finally, we review requirements for reproducible results and advocate for increased use of open benchmark datasets and open algorithm design to enable reproducibility and validation and facilitate robust advancement of the field.

9.2 Asking the Right Question

AI has inherited many pitfalls from data science and statistical inference [2], and in many ways is a natural progression in human evolution [3]. However, the combined power of computing resources and complex automated algorithms means these problems can become amplified several-fold. For example, the problem of whether a particular algorithm is appropriate in a particular context is similar to the application of appropriate statistical tests in traditional clinical studies, in that tests need to be suitable for the hypothesis and should test it appropriately. Common tasks in which AI has recently made an impact include identification of disease (binary) from data such as images, calculation of risk (continuous), or flagging possible areas of anomaly for human review [4].

A primary clinical consideration is the principle of "do no harm". Medical imaging in particular has the potential to cause more harm than good if disease is flagged as being actionable but is actually benign, or vice versa. Victims of medical imaging technology are becoming increasingly prevalent [5]. Screening programmes are particularly fraught since prevalence is low and the costs and harm produced by false positives can be high. An example of lack of "fitness for purpose" was demonstrated in an algorithm designed to identify critical findings in head CT [6], in which the area under the receiver operating characteristic curve (AUC) (see Chap. 2, footnote 7) was high in several tasks, including detection of intracranial hemorrhage (AUC 0.90). However, algorithms targeted for high sensitivity (the ability to predict true positives if the disease is present) can result in a high false discovery rate (the proportion of all positive predictions which are false) and a low positive predictive value (the proportion of all positive predictions which are true). In the example of intracranial hemorrhage (AUC 0.90, sensitivity 0.90, specificity 0.73), the positive predictive value was 31% (over two-thirds of all positive predictions were false), which could lead to overtreatment of patients. Also, the false negative rate (the proportion of false negatives if the disease is present) was 10%, potentially leaving a not-insignificant number of patients with disease without treatment [7]. The question of which metric to target is highly dependent on the prevalence (the proportion of all patients who have the disease).

The AUC is a useful metric for determining algorithm performance. However, the ROC itself may also provide useful information. This can depend on the application, as sometimes the ROC appears to give no added value above that of the AUC when evaluating the performance of clinical prediction models [8]. However, ROC analysis is beneficial when an average performance needs to be estimated. For example, when comparing the performance of algorithms against the "average" reader, summary ROCs should be evaluated rather than simply averaging each reader's specificity and sensitivity independently (which results in underestimation of average reader accuracy) [9]. Note that cross-validation studies should also form summary ROCs across folds when reporting average performance.
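The dependence of predictive values on prevalence follows directly from Bayes' rule. The sketch below reproduces numbers of the same order as the intracranial hemorrhage example above, assuming a prevalence of roughly 12% (the value implied by the quoted sensitivity, specificity and positive predictive value, not a figure reported by the study):

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive/negative predictive value at a given disease prevalence."""
    tp = sensitivity * prevalence                # true positive fraction
    fp = (1 - specificity) * (1 - prevalence)    # false positive fraction
    fn = (1 - sensitivity) * prevalence          # false negative fraction
    tn = specificity * (1 - prevalence)          # true negative fraction
    return tp / (tp + fp), tn / (tn + fn)

ppv, npv = predictive_values(sensitivity=0.90, specificity=0.73, prevalence=0.12)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # PPV ~ 0.31: most positives are false

# The same test in a low-prevalence screening setting:
ppv, _ = predictive_values(0.90, 0.73, prevalence=0.01)
print(f"PPV = {ppv:.2f}")                   # PPV collapses further
```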


In a landmark study of breast cancer [10], a screening algorithm was shown to improve false positive and false negative rates as well as reduce workload in a double-read process. However, application to cancer screening remains fraught due to low prevalence and the potential for over-diagnosis even if the test is more efficient.1 Detection of malignancy may provide a more applicable tool, but clinical trials would be needed to test whether net harm is reduced. In a comparison of three commercial methods for mammography screening, the most accurate configuration was obtained by combining the best algorithm with first-reader radiologists [11].

In addition to clinical trials, prospective experience in real-world applications is often fraught [12]. In a report on the deployment of an AI screening tool to assist the detection of diabetic retinopathy [13], challenges were described in integrating with the clinical workflow and maximizing benefit to patients and practitioners. Changes in workflow were particularly difficult to adopt, and socio-economic factors led to variable algorithm performance. Variable image quality in real-world settings was also problematic, as the algorithm had been trained using high-quality images. Interestingly, feedback of both images and reports to patients was useful for the explanation and communication of results. This also enabled nurses to improve their ability to interpret the images and results. However, detection of disease by the algorithm resulted in additional workload placed on practitioners at the point of test, who then needed to counsel patients on their next course of action. Also, this action typically involved cost and effort on the part of the patients, who needed to be aware of this possibility before agreeing to take the test.

In testing performance, it is important to compare results on an external cohort (not seen by the algorithm previously, typically from a different centre) against the performance of health professionals [14]. An initial meta-analysis showed that only 17% of published studies to June 2019 included a comparison of out-of-sample performance with health professionals. More recent clinical evaluations are becoming available [15, 16]. Clinical utility should be evaluated rather than predictive performance alone, since the latter is often assessed in isolation from commonly available clinical information [15]. Application in a clinical workflow therefore requires consideration of costs and benefits, i.e. the value of the AI solution. In particular, algorithms which can recommend a change of treatment need to quantify how treatment would change, the cost of overtreatment and other impacts on the patient.

1 https://www.wired.com/story/artificial-intelligence-makes-bad-medicine-even-worse/

9.3 Provenance of Data

A common adage in machine learning is the "unreasonable effectiveness of data" [17], in which large quantities of data make simpler algorithms at least as effective as complex algorithms. However, it is also well known that algorithms are only as good as the data used in training. In training supervised AI algorithms, care must be taken to ensure that the ground truth is commensurate with the final application. In segmenting CMR images (an area in which AI methods can exceed human performance [18]), the results of the algorithm are very dependent on the human annotations used in training. Suinesiaputa et al. [19] found that bias can vary between core labs, even though CMR is known to have superior accuracy to most other imaging methods. Thus, algorithms trained using annotations from one core lab will produce biased results when compared with another. This is an area in which domain adaptation methods are currently under investigation to mitigate this risk. A further example of bias due to data provenance can be found in [20], in which a deep learning-based CMR segmentation model was found to be biased against minority racial groups as a result of a lack of representation of these groups in the training data.

9.4 Structural Risk

In classical machine learning, the generalizability of models is often described in terms of empirical risk and structural risk. Empirical risk can be estimated by performance on test data. Structural risk is often quantified by model complexity and can be minimized by making assumptions about the nature of the task. These assumptions are often built into the design of the model or the methods for regularization. The term "inductive bias" refers to the set of assumptions, either implicitly or explicitly included in the algorithm, that enable generalization to unseen datasets. An explanation for the ability of neural networks to perform well on unseen data despite having a large number of parameters might be that they inherently tend to produce regularized solutions (i.e. the sum of weight values in the network is constrained) due to the inductive bias implicitly built into deep learning architectures trained with stochastic gradient descent [21]. This leads to the phenomenon of better test performance with increasing network capacity, despite an overwhelmingly large number of parameters, in contradiction to the classical bias-variance tradeoff (see Chap. 2, Sect. 2.5).
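One way to make the link between constrained weights and regularization tangible is to impose the constraint explicitly. The toy PyTorch sketch below contrasts training a linear model with and without weight decay (an explicit L2 penalty on the weights); it illustrates the general idea only and says nothing about the implicit regularization of any specific deep architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 10)   # toy inputs
y = torch.randn(64, 1)    # toy targets

def train(weight_decay):
    model = nn.Linear(10, 1)
    # weight_decay adds an L2 penalty on the weights at every SGD step,
    # one explicit way of limiting structural risk / model complexity.
    opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=weight_decay)
    for _ in range(500):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), y).backward()
        opt.step()
    return model.weight.norm().item()

print("||w|| without decay:", train(0.0))
print("||w|| with decay:   ", train(0.1))  # constrained (smaller) weights
```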

9.5

Shallow Learning

As described previously, machine learning can be thought of as statistical modelling applied to complex data. As models become more complicated, the number of parameters to be optimized grows, so that deep learning models often have millions of parameters. Hence it could be said that all deep learning models are overfitted, even if steps are taken to mitigate this. For example, cross validation (see Chap. 2, Model Validation) is a procedure in which models trained on some partitions of the full dataset are tested on the remaining partitions. However, this does not protect against overfitting. Rather, it provides a measure of how much performance can degrade due to overfitting in that particular dataset. This is an extension of the adage that all models are wrong, but some are useful: all deep learning models are overfitted, but some still provide good performance. In cardiology, overfitting can manifest in different ways. This is part of the reason why algorithms often perform poorly in "out-of-distribution" scenarios. Since normal heart anatomy and function have relatively low variability, and pathology increases variability, models trained on healthy volunteers can fail when applied to patients. For example, models trained using data from the UK Biobank (a large cohort of relatively healthy adults) often fail on patients with congenital heart disease. Thus, deep learning is actually quite "shallow", in that typical models do not provide insight into the nature of the phenomena unless forced to do so. If they can solve a problem by learning spurious features (such as markings inadvertently left on films by institutions with sicker patients), they will. Such "feature creep" is common in AI applications and is often very hard to detect. Domain adaptation methods and data augmentation techniques (see Chap. 7, Motion Artefact Detection) may mitigate these problems, and this is an area of rapid development. In addition to the structural risk problem, AI methods commonly result in algorithms which fail differently in real-world deployment. This may be due to the under-specification inherent in common AI pipelines [22]. Typically, there can be many solutions (e.g. sets of parameters) which give similar performance under standard training/validation/test frameworks (in which all datasets are designed to be independent and identically distributed). However, these networks give rise to different failure modes when "stress-tested". One proposed solution to this problem is to combine multiple models trained on different cohorts (i.e. ensemble learning, see Technical Note, Sect. 4.4) [23], which has been successfully applied in data challenges.
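As a minimal illustration of the point about cross validation, the sketch below (scikit-learn, on a small simulated dataset; everything here is invented for illustration) shows the gap between training accuracy and cross-validated accuracy, which is a symptom of overfitting rather than a cure for it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))        # small dataset, many features
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

model = DecisionTreeClassifier()      # unconstrained tree: high capacity
model.fit(X, y)
train_acc = model.score(X, y)         # typically 1.0 (pure memorization)
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"training accuracy:               {train_acc:.2f}")
print(f"5-fold cross-validated accuracy: {cv_acc:.2f}")
# The gap quantifies how much performance degrades on unseen partitions;
# measuring it does not by itself prevent the overfitting.
```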

9.6

Does My Model Look Good in This?

Recently there have been several attempts to provide frameworks or guidelines for reporting on AI model provenance, usage and evaluation. One such framework [24] proposed a set of structured information which should be provided on intended usage and possible impacts on downstream users. This includes mechanisms to provide ethical evaluations, including evaluation benchmarks across different cultural, demographic or phenotypic groups, similar to issues arising in clinical trial design. This requires substantial time and effort but can help avoid unintended consequences later on, when they are difficult to correct or amend. In clinical applications, Sendak et al. [25] propose the use of a "Model Facts Label", borrowing from clinical regulatory terminology ("on-label" refers to indications for which a treatment is approved, and "Drug Facts" labels have been useful in risk communication). This is particularly relevant because, once algorithms are available in practice, there is a temptation to use them in applications for which they were not designed or tested. In an example discussed in that paper, a model was trained to predict the risk of death among patients presenting to a hospital with pneumonia. The algorithm found reduced risk in patients with asthma, which could have led to under-treatment of this group, when the lower risk was in fact due to the more intensive treatment this group already received. This is an example of "collider bias", which may be mitigated by the concept of a "label" for algorithms to clarify indications for use. A similar motivation has led to an extension of the CONSORT guidelines for reporting randomized clinical trials to the reporting of AI interventional trials (Consolidated Standards of Reporting Trials—Artificial Intelligence) [26]. This was designed to mitigate the risk of bias in reported outcomes and promote transparency, including a clear description of use cases, inputs and outputs, human-AI interaction, and provision of an analysis of error cases [27]. A companion effort, SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials—Artificial Intelligence), comprises recommendations for the design of clinical trials involving AI. A more radiology-focused set of guidelines was proposed by Mongan et al. [28]: a checklist for AI in medical imaging (CLAIM), based on the standards for reporting of diagnostic accuracy studies (STARD). A summary of the model design, initialization of parameters, methods for explainability or interpretability, and provenance of the ground truth is required to enable reproducibility. These recommendations have grown out of similar efforts in data science applied to multivariate risk prediction, such as the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) [29]. In cardiovascular imaging, a multidisciplinary position paper [30] recommended seven requirements for reporting AI solutions. These summarize guidelines for the design, implementation and reporting of studies, including best practice checklists to ensure an appropriate design and study plan, data cleaning, model specification, training, evaluation, replicability and reporting. In particular, code sharing and example datasets are seen as powerful tools to aid replication, in addition to experiments on external datasets. In the UK, the NHS has developed a set of guidelines to aid decisions about employing commercial AI solutions in clinical environments (https://www.nhsx.nhs.uk/ai-lab/explore-all-resources/adopt-ai/a-buyers-guide-to-ai-in-health-and-care/). The first recommendation is to consider whether the problem to be solved is appropriate for an AI solution. Empirical tests of algorithm accuracy are highly recommended, as well as considerations of ethical implementation and whether the outputs will lead to benefit for downstream users. Regulatory requirements are considered along with the product's intended use and risk classification. In the USA, regulatory frameworks around software as a medical device have been designed to address many of the issues around safety, efficacy and performance; however, some work may still be needed to adapt them to modern AI methods [31]. In particular, care needs to be taken to separate the task specification from the algorithm specification and to ensure that the task specification is based on consensus and standards defined by the community of practitioners. Also, the algorithm should be auditable to enable periodic monitoring of quality.
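To make the idea of a "Model Facts Label" concrete, a minimal sketch is given below of how such structured reporting metadata might be captured alongside a model. All field names and values are illustrative assumptions, not the schema proposed in [25].

```python
# Illustrative "model facts" record; the fields loosely mirror the kind of
# information reporting guidelines ask for (intended use, cohort, metrics).
model_facts = {
    "name": "lv_segmentation_net",             # hypothetical model name
    "intended_use": "LV cavity segmentation on short-axis cine CMR",
    "not_intended_for": ["congenital heart disease", "paediatric scans"],
    "training_data": {
        "source": "single-centre retrospective cohort",  # provenance matters
        "n_subjects": 1200,
        "demographics_reported": True,
    },
    "evaluation": {
        "external_cohort": True,               # tested on an unseen centre
        "metric": "Dice",
        "performance": 0.91,
        "subgroup_analysis": ["sex", "ethnicity"],
    },
    "version": "1.2.0",
    "last_audit": "2022-10-01",                # supports periodic monitoring
}

def check_indication(facts: dict, indication: str) -> bool:
    """Refuse off-label use: a crude guard inspired by 'on-label' terminology."""
    return indication not in facts["not_intended_for"]

assert check_indication(model_facts, "adult cardiomyopathy")
```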

9.7

Mechanistic Models for AI Interpretability

In Chap. 8, in the section Current Challenges and the Importance of Interpretability, we discussed the role of interpretability in AI. The limited interpretability of many AI methods is an impediment to their application in clinical practice. Although engineers understand well the function of each building block of these models, so they are not strictly "black boxes", it is often unclear how the data led to the result obtained. If the algorithm cannot explain its predictions, it is difficult to explain to patients why additional procedures are warranted, or why no action is being taken. A similar problem is faced in radiomics, in which large numbers of image-derived features are used to predict risk or detect disease [32]. However, the biological significance of these features is often unclear, and problems with reproducibility and generalizability have been recognized, with a corresponding call for prospective studies of impact on outcomes. One way of providing a biological basis for features derived from AI methods is to interpret results with computational models of physiology and function [33]. The idea of a digital twin can help in this regard, in which clinical data are integrated with mechanistic and statistical models to enable individual customization of therapy. Mechanistic models of shape and function enable deductive analyses to be integrated with data-driven inductive analyses to create and maintain a digital twin of the patient [34]. This approach may mitigate well-known problems with the application of precision medicine using AI to identify individual therapies [15]. Firstly, causal inference is difficult to achieve with current AI methods, since observational studies lack the ability to both predict likely outcomes and predict counterfactual scenarios. Secondly, statistical models (including most machine learning models) cannot on their own predict outcomes for individuals. Mechanistic models can provide the link between observational analysis and mechanistic analysis, refining and testing hypotheses generated by data-driven inductive processes. For example, AI algorithms have identified particular variants of ECG waveforms using clustering methods [35], which may be linked with ion channel or structural abnormalities using mechanistic models. Mechanistic models can also aid model robustness by generating data using forward simulations, giving a broader range of distributions than would typically be available in the clinic. Biomechanics-informed regularization can constrain the network towards physically plausible solutions [36]. Quantification of variability in a cohort also enables evaluation of mechanisms of risk through mechanistic models. In congenital heart disease, for example, adverse outcomes are linked with multidimensional biventricular shape changes in complex ways [37]. Mechanistic simulations may also aid "model patching", a data augmentation method proposed to improve performance in subgroups with limited training data through generative models [38]. A combination of mechanistic and machine learning analyses may improve patient evaluation by exploiting multidimensional information in ways not obvious to human observers [39, 40]. In both mechanistic and statistical models, quantification of the uncertainty in the results is essential to understanding the limitations of particular datasets in constraining the parameters of the solution [41].
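As a flavour of how physics-based knowledge can be injected into a network, the sketch below adds a physics-inspired penalty to a standard data term, in the spirit of the biomechanics-informed regularization of [36]. It is a generic toy (PyTorch, with a crude roughness penalty standing in for a real biomechanical model), not the actual method of that paper.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))

def physics_penalty(displacement: torch.Tensor) -> torch.Tensor:
    # Toy stand-in for a biomechanical prior: penalize differences between
    # consecutive predictions in the batch (a real implementation would use
    # spatial derivatives of the displacement field, e.g. incompressibility).
    return ((displacement[1:] - displacement[:-1]) ** 2).mean()

points = torch.rand(128, 2)                    # sampled myocardial points
target = torch.zeros(128, 2)                   # observed displacements (dummy)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
lam = 0.1                                      # weight of the physics term
for _ in range(100):
    opt.zero_grad()
    pred = net(points)
    loss = nn.functional.mse_loss(pred, target) + lam * physics_penalty(pred)
    loss.backward()
    opt.step()
```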

9.8

Utility of Community-Led Challenges

To avoid a replication crisis in AI-assisted cardiology, there must be a concerted effort to establish standards for reproducibility and transparency similar to those advocated for other data science domains [42]. Open science practices are essential for the development of data science applications, but are particularly important in medical applications of AI [43]. Descriptions of algorithms should include sufficient detail to support reproducibility and transparency, ideally with code and example datasets [44]. Community-led "challenges" provide a mechanism for exploring these issues. A challenge is an open competition or benchmark exercise in which several participants test algorithms on a standard dataset. Challenges should provide a fair and direct comparison of different methodological solutions to a common problem, and are useful in establishing the current state of the art. Good challenges foster reproducibility in data science, highlight current gaps and areas for improvement, and advance the field by exposing open issues and providing valuable datasets and tools to the community. Recent efforts to standardize challenge design and reporting can improve the quality of such benchmark exercises [45, 46]. However, care should be taken when interpreting the results of AI challenges. The design of many challenges admits a large number of participants, with an increasing chance of the "winner" obtaining the best evaluation metric by chance ("overfitting of the crowds", see https://laurenoakdenrayner.com/2019/09/19/ai-competitions-dont-produce-useful-models/). Also, the challenge design or evaluation metrics may not be appropriate to the intended use case, or the data used may not match the patients to whom the results would be applied [47]. As these issues become more widely understood, it is clear that challenges are of substantial benefit to the scientific community: they rapidly advance the field and engage a wider range of data scientists than might have been involved otherwise.
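The "overfitting of the crowds" effect is easy to simulate: if many teams have the same true accuracy, the winning score on a finite test set systematically overstates that accuracy. The sketch below is a self-contained illustration (numpy, all numbers invented), not an analysis of any real challenge.

```python
import numpy as np

rng = np.random.default_rng(42)
true_accuracy = 0.80     # every team is equally good in reality
n_test = 500             # size of the hidden test set
n_teams = 200            # number of challenge participants

# Each team's observed score is binomial noise around the true accuracy.
scores = rng.binomial(n_test, true_accuracy, size=n_teams) / n_test

print(f"true accuracy:          {true_accuracy:.3f}")
print(f"mean observed score:    {scores.mean():.3f}")   # roughly unbiased
print(f"winning observed score: {scores.max():.3f}")    # inflated by selection
```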

9.9

Closing Remarks

AI methods are a natural progression of data science and statistical modelling, and can be seen as a further step in the evolution of human efforts to solve complex problems. However, pitfalls identified in previous statistical data science applications are amplified in AI. Combination with mechanistic modelling may offer much-needed interpretability and lead to new understanding of biological mechanisms. This will lead to greater generalization and transparency, more trustworthy models and more successful transfer to the clinic.

Acknowledgements AAY, SEP and PL acknowledge support of the UKRI London Medical Imaging Artificial Intelligence Centre for Value Based Healthcare (Grant No. 104691), and the Wellcome/EPSRC Centre for Medical Engineering (Grant No. WT203148/Z/16/Z). AAY is supported by National Institutes of Health grant R01HL121754. PL acknowledges the British Heart Foundation (Grant No. PG/16/75/32383) and a Wellcome Trust Senior Research Fellowship (Grant No. 209450/Z/17/Z). SEP acknowledges support from the National Institute for Health and Care Research Barts Biomedical Research Centre. SEP has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825903 (euCanSHare project). SEP acknowledges support from the "SmartHeart" EPSRC programme grant (www.nihr.ac.uk; EP/P001009/1).

References


1. Barish M, Bolourani S, Lau L, Shah S, Zanos T. External validation demonstrates limited clinical utility of the interpretable mortality prediction model for patients with COVID-19. Nat Mach Intell. 2020.
2. Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev. 2014;1(2):293−314.
3. Dennett D. Darwin's dangerous idea. Penguin; 1995.
4. Panayides A, Amini A, Filipovic N, Sharma A, Tsaftaris S, Young A, Foran D, Do N, Golemati S, Kurc T, Huang K, Nikita K, Veasey B, Zervakis M, Saltz J, Pattichis C. AI in medical imaging informatics: current challenges and future directions. IEEE J Biomed Health Inform. 2020;24(7):1837−57.
5. Hayward R. VOMIT (victims of modern imaging technology)-an acronym for our times. British Med J. 2003;326:1273.
6. Chilamkurthy S, Ghosh R, Tanamala S, Biviji M, Campeau N, Venugopal V, Mahajan V, Rao P, Warier P. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet. 2018;392(10162):2388−96.
7. Dewey M, Schlattmann P. Deep learning and medical diagnosis. Lancet. 2019;394(10210):1710−1.
8. Verbakel J, Steyerberg E, Uno H, De Cock B, Wynants L, Collins G, Van Calster B. ROC curves for clinical prediction models part 1. ROC plots showed no added value above the AUC when evaluating the performance of clinical prediction models. J Clin Epidemiol. 2020;126:207−16.
9. Oakden-Rayner L, Palmer L. Docs are ROCs: a simple off-the-shelf approach for estimating average human performance in diagnostic studies. arXiv; 2020.
10. McKinney SM, Sieniek M, Shetty S, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89−94.
11. Salim M, Wahlin E, Dembrower K, Azavedo E, Foukakis T, Liu Y, Smith K, Eklund M, Strand F. External evaluation of 3 commercial artificial intelligence algorithms for independent assessment of screening mammograms. JAMA Oncol. 2020;6(10):1581−8.
12. Paleyes A, Urma R-G, Lawrence N. Challenges in deploying machine learning: a survey of case studies. ACM Comput Surv. 2022;55:1−29.
13. Beede E, Baylor E, Hersch F, Iurchenko A, Wilcox L, Ruamviboonsuk P, Vardoulakis L. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proc. 2020 CHI conference on human factors in computing systems; 2020.
14. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, Mahendiran T, Moraes G, Shamdas M, Kern C, Ledsam JR, Schmid MK, Balaskas K, Topol EJ, Bachmann LM, Keane PA, Denniston AK. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1:e271−97.
15. Wilkinson J, Arnold K, Murray E, van Smeden M, Carr K, Sippy R, de Kamps M, Beam A, Konigorski S, Lippert C, et al. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health. 2020;2(12):e677−80.
16. Nagendran M, Chen Y, Lovejoy C, Gordon A, Komorowski M, Harvey H, Topol E, Ioannidis J, Collins G, Maruthappu M. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.
17. Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intell Syst. 2009;24(2):8−12.
18. Bai W, Sinclair M, Tarroni G, Oktay O, Rajchl M, Vaillant G, Lee A, Aung N, Lukaschuk E, Sanghvi M, et al. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. J Cardiovasc Magn Reson. 2018;20(1):65.
19. Suinesiaputra A, Medrano-Gracia P, Cowan BR, Young AA. Big heart data: advancing health informatics through data sharing in cardiovascular imaging. IEEE J Biomed Health Inform. 2015;19(4):1283−90.

20. Puyol-Anton E, Ruijsink B, Piechnik SK, Neubauer S, Petersen SE, Razavi R, King AP. Fairness in cardiac MR image analysis: an investigation of bias due to data imbalance in deep learning based segmentation. In: Proceedings of medical image computing and computer-assisted intervention (MICCAI); 2021.
21. Belkin M, Hsu D, Ma S, Mandal S. Reconciling modern machine-learning practice and the classical bias−variance trade-off. Proc Natl Acad Sci. 2019;116(32):15849−54.
22. D'Amour A, Heller K, Moldovan D, et al. Underspecification presents challenges for credibility in modern machine learning. J Mach Learn Res. 2022;23:1−61.
23. Wu H, Zhang H, Karwath A, Ibrahim Z, Shi T, Zhang X, Wang K, Sun J, Dhaliwal K, Bean D, Cardoso VR, Li K, Teo JT, Banerjee A, Gao-Smith F, Whitehouse T, Veenith T, Gkoutos GV, Wu X, Dobson R, Guthrie B. Ensemble learning for poor prognosis predictions: a case study on SARS-CoV-2. J Am Med Inform Assoc. 2020.
24. Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T. Model cards for model reporting. In: Proc. conference on fairness, accountability, and transparency (FAT* '19); 2019. p. 220−9.
25. Sendak M, Gao M, Brajer N, Balu S. Presenting machine learning model information to clinical end users with model facts labels. NPJ Digit Med. 2020;3:41.
26. Ibrahim H, Liu X, Rivera S, Moher D, Chan A, Sydes M, Calvert M, Denniston A. Reporting guidelines for clinical trials of artificial intelligence interventions: the SPIRIT-AI and CONSORT-AI guidelines. Trials. 2021;22(1):11.
27. Harvey H, Oakden-Rayner L. Guidance for interventional trials involving artificial intelligence. Radiol Artif Intell. 2020.
28. Mongan J, Moy L, Kahn C. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029.
29. Collins G, Moons K. Reporting of artificial intelligence prediction models. Lancet. 2019;393(10181):1577−9.
30. Sengupta PP, Shrestha S, Berthon B, Messas E, Donal E, Tison GH, Min JK, D'hooge J, Voigt J-U, Dudley J, Verjans JW, Shameer K, Johnson K, Lovstakken L, Tabassian M, Piccirilli M, Pernot M, Yanamala N, Duchateau N, Kagiyama N, Bernard O, Slomka P, Deo R, Arnaout R. Proposed requirements for cardiovascular imaging-related machine learning evaluation (PRIME): a checklist: reviewed by the American College of Cardiology Healthcare Innovation Council. JACC Cardiovasc Imaging. 2020;13(9):2017−35.
31. Larson DB, Harvey H, Rubin DL, Irani N, Tse JR, Langlotz CP. Regulatory frameworks for development and evaluation of artificial intelligence-based diagnostic imaging algorithms: summary and recommendations. J Am Coll Radiol. 2021;18:413−24.
32. Pinto Dos Santos D, Dietzel M, Baessler B. A decade of radiomics research: are images really data or just patterns in the noise? Eur Radiol. 2021;31(1):1−4.
33. Corral-Acero J, Margara F, Marciniak M, Rodero C, Loncaric F, Feng Y, Gilbert A, Fernandes JF, Bukhari HA, Wajdan A, Martinez MV, Santos MS, Shamohammdi M, Luo H, Westphal P, Leeson P, DiAchille P, Gurev V, Mayr M, Geris L, Pathmanathan P, Morrison T, Cornelussen R, Prinzen F, Delhaas T, Doltra A, Sitges M, Vigmond EJ, Zacur E, Grau V, Rodriguez B, Remme EW, Niederer S, Mortier P, McLeod K, Potse M, Pueyo E, Bueno-Orovio A, Lamata P. The 'Digital Twin' to enable the vision of precision cardiology. Eur Heart J. 2020;41(48):4556−64.
34. Lamata P. Teaching cardiovascular medicine to machines. Cardiovasc Res. 2018;114(8):e62−4.
35. Lyon A, Ariga R, Mincholé A, Mahmod M, Ormondroyd E, Laguna P, de Freitas N, Neubauer S, Watkins H, Rodriguez B. Distinct ECG phenotypes identified in hypertrophic cardiomyopathy using machine learning associate with arrhythmic risk markers. Front Physiol. 2018;9:213.
36. Qin C, Wang S, Chen C, Qiu H, Bai W, Rueckert D. Biomechanics-informed neural networks for myocardial motion tracking in MRI. In: Medical image computing and computer assisted intervention—MICCAI 2020. Springer; 2020. p. 296−306.
37. Forsch N, Govil S, Perry JC, Hegde S, Young AA, Omens JH, McCulloch AD. Computational analysis of cardiac structure and function in congenital heart disease: translating discoveries to clinical strategies. J Comput Sci. 2020:101211.

38. Goel K, Gu A, Li Y, Ré C. Model patching: closing the subgroup performance gap with data augmentation. arXiv; 2020.
39. Salehyar S, Forsch N, Gilbert K, Young AA, Perry JC, Hegde S, Omens JH, McCulloch AD. A novel atlas-based strategy for understanding cardiac dysfunction in patients with congenital heart disease. Mol Cell Biomech. 2019;16(3):179−83.
40. Suinesiaputra A, McCulloch AD, Nash MP, Pontre B, Young AA. Cardiac image modelling: breadth and depth in heart disease. Med Image Anal. 2016;33:38−43.
41. Chang KC, Dutta S, Mirams GR, Beattie KA, Sheng J, Tran PN, Wu M, Wu WW, Colatsky T, Strauss DG, Li Z. Uncertainty quantification reveals the importance of data variability and experimental design considerations for in silico proarrhythmia risk assessment. Front Physiol. 2017;8:917.
42. Nichols T, Das S, Eickhoff S, Evans A, Glatard T, Hanke M, Kriegeskorte N, Milham M, Poldrack R, Poline J, et al. Best practices in data analysis and sharing in neuroimaging using MRI. Nat Neurosci. 2017;20(3):299−303.
43. Haibe-Kains B, Adam G, Hosny A, Khodakarami F, Massive Analysis Quality Control Society Board of Directors, Waldron L, Wang B, McIntosh C, Goldenberg A, Kundaje A, et al. Transparency and reproducibility in artificial intelligence. Nature. 2020;586(7829):E14−E16.
44. Kitamura FC, Pan I, Kline TL. Reproducible artificial intelligence research requires open communication of complete source code. Radiol Artif Intell. 2020;2(4):e200060.
45. Maier-Hein L, Reinke A, Kozubek M, Martel AL, Arbel T, Eisenmann M, Hanbury A, Jannin P, Müller H, Onogur S, Saez-Rodriguez J, van Ginneken B, Kopp-Schneider A, Landman BA. BIAS: transparent reporting of biomedical image analysis challenges. Med Image Anal. 2020;66:101796.
46. Wiesenfarth M, Reinke A, Landman BA, Eisenmann M, Aguilera Saiz L, Cardoso MJ, Maier-Hein L, Kopp-Schneider A. Methods and open-source toolkit for analyzing and visualizing challenge results. Sci Rep. 2021;11:2369.
47. Maier-Hein L, Eisenmann M, Reinke A, Onogur S, Stankovic M, Scholz P, Arbel T, Bogunovic H, Bradley A, Carass A, Feldmann C, Frangi A, Full P, van Ginneken B, Hanbury A, Honauer K, Kozubek M, Landman B, März K, Maier O, Maier-Hein K, Menze B, Müller H, Neher P, Niessen W, Rajpoot N, Sharp G, Sirinukunwattana K, Speidel S, Stock C, Stoyanov D, Taha A, van der Sommen F, Wang C, Weber M, Zheng G, Jannin P, Kopp-Schneider A. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat Commun. 2018;9:5217.

Analysis of Non-imaging Data

Nicolas Duchateau, Oscar Camara, Rafael Sebastian and Andrew King

Contents
10.1 Introduction
10.2 Electrophysiology
10.3 ECG Analysis
10.4 Electronic Health Records
10.5 Closing Remarks
References

Authors’ contribution: • Main chapter: ND, OC, RS, AK. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Duchateau and A. P. King (eds.), AI and Big Data in Cardiology, https://doi.org/10.1007/978-3-031-05071-8_10


Learning Objectives
At the end of this chapter you should be able to:
10.A Explain the potential role of AI in analysis of electrophysiology data
10.B Describe some applications of AI-based analysis of electrocardiograms (ECGs) and outline some of the difficulties and challenges that must be addressed
10.C Explain how AI can be used in the analysis and automated production of electronic health records (EHRs)

10.1

Introduction

Imaging data play a central role in cardiology, and much of the recent research activity in AI for cardiology has focused on cardiac imaging. For this reason, imaging-based AI has been the main focus of this book. However, there are a range of other data sources that are of importance in clinical decision making in cardiology. In this chapter we review the most relevant of these, with a focus on the ways in which AI has been proposed to streamline and improve clinical workflows.

10.2


Electrophysiology

Cardiac electrophysiology deals with the diagnosis and treatment of the electrical function of the heart. In general, it involves the analysis of electrical phenomena by means of different sources of information such as the ECG, body surface potential maps (BSPMs), or the more invasive intracardiac catheter recordings. Its main area of work is the analysis and treatment of rhythm disorders (arrhythmias), which are managed by cardiac electrophysiologists, who acquire and analyze electrophysiology studies that aim to elucidate symptoms, evaluate abnormal ECGs and assess the present and future risk of arrhythmias. Among the different therapeutic options available for cardiac arrhythmia, we can highlight drug therapy, device implantation (pacemakers, implantable cardioverter-defibrillators or ICDs), and cardiac ablation (radiofrequency ablation, cryoablation). Due to the complexity of planning and optimizing cardiac therapies, several novel approaches and technologies have grown in popularity during the last decades to aid electrophysiologists. Among them, it is worth mentioning precision cardiology, which involves the construction of patient-specific representations of an individual heart to perform electrical simulations [1] (see Chap. 9, Sect. 9.7). In the area of cardiac electrophysiology, the advent of machine learning is having a major impact at different levels and in several applications, from the automatic interpretation of ECGs to basic research on arrhythmia mechanisms, both experimental and computational [2−4].


Precision Cardiology
The goal of precision cardiology is to come up with methods and tools that allow doctors to develop and provide personalized treatments to each individual, taking into account inter-individual variability. It is an innovative approach that aims at improving risk stratification and at identifying personalized management through targeted diagnostic and therapeutic strategies. This is perfectly represented by the concept of a 'digital twin' (see Chap. 9, Sect. 9.7), which aims to define patient-specific virtual hearts that dynamically integrate the clinical data acquired over time for an individual, combined with previous observations from experiments and multi-scale simulations [1]. Such a virtual model can be used to aid doctors in making diagnoses and prognoses, tailoring treatments to individual patients and making predictions of patient health evolution [5]. Biophysical simulations are successful at integrating multiscale, multiphysics information with the aim of uncovering mechanisms that can explain function [6]. For instance, a digital twin equipped with physics-based models could be used to predict the response of a patient to a specific medical device, such as a cardiac pacemaker, or even to personalize the configuration of the device to the patient's particular anatomical and functional properties (ventricular wall morphology, location of coronary veins, presence and location of scar tissue). Although, at first glance, the relationship between machine learning and multiscale biophysical simulations does not seem obvious, they can benefit from each other in a number of applications [7], such as the integration of physics-based knowledge in the form of governing equations (learning the underlying physics), or constraints to manage ill-posed problems (e.g. electrocardiographic imaging (ECGI) inverse problems) [8] or to handle sparse and noisy data [9]. Another important use of machine learning in precision cardiology is the definition of surrogate models that can predict the response of a complex biophysical model from a reduced number of clinical inputs; a sketch of this idea is given below. This is possible due to the ability of machine learning to reveal correlations between different features that can be exploited by biophysical models to, for instance, classify or stratify patients. Since creating a personalized model is time consuming and requires expert input and many different types of data, machine learning techniques such as transfer learning are good alternatives to make predictions in a fast and reliable way without the need to create a full detailed model from scratch [4].
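A minimal sketch of the surrogate modelling idea is shown below: a regressor is trained on input-output pairs generated by a (here, trivially faked) biophysical simulator, and then replaces the expensive simulation at prediction time. The "simulator", the parameter names and the choice of regressor are all invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def biophysical_simulation(params: np.ndarray) -> float:
    """Stand-in for an expensive multiscale simulation
    (inputs could be e.g. tissue conductivity, activation onset location)."""
    conductivity, onset_x, onset_y = params
    return np.sin(onset_x) * np.cos(onset_y) / (0.1 + conductivity)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))                   # sampled model parameters
y = np.array([biophysical_simulation(p) for p in X])   # slow step, run offline

surrogate = RandomForestRegressor(n_estimators=100).fit(X, y)

# At prediction time, the surrogate answers in milliseconds instead of hours.
new_patient_params = np.array([[0.4, 0.2, 0.7]])
print(surrogate.predict(new_patient_params))
```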

Machine Learning in Cardiac Computational Modeling
Digital twins must include the particular properties of an individual, so that simulations on the model are able to predict the outcome of antiarrhythmic treatments, or stratify patients. To build a digital twin, the first step is to reconstruct the patient-specific 3-D anatomy of the heart. For the geometry of the atria and ventricles, the use of deep learning based methods, and the proliferation of particular models such as the U-Net [10], has opened up new possibilities to build detailed models from clinical data with very little user interaction. However, if one wants to incorporate other physiological properties into the model to be able to perform biophysical simulations of cardiac electrophysiology, many additional
features have to be extracted from the patient's clinical records, imaging data, and electrophysiological measurements to personalize the model [11]. For instance, the underlying organization of cardiac tissue, the so-called fiber orientation, which determines the principal direction of the depolarization wavefront in cardiac tissue, has to be incorporated into the model, but cannot be obtained in vivo using imaging techniques. Physics-informed neural networks (PINNs) have been developed to learn properties such as the fiber orientation from in vivo anatomical maps (e.g. FiberNet [12]). PINNs are machine learning methods used to solve inverse problems governed by partial differential equations; thanks to the incorporation of physical laws into their loss functions, they typically do not need large amounts of labeled data to make accurate predictions [13] (a sketch of such a loss is given below). Other studies have focused on personalizing the parameters of a simplified electrical model, for example the activation onset location and tissue conductivity in patients who presented premature ventricular contractions, using kernel ridge regression [14]. In this work, the authors were able to personalize the cardiac electrophysiological model and predict new patient-specific pacing conditions. Biophysical simulations of the heart have also been used as tools to generate synthetic datasets that include detailed anatomical and electrical information to train machine learning systems for different applications [15−19]. Personalization of models to reproduce the electrical activation sequence of the heart is also an active area of research. Several sources of data have been employed to adapt the model to the patient, from electro-anatomical maps (EAMs, acquired invasively with a catheter) to BSPMs and ECGs. EAMs are created sequentially by acquiring random discrete samples, from different heart beats, scattered all over the heart's endocardial cavity. As a result, EAMs often present large errors and inconsistencies that can affect the decisions taken during a radiofrequency ablation intervention. Recently, a PINN (EikonalNet) has been proposed to overcome these limitations, imposing wave propagation dynamics on the estimated EAM and adding a quantification of the uncertainty [9]. This is possible thanks to the current understanding of the system, which can be used to constrain the design space using the known underlying wave propagation dynamics. In the same work, an active learning algorithm was proposed to guide the electrophysiologist in the data acquisition process during the intervention. Similar works have used machine learning to estimate the sequence of activation from motion patterns (using kernel ridge regression) [20], or directly from images (using least-squares SVM) [15], exploiting the relationship between electrical activation and mechanical contraction. Non-invasive ECGI has also been employed as a source of information to personalize cardiac electrophysiology models when combined with machine learning algorithms, such as the time-delay artificial neural network (TDANN) [8], transfer learning [21] or support vector regression (SVR) [22].
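To give a flavour of the PINN idea mentioned above, the sketch below penalizes the residual of a simple eikonal-type constraint (|dT/dx| × v = 1, relating activation time T to conduction velocity v) alongside a data term on sparsely measured points. It is a generic 1-D toy in PyTorch with invented values, not the formulation of FiberNet or EikonalNet.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
velocity = 0.5                                  # assumed conduction velocity

x_data = torch.tensor([[0.0], [1.0]])           # sparse measured points
t_data = torch.tensor([[0.0], [2.0]])           # measured activation times

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    data_loss = nn.functional.mse_loss(net(x_data), t_data)

    # Physics residual on random collocation points: |dT/dx| * v should be 1.
    x_col = torch.rand(64, 1, requires_grad=True)
    t_col = net(x_col)
    dT_dx = torch.autograd.grad(t_col.sum(), x_col, create_graph=True)[0]
    physics_loss = ((dT_dx.abs() * velocity - 1.0) ** 2).mean()

    (data_loss + physics_loss).backward()
    opt.step()
```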

Machine Learning in Cardiac Arrhythmia Mechanisms
Biophysical models can provide insight into the heart as a system at a high level of resolution and precision. They can systematically probe various pathological conditions and treatments, and they can do this faster, more cost effectively and go
beyond what is experimentally possible. The massive datasets produced by these simulations are suited to machine learning analysis to uncover hidden relationships between parameters. At the cellular level, machine learning has been employed in ion channel modeling to (i) predict functional changes in channels due to mutations [23]; (ii) identify the structure/function relationship in voltage-gated potassium channels [24]; or (iii) find relationships between the kinetic properties of ion channel recovery and the dynamics of arrhythmias [25]. It is also worth mentioning its application to investigating drug cardiotoxicity by predicting the hERG (ether-a-go-go-related gene) related cardiotoxicity of a given compound, which is a surrogate marker of pro-arrhythmic risk [26]. At the organ level, machine learning has been applied to the investigation of reentrant activity. In particular, Muimani et al. [27] developed a deep learning method (a CNN, see Chap. 3, Sect. 3.5) for the detection of unbroken and broken spiral waves, which are analogs of life-threatening cardiac arrhythmias, and their efficient elimination by targeted delivery of low amplitude current. Other studies have focused on predicting the effect of fibrosis density and entropy on the maintenance of reentrant drivers by using patient-specific computational models of the atria and SVMs with second degree polynomial kernels [28].
Machine Learning in Therapy Guidance
The combination of machine learning and digital twin technology could also be a powerful tool for therapy guidance, with a large potential to be transferred to electrophysiology labs. Currently, most common arrhythmias are treated by catheter-based ablation, which destroys the ability of cardiac tissue to trigger and conduct electrical signals, and can stop several types of arrhythmias, such as ventricular tachycardia (VT) or atrial fibrillation (AF). An important area of study that combines biophysical modeling and machine learning has focused on predicting the location of arrhythmic sources, such as ectopic foci or rotor drivers, in the atria and ventricles. In [16], an SVM classifier was built to determine non-invasively, from the virtual BSPM of a patient, the (region-based) location of the ectopic focus triggering the atrial tachycardia, with an accuracy of over 90%. Yang et al. [29] used CNNs to detect the exit site of post-infarction VT on the basis of the 12-lead ECG, which was subsequently validated by computer simulations. In [30], the use of sequential factorized autoencoders (a type of deep CNN) was proposed to find the location of VT exit sites, taking into account differences in the 12-lead ECG due to patient variability at the electrical (source of VT) and anatomical (heart anatomy) levels. Regarding ablation of AF, a large number of studies have aimed to predict ablation success or recurrence after ablation based on clinical recordings, analyzing the 12-lead ECG, the patient's anatomy, or the distribution of fibrosis. Computer simulations on patient-specific geometries, including fibrosis segmented from LGE-MRI, were conducted to pre-operatively predict recurrence of AF after ablation together with a machine learning based classifier [31].


Limitations
Although there are big expectations and much optimism about the potential applications of machine learning techniques to physics-based modeling, it is important to be aware of their limitations. In general, machine learning techniques identify correlations but are agnostic as to causality, while multiscale modeling can find causal mechanisms. Besides, it is very common to see cases in which machine learning systems do not generalize well, i.e. the system is not really learning from the samples but memorizing them (the model overfits, see Chap. 2, Sect. 2.5). In addition, many studies assume that the distributions of the training and the test data are the same, which may not be true. Finally, another recurrent problem is class imbalance, i.e. a particular class is over-represented compared to the others.
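Class imbalance, the last of these pitfalls, is often cheap to diagnose and partially mitigate, as in the scikit-learn sketch below (simulated data; the re-weighting option shown is one common remedy, not the only one).

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 1.6).astype(int)        # rare positive class (~5%)
print(Counter(y))                      # first step: diagnose the imbalance

# Stratified split keeps the class ratio in both partitions;
# class_weight='balanced' re-weights errors on the minority class.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Plain accuracy is misleading here (predicting "all negative" scores ~95%);
# inspect per-class behaviour instead.
print(balanced_accuracy_score(y_te, clf.predict(X_te)))
```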

10.3

ECG Analysis

Transition to the Digital Era


The electrocardiogram (ECG) is a central tool in the assessment of a patient's condition and their follow-up. It is non-invasive and inexpensive compared to other devices, available in a large variety of clinical environments, and used by a wide array of healthcare professionals with varying levels of cardiology knowledge, an important point which can hamper ECG interpretation. Over recent decades, computational techniques have substantially improved the quantification and analysis of ECG signals [32], and the use of machine learning has further increased the efficiency and robustness of these tasks [33, 34]. As access to data is key to developing high-performing machine learning algorithms, the entrance of the field of ECG analysis into the digital era has clearly boosted the use of machine learning models, as visible from publication registries (the query "ECG machine learning" in PubMed returns 350+ papers for 2021, against around 75 and 20 papers ten and twelve years previously) and continuously increasing industrial investment. However, the route to the digital world is not straightforward for ECG data. Many hospitals still rely on paper-printed ECG records, which requires addressing a large number of issues before their computational analysis: digitization of the printed records, extraction of the signals from the background, standardization of the traces, etc. Many efforts have been made to properly standardize the existing data, but heterogeneity between the proposed formats still hampers the interoperability of analysis tools [35−37]. Besides, even for a given data format, many differences can remain in the stored data, as illustrated in Fig. 10.1. For example, the duration and number of cardiac cycles considered actually depends on the underlying disease and the type of acquisition (for example, 12-lead ECG as opposed to Holter acquisitions). Given this context, it is evident that despite being 1-D, the computational analysis of ECG signals is not at all easier than that of 2-D images.


Fig. 10.1 Variability in the ECG signals from the CPSC2018 database [42] used in the PhysioNet 2020 challenge [41], consisting of 12-derivation signals from 6877 subjects. For visualization purposes, all signals were temporally resampled to 100 instants, with the beginning and end of the cycle normalized to 0 and 100% respectively, and averaged across all the cycles of a given subject. (a) V1 to V6 derivations (half of the cycle) for the 917 subjects labeled as normal sinus rhythm (the thick black trace corresponds to the average of all signals). Despite all being "normal", we observe a large variability in the QRS complex amplitude, and in the timing and amplitude of the T wave. (b) Comparison of normal sinus rhythm and atrial fibrillation subjects based on two features extracted from the ECG signals using standard signal processing. Two clusters are easily visible, indicating that these two features may be enough to classify most subjects, but the presence of some subjects near the other cluster indicates that more advanced features or signal analyses are required to improve diagnosis. (c) Representative ECG (average across a subgroup, V6 derivation displayed) for five subgroups, for which specific QRS and T wave changes are visible depending on the subgroup, motivating the use of more sophisticated analysis of ECG patterns

One also needs to remember that, given the scarcity of large standardized databases of digital ECG signals, such computational analysis started long before the advent of machine learning, with many efforts towards community-based postprocessing tools using standard signal processing [32]. Among these, popular methods largely relied on smart signal analysis (e.g. wavelet-based methods that are able to represent the multi-scale structure of signals) [38] and generic-but-relevant decisions (rule- or threshold-based, using relevant features extracted from the signals). Highly curated databases have now started to emerge (see database reviews in [34, 39, 40]) to drive the whole community around data analysis challenges [41], which open the path to applying machine learning models but also to comparing them on common ground truth data. Machine learning naturally has the potential to move this automated analysis forward, with methods better suited to the data under study. In the following, we discuss how two main tasks of ECG analysis are handled: automatic feature extraction and automatic diagnosis. We pay specific attention to issues that highly condition the performance of machine learning, such as the database size, the quality and variability of annotations, and the interpretability of the results.

Automatic Quantification
A first task for the computational analysis of ECGs with machine learning consists of automatic quantification, namely the automatic extraction of features of interest from the signals. As discussed in the Clinical Introduction to Chap. 4 (see Sect. 4.1), typical measurements from ECG signals consist of the onset/offset of each cycle, and complementary markers of the cardiac cycle such as the events of the QRS, P and T waves, and the durations of the cardiac phases that can be derived from these events, since they are biomarkers of different cardiac diseases (e.g. an enlarged QRS as a surrogate of electrical dyssynchrony, or an elevated ST segment for infarction). From a machine learning perspective, extracting these events can be formulated as a supervised problem, where the training labels come from ECG signals in which the events have been annotated by experts. Naturally, their identification may be more or less challenging depending on the quality of the signals, the derivation, and the disease under study. Testing machine learning models of different complexities and increasing the database size and richness are ways to address this, although the latter may not be possible in all situations. Although automatic diagnosis (discussed in the next subsection) attracts most of the attention in ECG analysis, several works have attempted to match or exceed the performance of standard ECG quantification methods by using machine learning. Convolutional neural networks (CNNs, see Sect. 3.5) are attractive compared to fully-connected networks (FCNs, see Sect. 4.3), as they use convolutions that both reduce the number of network connections (the number of parameters to optimize) and better take into account the structure of the input data (the spatial arrangement of pixels for images, and the temporal sequence of values for signals), as demonstrated on ECG data by, for example, [43, 44]. Inspired by its success in image segmentation tasks, a variant of the U-net architecture has recently been adapted for ECG quantification [45−48]. Another branch of works has considered recurrent neural networks (RNNs, see Sect. 4.3), which are tailored for analyzing temporal sequences of data, and in particular the long short-term memory (LSTM) architecture, which partially addresses some computational issues of RNNs [49, 50].
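Before reaching for deep learning, a classical signal-processing baseline for one such feature (R-peak detection, from which heart rate and cycle onsets follow) fits in a few lines. The sketch below uses scipy on a crude synthetic quasi-periodic signal; the thresholds and distances are illustrative assumptions and would need tuning on real, noisier ECGs.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 500                                     # sampling frequency (Hz), assumed
t = np.arange(0, 10, 1 / fs)                 # 10 s of signal
# Crude synthetic "ECG": sharp periodic spikes on a noisy baseline.
ecg = np.exp(-((t % 0.8 - 0.4) ** 2) / 0.0005) + 0.05 * np.random.randn(t.size)

# R peaks: prominent maxima, at least 0.4 s apart (i.e. under 150 bpm here).
peaks, _ = find_peaks(ecg, prominence=0.5, distance=int(0.4 * fs))

rr_intervals = np.diff(peaks) / fs           # seconds between beats
print(f"detected {peaks.size} beats, mean heart rate "
      f"{60 / rr_intervals.mean():.0f} bpm")
```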

Automatic Diagnosis
Once the features of interest have been extracted, these can be fed into subsequent models for the characterization of populations (e.g. examining statistical differences between two subgroups), or for automatic diagnosis. In theory, as for 2-D images,
neural networks may also offer an all-in-one approach that avoids the need to extract pre-identified (i.e., 'hand-crafted') features from the signals, and instead performs both feature extraction and diagnosis at once. However, more complex models mean many parameters to optimize, and present a risk of overfitting if not enough data are available, which can be critical for ECG signals due to their potentially large variability and the limited number of well-curated databases for training. Thus, for automatic diagnosis, the use of a separate feature extraction step can be a way to reach more powerful and simplified data representations based on expert prior knowledge. Given the potential number of features and their partial redundancy, feature extraction can be coupled with dimensionality reduction to reach more robust representations for use by machine learning models for automatic diagnosis. Given the abundance of publications on this topic, we refer the interested reader to reviews of the literature addressing this question [39, 51], including some specific to deep learning [34, 40, 52, 53], which mostly rely on CNN and RNN architectures. We include a brief summary of this body of work below. A first group of works focuses on heart beat classification, for which very high performance (more than 95% accuracy) has been achieved in much of the recent literature. A second group of works is aimed at automatic diagnosis of patients based on complete ECG recordings; the performance of these methods highly depends on the disease. This is clearly illustrated in the 2020 PhysioNet challenge [41], which provided 66,405 ECG recordings (43,101 with labels for training) and evaluated the results from 217 teams who attempted to automatically classify the ECG signals. Interestingly, the organizers designed a specific metric to compare the outputs of the competitors, using a reward process that softens some misdiagnoses depending on the severity of the disease or the potentially different labelling of variants of a disease (e.g. "Complete right bundle branch block" vs. "Right bundle branch block"). A more recent paper focused on the PTB-XL database [54], which was part of the 2020 PhysioNet challenge, and provided a complementary view of deep learning methods for diagnosis on this database, with an interesting hierarchical organization of the diagnostic labels and some insights into the uncertainty and interpretability of such models.
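The all-in-one approach mentioned at the start of this subsection typically looks like the PyTorch sketch below: a small 1-D CNN that maps a raw single-lead ECG segment directly to class scores. The layer sizes, segment length and number of classes are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

class ECGNet(nn.Module):
    """Tiny 1-D CNN: raw ECG segment in, diagnostic class scores out."""
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),       # length-independent pooling
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).squeeze(-1))

model = ECGNet()
segment = torch.randn(8, 1, 2000)          # batch of 4 s segments at 500 Hz
logits = model(segment)                    # shape: (8, 5)
print(logits.shape)
```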

Current Open Questions
As briefly summarized above, there are reasons to believe that high-performing machine learning-based analysis of ECGs will become a reality for several applications in the near future, with the proviso that learning to cope with real-world data may present challenges. As most of the methods involve supervised learning, the availability of large datasets with high quality annotations is crucial. The uncertainty in the manual ECG annotations from a single expert can already be dramatic, and consensus in the annotation of events by different experts may be hard to reach [55, 56]. In addition, carefully and consistently annotating large series of signals is not feasible on
any local database. The scarcity of well-annotated databases probably explains why so much of the focus is on the classification of ECG signals, and much less on delineation and feature extraction. One promising area for future work lies in the generation of realistic synthetic data, which by definition come with ground truth annotations. This strategy has been successfully demonstrated in computer vision [57] and medical imaging [58] applications, and has started to be adapted to electrophysiology data [59, 60]. Developers and users of machine learning tools also need to keep in mind that 1-D (i.e. signals) does not necessarily mean simpler than 2-D (i.e. images). There exists a lot of variability in the signals due to noise, acquisition factors, or disease, which makes the detection of subtle events very challenging. Besides, the temporal dimension contains much of the useful information in ECG analysis, where several cycles are often considered, compared to image analysis where a single image (for static data) or a single cardiac cycle (for temporal sequence analysis) is generally considered representative of the patient under study. Also, although experienced users of neural networks tend to understand the role of subparts of the network and specific architecture choices, the path to the decision taken by the network is still hard to interpret. As described in other parts of this book (see footnote 2 in Chap. 2, section Data Descriptors), interpretability is crucial for the transfer of these technologies to the clinic, and this issue has started to be addressed by the machine learning community. A simple approach is to produce attention maps that highlight the specific regions of the signal that led to the decision. For ECG analysis, this has been demonstrated on 2-D pictures of ECG signals, thereby borrowing the concept of attention maps from 2-D CNNs and image analysis [61], and on actual 1-D signals [54]. Despite these issues, the advent of machine learning brings many hopes to the field of ECG analysis. Data analysis challenges will likely play an important role in realising these hopes, since they provide well-curated large databases and a specific question to address each year. They also serve to closely follow the evolution of the state of the art and to compare existing methods in a standardized manner. In this sense, the annual PhysioNet/Computing in Cardiology challenges are to be commended, as they encourage a focus on the evaluation of performance specifically for ECG applications. In addition, their recent editions [41] comply with the good practices highlighted in recent meta-analyses of health data challenges [62]. This is surprisingly not the case for many data analysis competitions, although it should be of prime importance given the trend to use such public databases for local training or even for validation purposes. This is especially important given the amount of industrial investment in ECG analysis solutions, in particular for very precise applications such as diagnosing atrial fibrillation, but also the widening spectrum of available signals (e.g. from wearables, smart watches, etc.), which bring complementary memory and speed issues that researchers will need to address.
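A gradient-based saliency map, one simple form of the attention maps just mentioned, can be obtained from any differentiable classifier, as in the PyTorch sketch below (the model is a one-layer stand-in; everything here is illustrative).

```python
import torch
import torch.nn as nn

# Any differentiable 1-D classifier works; a minimal stand-in here.
model = nn.Sequential(nn.Conv1d(1, 8, 7, padding=3), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 5))
model.eval()

signal = torch.randn(1, 1, 2000, requires_grad=True)  # one ECG segment
class_score = model(signal)[0, 2]          # score of the class of interest
class_score.backward()

# Saliency: magnitude of the gradient of the class score w.r.t. each sample.
saliency = signal.grad.abs().squeeze()
print("most influential samples:", torch.topk(saliency, 5).indices.tolist())
```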

10.4

Electronic Health Records

Transition to the Digital Era
Due to the boom in electronic devices and analysis techniques such as AI, but also the wide adoption of digital technologies within hospitals, the use of Electronic Health Records (EHRs) has drastically increased in the last decade. They encompass a centralized collection of a patient's data, followed over time through hospital visits or remote monitoring, and facilitate analyses and reporting at the scale of an individual patient or at the population level [63]. The transition to digital technologies and big data raises a question that is not specific to EHRs, but is certainly shared by their users: how to properly manage this data deluge, a question which covers issues around the acquisition, storage, maintenance of, and access to these data. More specific to EHRs, moving to such data requires a careful digitization of handwritten notes and voice records, for which a first set of AI methods from computer vision and speech processing are very relevant. This process is generally seen as supervised, in the sense that the inputs are mapped or tagged to given categories or values through classification or regression (for example, the concept "heart failure" written several times in a report should be tagged as a single "heart failure" item, which means recognizing the words "heart" and "failure", and considering them jointly). Natural Language Processing (NLP) is a family of methods that are highly relevant for examining structured texts and enabling the machine to "understand" them. Among the AI methods it relies on, recurrent neural networks (RNNs) are suited to data that are sequentially ordered (typically, the text in a written document), as they can include long-range dependencies between the feature representations (the hidden states of the RNN). As nicely summarized in [64], such methods should not only address the extraction of single concepts, but also be able to spot the temporality of these events (namely, how to convert a sometimes vague period of time into data that can be analyzed) and the relations in the text (for example, causes or conditions). One important issue is that the huge amount of information contained in EHRs is currently insufficiently standardized across hospitals, clinicians, diseases, successive visits, etc. The EHR scientific community is progressively moving towards more standardized formats, as was done previously with the DICOM format for medical images. For this purpose, public datasets are of high value as they structure the community around a common task or challenge. A widely recognized example is the MIMIC-III dataset (Medical Information Mart for Intensive Care) [65], which consists of de-identified EHRs from around 60K intensive care unit admissions. To further structure the contents of these data, the combination of NLP and ontologies defined a priori can be useful, although these may be challenging to define. A broader view can also be adopted to better exploit the available data. For medical images this means, for example, considering both the image contents and the associated metadata, either from the DICOM file or available in the EHR, as explicitly reviewed in [66].
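As a deliberately simplified flavour of the concept-extraction step described above (real systems use NLP pipelines and ontologies, not keyword matching), consider the following sketch; the vocabulary and the report text are invented.

```python
import re

# Toy ontology: surface forms mapped to a canonical concept.
CONCEPTS = {
    "heart failure": "heart_failure",
    "atrial fibrillation": "atrial_fibrillation",
    "ejection fraction": "ejection_fraction",
}

report = ("Patient with known heart failure. Heart failure symptoms worsening; "
          "ejection fraction measured at 35%.")

found = set()
for surface, concept in CONCEPTS.items():
    # Case-insensitive match; repeated mentions collapse to a single item.
    if re.search(surface, report, flags=re.IGNORECASE):
        found.add(concept)

print(sorted(found))   # ['ejection_fraction', 'heart_failure']
```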


Disease Perspective
Once the information has been extracted and structured, AI techniques, and in particular machine learning and deep learning, are now a "must-have" for the analysis. As for images and signals, machine learning can be used to address many challenges related to disease analysis with EHRs [64, 67], both in a retrospective and a prospective way:
• Detection [68−70]: for example, diagnosis (supervised) or abnormality detection (which can be unsupervised). See the overview in Chap. 5.
• Prediction [71−74]: for example, prognosis or the evolution of specific values (either using a single timepoint, or through methods that explicitly address the temporality of events, such as RNNs or regression models). See the overview in Chap. 6.
• Phenotyping [75, 76] (partially discussed in Chap. 8), to discover new concepts or confront existing ones with the data, for which unsupervised learning techniques are interesting as they can aggregate patients with similar data or conditions (clustering, as sketched below) or highlight the main characteristics of a dataset.
• Better representing a dataset [77−80], which encompasses the previous item, and for which a specific family of representation learning algorithms exists [81], either using classical machine learning (manifold learning) or neural networks (autoencoders, see Chap. 4, Sect. 4.3, and Chap. 5, Sect. 5.3). The review in [64] nicely distinguishes between the objective of representing a medical concept across a population, and that of representing the data associated with a single patient.
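The phenotyping item above can be illustrated with a minimal clustering sketch (scikit-learn; the three EHR-like features are invented, and real phenotyping work involves far more careful preprocessing and validation).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy EHR features per patient: age, ejection fraction (%), NT-proBNP (pg/mL).
patients = np.column_stack([
    rng.normal(65, 10, 300),
    rng.normal(50, 12, 300),
    rng.lognormal(6, 1, 300),
])

# Scaling matters: the features live on very different scales.
X = StandardScaler().fit_transform(patients)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for k in range(3):
    print(f"phenotype {k}: n={np.sum(labels == k)}, "
          f"mean EF={patients[labels == k, 1].mean():.1f}%")
```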


Nonetheless, users should carefully balance the sophistication of the techniques used against the actual gain for the medical application. Indeed, a recent review [82] analyzed the evolution of algorithm performance in longitudinal EHR studies, and found that neural networks have not necessarily brought a clear gain in recent years. This observation has to be tempered against the potentially increasing complexity of the databases, but the authors remind us that the difficulty of the medical questions and the variety of outcomes are clear bottlenecks for computational techniques using EHRs. They also argue for better standardization and organization in the EHR scientific community, pointing to the good example of the ECG analysis community, which structured both the data formats and the feature extraction algorithms, and even set up databases with yearly data challenges (see http://physionetchallenges.github.io/).

Hospital and Patient Perspective
The wide spectrum of data covered by EHRs and their rather broad acceptance also open up new perspectives beyond disease-specific studies. Having patient records in a centralized and somewhat standardized format first benefits the management of resources by hospitals. In this context, AI techniques can be very valuable for comparisons and management at the scale of a whole hospital. When placed in a temporal perspective, they can go beyond the prediction of mortality and estimate the length of stay of patients [82]. Automated reporting
techniques are being developed in an attempt to speed up cumbersome processes for clinicians and hospital staff [83]; this can be seen as generating content in a structured way, and therefore involves AI generative models.

However, working at the scale of a whole hospital, or even a network of hospitals, brings additional challenges. Algorithms should target near-real-time access, or at least provide rapid information retrieval tools. Federated learning (see Chap. 5) is a framework that can be very useful to move beyond the limited point-of-view of a given hospital [84]: it involves training algorithms across multiple data warehouses without explicitly exchanging data. In the context of healthcare, this is highly desirable to develop more robust models with much better generalization ability, avoiding bias towards some populations and achieving better performance for rare diseases, while remaining safe in terms of privacy and security (a minimal sketch of the idea closes this section).

EHRs still come with many challenges around the standardization and fusion of heterogeneous and time-varying data. However, the dynamism of EHR analysis with AI opens up promising perspectives to better contextualize patients' data, including the exploitation of external factors that are available in EHRs but not necessarily included in current analyses, accompanied by much more regular follow-up and traceability that can benefit both patients and clinical institutions.
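To make the federated learning idea concrete, the following minimal sketch implements plain federated averaging with a simple linear model and synthetic data; the "hospitals", model, and learning rate are all illustrative, and real deployments would add secure aggregation and far richer models. Only model parameters travel between sites; the patient data never leave each hospital:

# Minimal sketch of federated averaging (FedAvg) across three sites.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

def make_site(n):
    """Synthetic private dataset for one hospital."""
    X = rng.normal(size=(n, 5))
    y = X @ w_true + rng.normal(0, 0.1, n)
    return X, y

def local_update(w, X, y, lr=0.1, epochs=5):
    """A few local gradient steps on one hospital's private data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # squared-error gradient
        w = w - lr * grad
    return w

hospitals = [make_site(n) for n in (100, 150, 80)]
w_global = np.zeros(5)
for _ in range(20):  # communication rounds
    local_ws = [local_update(w_global, X, y) for X, y in hospitals]
    sizes = [len(y) for _, y in hospitals]
    # The server aggregates a size-weighted average of the local models
    w_global = np.average(local_ws, axis=0, weights=sizes)

print(w_global.round(2))  # close to w_true, without pooling any data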

10.5 Closing Remarks

Cardiologists routinely make use of non-imaging data when making clinical decisions, so it seems inevitable that such data will play an important role in the future of AI in cardiology. In particular, some types of non-imaging data (e.g. ECGs and EHRs) are routinely and widely available, so techniques that better exploit the richness of these data would be very attractive for incorporating AI into current clinical workflows. However, non-imaging data sources are not immune from the difficulties and challenges associated with imaging data, such as standardization of formats, missing or corrupted data, and privacy concerns. These issues must be satisfactorily addressed before AI techniques based on non-imaging data sources can be translated into the clinic.

In addition, challenges will likely be faced when using AI to combine features learnt from imaging and non-imaging data. Such an approach mimics the way in which cardiologists consider multiple sources of information when making decisions about patient management, and so has great potential, but it does increase the complexity of the models and of the data curation process. These are not concerns to be taken lightly, and further work is required before AI can truly emulate the way in which cardiologists deal with such complexity in a seemingly effortless way.

Acknowledgements ND was supported by the French ANR (LABEX PRIMES of Univ. Lyon [ANR-11-LABX-0063] within the program “Investissements d'Avenir” [ANR-11-IDEX-0007], and the JCJC project “MIC-MAC” [ANR-19-CE45-0005]).

RS was supported by Generalitat Valenciana Grant AICO/2021/318 (Consolidables 2021) and Grant PID2020-114291RB-I00 funded by MCIN/10.13039/501100011033 and by “ERDF A way of making Europe”. AK was supported by the EPSRC (EP/P001009/1), the Wellcome/EPSRC Centre for Medical Engineering at the School of Biomedical Engineering and Imaging Sciences, King's College London (WT 203148/Z/16/Z) and the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare.

References


1. Corral-Acero J, Margara F, Marciniak M, Rodero C, Loncaric F, Feng Y, Gilbert A, Fernandes JF, Bukhari HA, Wajdan A, Martinez MV, Santos MS, Shamohammdi M, Luo H, Westphal P, Leeson P, DiAchille P, Gurev V, Mayr M, Geris L, Pathmanathan P, Morrison T, Cornelussen R, Prinzen F, Delhaas T, Doltra A, Sitges M, Vigmond EJ, Zacur E, Grau V, Rodriguez B, Remme EW, Niederer S, Mortier P, McLeod K, Potse M, Pueyo E, Bueno-Orovio A, Lamata P. The ‘‘Digital Twin’’ to enable the vision of precision cardiology. Eur Heart J. 2020; 41(48):4556−64. 2. Trayanova NA, Popescu DM, Shade JK. Machine learning in arrhythmia and electrophysiology. Circ Res. 2021; 128(4):544−66. 3. Nagarajan VD, Lee S-L, Robertus J-L, Nienaber CA, Trayanova NA, Ernst. Artificial intelligence in the diagnosis and management of arrhythmias. Eur Heart J. 2021; 42(38):3904−16. 4. Peng GC, Alber M, Buganza Tepole A, Cannon WR, De S, Dura-Bernal S, Garikipati K, Karniadakis G, Lytton WW, Perdikaris P, et al. Multiscale modeling meets machine learning: what can we learn?. Arch Comput Methods Eng. 2021; 28(3):1017−37. 5. Sánchez de la Nava AM, Atienza F, Bermejo J, Fernández-Avilés F. Artificial intelligence for a personalized diagnosis and treatment of atrial fibrillation. Am J Physiol Heart Circ Physiol. 2021; 320(4):H1337−47. 6. Chabiniok R, Wang VY, Hadjicharalambous M, Asner L, Lee J, Sermesant M, Kuhl E, Young AA, Moireau P, Nash MP, et al. Multiphysics and multiscale modelling, data-model fusion and integration of organ physiology in the clinic: ventricular cardiac mechanics. Interface Focus. 2016; 6(2):20150083. 7. Alber M, Buganza Tepole A, Cannon WR, De S, Dura-Bernal S, Garikipati K, Karniadakis G, Lytton WW, Perdikaris P, Petzold L, et al. Integrating machine learning and multiscale modeling— perspectives, challenges, and opportunities in the biological, biomedical, and behavioral sciences. NPJ Digit Med. 2019; 2(1):1−11. 8. Malik A, Peng T, Trew ML. A machine learning approach to reconstruction of heart surface potentials from body surface potentials. In: 40th annual international conference of the IEEE engineering in medicine and biology society (EMBC). IEEE; 2018. p. 4828−31. 9. Sahli Costabal F, Yang Y, Perdikaris P, Hurtado DE, Kuhl E. Physics-informed neural networks for cardiac activation mapping. Front Phys. 2020; 8:42. 10. Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng P, Cetin I, Lekadir K, Camara O, Ballester MAG, Sanroma G, Napel S, Petersen SE, Tziritas G, Grinias E, Khened M, Varghese A, Krishnamurthi G, Rohé M, Pennec X, Sermesant M, Isensee F, Jaeger P, Maier-Hein KH, Full PM, Wolf I, Engelhardt S, Baumgartner CF, Koch LM, Wolterink JM, Isgum I, Jang Y, Hong Y, Patravali J, Jain S, Humbert O, Jodoin P. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans Med Imaging. 2018; 37(11):2514−25. 11. Lopez-Perez A, Sebastian R, Ferrero JM. Three-dimensional cardiac computational modelling: methods, features and applications. Biomed Eng Online. 2015; 14(1):1−31. 12. Ruiz Herrera C, Grandits T, Plank G, Perdikaris P, Sahli Costabal F, Pezzuto S. Physics-informed neural networks to learn cardiac fiber orientation from multiple electroanatomical maps. Eng Comput. 2022; 38:3957−73.


13. Raissi M, Perdikaris P, Karniadakis GE. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J Comput Phys. 2019; 378:686−707. 14. Giffard-Roisin S, Jackson T, Fovargue L, Lee J, Delingette H, Razavi R, Ayache N, Sermesant M. Noninvasive personalization of a cardiac electrophysiology model from body surface potential mapping. IEEE Trans Biomed Eng. 2017; 64(9):2206−18. 15. Prakosa A, Sermesant M, Allain P, Villain N, Rinaldi CA, Rhode K, Razavi R, Delingette H, Ayache N. Cardiac electrophysiological activation pattern estimation from images using a patientspecific database of synthetic image sequences. IEEE Trans Biomed Eng. 2013; 61(2):235−45. 16. Ferrer-Albero A, Godoy EJ, Lozano M, Martínez-Mateu L, Atienza F, Saiz J, Sebastian R. Noninvasive localization of atrial ectopic beats by using simulated body surface P-wave integral maps. PLoS One. 2017;12(7): e0181263. 17. Costabal FS, Matsuno K, Yao J, Perdikaris P, Kuhl E. Machine learning in drug development: characterizing the effect of 30 drugs on the QT interval using Gaussian process regression, sensitivity analysis, and uncertainty quantification. Comput Methods Appl Mech Eng. 2019; 348:313−33. 18. Godoy EJ, Lozano M, García-Fernández I, Ferrer-Albero A, MacLeod R, Saiz J, Sebastian R. Atrial fibrosis hampers non-invasive localization of atrial ectopic foci from multi-electrode signals: a 3D simulation study. Front Physiol. 2018; 9:404. 19. Doste R, Sebastian R, Gomez JF, Soto-Iglesias D, Alcaine A, Mont L, Berruezo A, Penela D, Camara O. In silico pace-mapping: prediction of left vs. right outflow tract origin in idiopathic ventricular arrhythmias with patient-specific electrophysiological simulations. EP Eur. 2020; 22(9):1419−30. 20. Prakosa A, Sermesant M, Delingette H, Saloux E, Allain P, Cathier P, Etyngier P, Villain N, Ayache N. Non-invasive activation times estimation using 3D echocardiography. In: International workshop on statistical atlases and computational models of the heart. Springer; 2010. p. 212−21. 21. Giffard-Roisin S, Delingette H, Jackson T, Webb J, Fovargue L, Lee J, Rinaldi CA, Razavi R, Ayache N, Sermesant M. Transfer learning from simulations on a reference anatomy for ECGI in personalized cardiac resynchronization therapy. IEEE Trans Biomed Eng. 2018; 66(2):343−53. 22. Jiang M, Lv J, Wang C, Huang W, Xia L, Shou G. A hybrid model of maximum margin clustering method and support vector regression for solving the inverse ECG problem. In: Computing in cardiology. IEEE; 2011. p. 457−60. 23. Clerx M, Heijman J, Collins P, Volders PGA. Predicting changes to INa from missense mutations in human SCN5A. Sci Rep. 2018; 8(1):12797. 24. Li B, Gallin WJ. Computational identification of residues that modulate voltage sensitivity of voltage-gated potassium channels. BMC Struct Biol. 2005; 5:16. 25. Lawson BA, Burrage K, Burrage P, Drovandi CC, Bueno-Orovio A. Slow recovery of excitability increases ventricular fibrillation risk as identified by emulation. Front Physiol. 2018; 9:1114. 26. Wacker S, Noskov SY. Performance of machine learning algorithms for qualitative and quantitative prediction drug blockade of hERG1 channel. Comput Toxicol. 2018; 6:55−63. 27. Mulimani MK, Alageshan JK, Pandit R. Deep-learning-assisted detection and termination of spiral and broken-spiral waves in mathematical models for cardiac tissue. Phys Rev Res. 2020; 2(2): 023155. 28. 
Zahid S, Cochet H, Boyle PM, Schwarz EL, Whyte KN, Vigmond EJ, Dubois R, Hocini M, Haïssaguerre M, Jaïs P, et al. Patient-derived models link re-entrant driver localization in atrial fibrillation to fibrosis spatial pattern. Cardiovasc Res. 2016; 110(3):443−54. 29. Yang T, Yu L, Jin Q, Wu L, He B. Localization of origins of premature ventricular contraction by means of convolutional neural network from 12-lead ECG. IEEE Trans Biomed Eng. 2017; 65(7):1662−71. 30. Gyawali PK, Horacek BM, Sapp JL, Wang L. Sequential factorized autoencoder for localizing the origin of ventricular activation from 12-lead electrocardiograms. IEEE Trans Biomed Eng. 2019; 67(5):1505−16. 31. Shade JK, Ali RL, Basile D, Popescu D, Akhtar T, Marine JE, Spragg DD, Calkins H, Trayanova NA. Preprocedure application of machine learning and mechanistic simulations predicts likelihood of paroxysmal atrial fibrillation recurrence following pulmonary vein isolation. Circ: Arrhythmia Electrophysiol. 2020; 13(7):e008213.


32. Sörnmo L, Laguna P. Bioelectrical signal processing in cardiac and neurological applications. Burlington: Academic Press; 2005. 33. Gacek A, Pedrycz W. ECG signal processing, classification and interpretation: a comprehensive framework of computational intelligence. London Limited: Springer; 2012. 34. Hong S, Zhou Y, Shang J, Xiao C, Sun J. Opportunities and challenges of deep learning methods for electrocardiogram data: a systematic review. Comput Biol Med. 2020; 122: 103801. 35. Bond R, Finlay D, Nugent C, Moore G. A review of ECG storage formats. Int J Med Inform. 2011; 80:681−97. 36. Trigo J, Alesanco A, Martínez I, García J. A review on digital ECG formats and the relationships between them. IEEE Trans Inf Technol Biomed. 2012; 16:432−44. 37. Badilini F, Young B, Brown B, Vaglio M. Archiving and exchange of digital ECGs: a review of existing data formats. J Electrocardiol. 2018; 51:S113-5. 38. Martínez J, Almeida R, Olmos S, Rocha A, Laguna P. A wavelet-based ECG delineator: evaluation on standard databases. IEEE Trans Biomed Eng. 2004; 51:570−81. 39. Lyon A, Mincholé A, Martínez J, Laguna P, Rodriguez B. Computational techniques for ECG analysis and interpretation in light of their contribution to medical advances. J R Soc Interface. 2018; 15:20170821. 40. Somani S, Russak A, Richter F, Zhao S, Vaid A, Chaudhry F, De Freitas J, Naik N, Miotto R, Nadkarni G, Narula J, Argulian E, Glicksberg B. Deep learning and the electrocardiogram: review of the current state-of-the-art. EP Eur. 2021; euaa377. 41. Perez Alday E, Gu A, Shah AJ, Robichaux C, Ian Wong A, Liu C, Liu F, Bahrami Rad A, Elola A, Seyedi S, Li Q, Sharma A, Clifford G, Reyna M. Classification of 12-lead ECGs: the PhysioNet/Computing in Cardiology Challenge 2020. Physiol Meas. 2021; 41:124003. 42. Liu F, Liu C, Zhao L, Zhang X, Wu X, Xu X, Liu Y, Ma C, Wei S, He Z, Li J, Yin Kwee E. An open access database for evaluating the algorithms of electrocardiogram rhythm and morphology abnormality detection. J Med Imaging Health Inform. 2018; 8:1368−73. 43. Sodmann P, Vollmer M, Nath N, Kaderali L. A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms. Physiol Meas. 2018; 39: 104005. 44. Camps J, Rodríguez B MA. Deep learning based QRS multilead delineator in electrocardiogram signals. Proc Comput Cardiol Conf (CinC). 2018; 45:1−4. 45. Jimenez-Perez G, Alcaine A, Camara O. U-Net architecture for the automatic detection and delineation of the electrocardiogram. Proc Comput Cardiol (CinC). 2019; 46:1−4. 46. Jimenez-Perez G, Alcaine A, Camara O. Delineation of the electrocardiogram with a mixed-qualityannotations dataset using convolutional neural networks. Sci Rep. 2021; 11:863. 47. Moskalenko V, Zolotykh N, Osipov G. Deep learning for ECG segmentation. In: Advances in neural computation, machine learning, and cognitive research III. Springer International Publishing; 2020. p. 246−54. 48. Tison G, Zhang J, Delling F, Deo R. Automated and interpretable patient ECG profiles for disease detection, tracking, and discovery. Circ: Cardiovasc Qual Outcomes. 2019; 12:e005289. 49. Abrishami H, Han C, Zhou X, Campbell M, Czosek R. Supervised ECG interval segmentation using LSTM neural network. In: Proceedings international conference on bioinformatics and computational biology (BIOCOMP), 2018. 50. Puthusserypady S, Peimankar A. DENS-ECG: a deep learning approach for ECG signal delineation. Expert Syst Appl. 2021; 165:113911. 51. Mincholé A, Camps J, Lyon A, Rodríguez B. 
Machine learning in the electrocardiogram. J Electrocardiol. 2019; 57:S61−4. 52. Ebrahimi Z, Loni M, Daneshtalab M, Gharehbaghi A. A review on deep learning methods for ECG arrhythmia classification. Expert Syst Appl: X. 2020; 7:100033. 53. Parvaneh S, Rubin J, Babaeizadeh S, Xu-Wilson M. Cardiac arrhythmia detection using deep learning: A review. Journal of Electrocardiology. 2019;57:S70−4. 54. Strodthoff N, Wagner P, Schaeffter T, Samek W. Deep learning for ECG analysis: benchmarks and insights from PTB-XL. IEEE J Biomed Health Inform. 2021; 25:1519−28. 55. Jain R, Tandri H, Daly A, Tichnell C, James C, Abraham T, Judge D, Calkins H, Dalal D. Readerand instrument-dependent variability in the electrocardiographic assessment of arrhythmogenic right ventricular dysplasia/cardiomyopathy. J Cardiovasc Electrophysiol. 2011; 22:561−8.


56. Tomlinson D, Bashir Y, Betts T, Rajappan K. Accuracy of manual QRS duration assessment: its importance in patient selection for cardiac resynchronization and implantable cardioverter defibrillator therapy. Europace. 2009; 11:638−42. 57. Richter R, Vineet V, Roth S, Koltun V. Playing for data: ground truth from computer games. Proc Eur Conf Comput Vis (ECCV), LNCS. 2016; 9906:102−18. 58. Heimann T, Mountney P, John M, Ionasec R. Real-time ultrasound transducer localization in fluoroscopy images by transfer learning from synthetic training data. Med Image Anal. 2014; 18:1320−8. 59. Doste R, Lozano M, Gomez J, Alcaine A, Mont L, Berruezo A, Camara O, Sebastian R. Predicting the origin of outflow tract ventricular arrhythmias using machine learning techniques trained with patient-specific electrophysiological simulations. Proc Comput Cardiol (CinC). 2019; 46:1−4. 60. Jimenez-Perez G, Acosta J, Alcaine A, Camara O. Generalizing electrocardiogram delineation: training convolutional neural networks with synthetic data augmentation. 2021. Available online: https://arxiv.org/abs/2111.12996. 61. Unterhuber M, Rommel K, Kresoja K, Lurz J, Kornej J, Hindricks G, Scholz M, Thiele H, Lurz P. Deep learning detects heart failure with preserved ejection fraction using a baseline electrocardiogram. Eur Heart J-Digit Health. 2021; ztab081. 62. Maier-Hein L, Eisenmann M, Reinke A, Onogur S, Stankovic M, Scholz P, Arbel T, Bogunovic H, Bradley A, Carass A, Feldmann C, Frangi A, Full P, van Ginneken B, Hanbury A, Honauer K, Kozubek M, Landman B, März K, Maier O, Maier-Hein K, Menze B, Müller H, Neher P, Niessen W, Rajpoot N, Sharp G, Sirinukunwattana K, Speidel S, Stock C, Stoyanov D, Taha A, van der Sommen F, Wang C, Weber M, Zheng G, Jannin P, Kopp-Schneider A. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat Commun. 2018; 9:5217. 63. Kao DP, Trinkley KE, Lin C-T. Heart failure management innovation enabled by electronic health records. JACC: Heart Fail. 2020; 8(3):223−33. 64. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2018; 22:1589−604. 65. Johnson A, Pollard T, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi L, Mark R. MIMIC-III, a freely accessible critical care database. Sci Data. 2016; 3:160035. 66. Huang S-C, Pareek A, Seyyedi S, Banerjee I, Lungren MP. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit Med. 2020; 3(136). 67. Harerimana G, Kim JW, Yoo H, Jang B. Deep learning for electronic health records analytics. IEEE Access. 2019; 7:101245−59. 68. Choi E, Schuetz A, Stewart WF, Sun J. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inform Assoc. 2016; 24(2):361−70. 69. Lipton ZC, Kale DC, Elkan C, Wetzel R. Learning to diagnose with LSTM recurrent neural networks. In: Proc ICLR. 2016. 70. Latif J, Xiao C, Tu S, Rehman SU, Imran A, Bilal A. Implementation and use of disease diagnosis systems for electronic medical records based on machine learning: a complete review. IEEE Access. 2020; 8:150489−513. 71. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor AI: predicting clinical events via recurrent neural networks. In: Doshi-Velez F, Fackler J, Kale D, Wallace B, Wiens J, editors. Proceedings of the 1st machine learning for healthcare conference, series proceedings of machine learning research, vol. 56. Northeastern University, Boston, MA, USA: PMLR, 18−19 Aug 2016. p. 301−18. 72. Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016; 6:26094. 73. Yuan Q, Cai T, Hong C, Du M, Johnson BE, Lanuti M, Cai T, Christiani DC. Performance of a machine learning algorithm using electronic health record data to identify and estimate survival in a longitudinal cohort of patients with lung cancer. JAMA Netw Open. 2021; 4(7):e2114723. 74. Wesolowski S, Lemmon G, Hernandez EJ, Henrie A, Miller TA, Weyhrauch D, Puchalski MD, Bray BE, Shah RU, Deshmukh VG, Delaney R, Yost HJ, Eilbeck K, Tristani-Firouzi M, Yandell M. An explainable artificial intelligence approach for predicting cardiovascular outcomes using electronic health records. PLOS Digit Health. 2022; 1(1):1−17. 75. Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLOS ONE. 2013; 8(6):1−13. 76. Beaulieu-Jones BK, Greene CS. Semi-supervised learning of the electronic health record for phenotype stratification. J Biomed Inform. 2016; 64:168−78. 77. Tran T, Nguyen TD, Phung D, Venkatesh S. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). J Biomed Inform. 2015; 54:96−105. 78. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J. Multi-layer representation learning for medical concepts. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, series KDD '16. New York, NY, USA: Association for Computing Machinery. 2016; p. 1495−504. 79. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv. 2016. 80. Landi I, Glicksberg BS, Lee H-C, Cherng S, Landi G, Danieletto M, Dudley JT, Furlanello C, Miotto R. Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digit Med. 2020; 3(96). 81. Si Y, Du J, Li Z, Jiang X, Miller T, Wang F, Jim Zheng W, Roberts K. Deep representation learning of patient data from electronic health records (EHR): a systematic review. J Biomed Inform. 2020; 103671. 82. Bellamy D, Celi L, Beam AL. Evaluating progress on machine learning for longitudinal electronic healthcare data. arXiv. 2020. 83. Messina P, Pino P, Parra D, Soto A, Besa C, Uribe S, Andía M, Tejos C, Prieto C, Capurro D. A survey on deep learning and explainability for automatic image-based medical report generation. arXiv. 2020. 84. Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S, Bakas S, Galtier MN, Landman BA, Maier-Hein K, Ourselin S, Sheller M, Summers RM, Trask A, Xu D, Baust M, Cardoso MJ. The future of digital health with federated learning. NPJ Digit Med. 2020; 3:119.


11 Conclusions
Andrew King and Nicolas Duchateau

Authors’ contribution: • Main chapter: AK, ND.



AI methods, and in particular those based on the latest deep learning techniques (as described in Chap. 3), have made rapid progress across a range of applications in cardiology. Whilst some applications, such as diagnosis (Chap. 5) and prognosis (Chap. 6), are still mainly at the fundamental research stage, others, such as automated segmentation and biomarker quantification (Chap. 4), are now moving towards clinical translation. This means that now is a critical time for ensuring that these technical advances are matched by corresponding advances in terms of equitable patient benefit. This book has tried to showcase the state of the art and raise important issues that should be considered with the aim of patient benefit in mind.

We have discussed the importance of answering the right question (Chap. 9). For example, although undoubtedly interesting, there is little patient benefit in producing an automated AI-based diagnosis tool for a task that can be easily and quickly performed by cardiologists. Linked to this point, we have highlighted the importance of assessing the impact of AI on clinical workflows and patient outcomes, i.e. what role will AI play and what would be the impact on patients (positive and negative) in this context? We have also highlighted the importance of considering how clinicians will interact with AI, for example by reviewing and/or correcting outputs, or interpreting recommendations in a decision support setting (Chap. 8). Furthermore, how should these considerations affect the design of AI tools and the research challenges we should focus on?

A common theme running through several chapters of the book has been that of effective and clinically appropriate validation of AI. It is important to start simple, both in terms of the models used and the validation data employed: there is little point in using an unnecessarily complex model when a simple one will work just as well (cf. Occam's razor). Likewise, when developing models it is often more efficient to control some of the variation in the data to begin with; once the model works well on the controlled data, complexity can be increased. However, with clinical translation in mind, it is imperative that validation eventually employs external validation sets consisting of large and diverse real-world clinical data. Furthermore, validation should eventually focus on the impact of the AI tools on patient outcomes, rather than on surrogate measures such as classification accuracy. Finally, the benefits should be equitable, so when reporting validation results it is important to include complete and open reporting of performance across subgroups such as gender and race (Chaps. 6 and 9).

Now is also the time to consider what comes next. We may soon be at a stage where much of the “low-hanging fruit” for AI in cardiology has been picked (e.g. Chap. 4, Measurement and Quantification). However, as noted above, more challenging tasks remain, such as the role of AI in diagnosis (Chap. 5) and prognosis (Chap. 6), and these may have more specific requirements for AI tools such as interpretability/explainability. It is important to bring patients and the general public with us on this journey into an AI-enabled future in healthcare. This raises the need for effective public engagement to address the public's (often valid) concerns about the use of AI.


Interpretability/explainability may also have a role to play here in engaging patients in decisions that will affect their futures.

As well as AI, another key concept dealt with in this book has been the role of “big data”. We live in an age where data are everywhere, and the combination of big data and AI has great potential if used sensibly and effectively. A key challenge is to find the right balance between the ready availability of large amounts of rich data and actually exploiting this richness in a proper manner. Big data may be a necessary but not sufficient condition for effective AI. However, there are valid privacy concerns associated with medical data, and it seems unlikely that these will ever go away completely. New approaches such as federated learning (Chap. 5) offer great potential for the exploitation of big data for real patient benefit. Federated learning is a fast-moving field that we expect to significantly impact AI in cardiology in the coming years.

We should also be aware of the potential of combining AI with other, related research fields. For example, combining AI techniques with state-of-the-art biophysical modelling is highly promising in several respects. It allows access to physiological parameters not available from the imaging data (for example myocardial contractility, whereas images only allow the estimation of myocardial deformation). It also provides meaningful physiological guidance to data-based approaches, and therefore represents a promising way of overcoming AI's lack of interpretability, which could help to promote clinicians' trust in AI. From a broader perspective, it enables the vision of a digital twin (Chaps. 9 and 10), which can complement the existing data by simulating future scenarios of patient evolution or therapy, and which will be of high value for moving from generic healthcare to personalized medicine.

In summary, these are exciting times to be involved in AI research for cardiology (and other areas). But this is not a time to “take our eye off the ball”, sit back, and expect patient benefit to arrive. On the contrary, this is the most critical time to make sure that we do the right things to ensure that everybody benefits from the coming changes in the way we analyze and exploit big data with AI. It is arguably the case that most people working in AI and cardiology at the moment are very much from the technical side and lack the perspective and knowledge to address the key issues mentioned above. Therefore, inter-disciplinary working has never been more important. It is crucial that technologists engage with the clinical aspects of their work, and that clinicians engage with the technological side. We hope that this book can motivate more people to begin this journey and help to bring tangible benefits to us all.

Acknowledgements AK was supported by the EPSRC (EP/P001009/1), the Wellcome/EPSRC Centre for Medical Engineering at the School of Biomedical Engineering and Imaging Sciences, King's College London (WT 203148/Z/16/Z) and the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare. ND was supported by the French ANR (LABEX PRIMES of Univ. Lyon [ANR-11-LABX-0063] within the program “Investissements d'Avenir” [ANR-11-IDEX-0007], and the JCJC project “MIC-MAC” [ANR-19-CE45-0005]).


Supplementary Information
Solution of the Exercises
Index



Solution of the Exercises

Chapter 2
Exercise 1
For example:
− Lack of availability of annotations
− Lack of trust in annotations
− No clear task / interested in data exploration

Exercise 2
Imaging data: routine modalities, in particular MRI (due to better image quality in each slice and the availability of large annotated databases), although mostly 2D slices treated independently. Non-imaging data: in particular the ECG, with large public databases, data challenges, shared tools across the scientific community, and easy-to-define supervised problems. Modalities with less standardized acquisitions, lower imaging quality, smaller populations, or less annotated data are more challenging. The analysis of 3D data and temporal sequences is also more challenging: it requires more computational power and care about spatial or temporal consistency, and less data are available.

Exercise 3
By choosing the hyperparameter setting that maximized performance on the test set, it ceases to be a test set; rather, it is a validation set. Since validation performance is biased, we would expect performance to be worse when deployed on real data, because of potential ‘domain’ differences between the validation set and the real-world data.

Exercise 4
If a large amount of unannotated data are available, a semi-supervised approach could be considered.


Exercise 5
Choice of input data: echo and ECG, as both are routinely acquired. Gather (retrospective?) data on the occurrence of life-threatening arrhythmias, including both patients who were given an ICD under current guidelines and those who were not but suffered a life-threatening arrhythmia. Train a supervised model with arrhythmia occurrence as output; the input(s) could be hand-crafted features from echo/ECG, or learnt features using deep learning. Validation: depending on the amount of data available, use a train/validation/test split or a cross-validation approach. The new approach would need to be compared with existing clinical guidelines.

Exercise 6
LV strain is typically calculated from segmentations of the LV. Machine learning could be used for automated segmentation:
− Train a supervised model to make pixelwise predictions of class
− Use manual segmentations as training/validation/test data
− If unannotated data are available, a semi-supervised approach could also be used
Machine learning could also be used for a MACE prediction model:
− A supervised model would take strain as input and predict MACE as output
− This would need data from patients who have/haven't experienced MACE within a limited time period after imaging
Depending on the amount of data, use a train/validation/test split or cross-validation to validate the model.

Chapter 3
Exercise 1
An ‘epoch’ is seeing each data pair once; an ‘iteration’ is seeing one subset (batch) of data pairs. The number of iterations per epoch times the batch size should equal the total training set size.

Exercise 2
Batch gradient descent: all training data in each gradient update (prohibitively slow and memory intensive). Stochastic gradient descent: use a single training pair in each update. Mini-batch stochastic gradient descent: choose a subset of training pairs for each update. A minimal sketch contrasting these variants is given below.
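The following sketch contrasts the variants on a least-squares problem; the data, learning rate, and batch size are all illustrative:

# Batch vs. stochastic vs. mini-batch gradient descent (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 1000)

def grad(w, Xb, yb):
    """Gradient of the mean squared error on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w, lr = np.zeros(3), 0.1
for epoch in range(5):
    # Batch gradient descent would be: w = w - lr * grad(w, X, y)
    # (one update per epoch, using all training pairs).
    idx = rng.permutation(len(y))
    for batch in np.array_split(idx, len(y) // 32):  # mini-batches of ~32
        w = w - lr * grad(w, X[batch], y[batch])
    # Stochastic gradient descent is the special case of batch size 1.

print(w.round(2))  # close to [ 2. -1.  0.5]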

Exercise 3
Both use the dot product of the inputs and the weights. The perceptron uses a sign activation, whereas logistic regression uses a sigmoid.

Exercise 4
Use a non-linear decision function. Measure more variables. Transform the data.

Exercise 5
A deep network will be necessary to learn the complex mapping. With fully-connected layers this would lead to an explosion in the number of parameters. CNNs allow deep networks without this explosion, so they would be the preferred approach.

Chapter 4
Exercise 1
CNNs are efficient when dealing with imaging data due to their reduced number of parameters. This is made possible by weight-sharing within layers, enabling much deeper architectures to be defined. The sketch below illustrates the difference in parameter counts.
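A back-of-the-envelope comparison (the layer sizes are illustrative, and biases are ignored):

# Parameter counts: fully-connected layer vs. convolutional layer,
# for a 256 x 256 input with a same-size, single-channel output.
h, w = 256, 256
dense_params = (h * w) * (h * w)  # every pixel connected to every pixel
conv_params = 3 * 3 * 1 * 16      # sixteen 3x3 filters, shared everywhere

print(f"fully connected: {dense_params:,} weights")  # 4,294,967,296
print(f"convolutional:   {conv_params:,} weights")   # 144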

Exercise 2
Advantages: it is important to benchmark models' performance on consistent data for a fair evaluation. Drawbacks: most databases are not huge in size, so there is a danger of all models effectively “overfitting” to a (relatively) small amount of public data. Annotation protocols and “styles” can also vary from centre to centre, so again there is a danger of models being tuned to a particular protocol/style. Ideally, the use of public databases should be combined with thorough validation on large-scale real-world clinical data.


Exercise 3
RNNs are designed to process time series data, so problems in which temporal data are available would be ideal, for example processing cine CMR images to derive biomarkers of cardiac function, or processing ECG signals.

Exercise 4
These connections link corresponding layers between the encoder and decoder parts. Their aim is to recover fine details that may have been lost in the encoder. A minimal sketch is given below.
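A minimal sketch of one such skip connection, here written with PyTorch; the tiny architecture is illustrative, not a full U-net:

# Encoder-decoder with one skip connection: the encoder feature map is
# concatenated with the upsampled decoder features, so fine spatial
# detail lost by downsampling can be recovered.
import torch
import torch.nn as nn

class TinySkipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 8, 3, padding=1)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Conv2d(8, 8, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2)
        self.dec = nn.Conv2d(8 + 8, 1, 3, padding=1)  # 8 skip + 8 decoder

    def forward(self, x):
        e = torch.relu(self.enc(x))                   # encoder features
        b = torch.relu(self.bottleneck(self.down(e)))
        u = self.up(b)                                # back to input size
        return self.dec(torch.cat([u, e], dim=1))     # skip connection

out = TinySkipNet()(torch.randn(1, 1, 64, 64))
print(out.shape)  # torch.Size([1, 1, 64, 64])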

Exercise 5
Advantages: potentially more efficient and automated. Disadvantages: lack of interpretability, e.g. how would cardiologists know when they can trust the figures produced? With segmentations, they can quickly look at them and see whether they are satisfactory (a key issue of clinical trust in DL).

Exercise 6
Lower image quality, more artefacts, and most clinical imaging is 2D rather than 3D.

Chapter 5
Exercise 1
ML: hand-crafted features; DL: learnt features.

Exercise 2
Combine the learning of the classifier model with the learning of the features used for classification.

Exercise 3
(Warning: this is still a matter of opinion.) It raises the need for interpretability, and the need to validate the impact on clinical workflows rather than just the accuracy of classification.


Exercise 4
The ability to quantify uncertainty could be important if the tools are used for decision support (see the previous exercise).

Exercise 5
More of a moral question: is it better that overall performance improves even if subgroups suffer reduced performance? Do the needs of the many outweigh the needs of the few?

Exercise 6
CNNs have performed well on imaging-based tasks, but they (typically) require lots of data: how much data are available for training/validation? There is a possible class imbalance problem if the diseases are very rare. Is there a need for interpretability? Open and complete reporting of performance across sex/race subgroups: are these diseases more prevalent in some groups than others? This should be considered if/when applying Fair ML techniques. Finally, validation on real-world data is important, e.g. external datasets featuring imaging from multiple centres/vendors.

Exercise 7
Potential for knowledge discovery. The representation will not be biased by labels, which might not be perfect. The representation could still be used for supervised ML, but the unsupervised representation might be more interpretable.

Chapter 6
Exercise 1
Kaplan-Meier curves show survival probability over time; one can compare the K-M curves of two groups, one with and one without treatment (a minimal sketch is given below). Yes, using the predicted outcome instead of the theoretical outcome as the label, although this is more relevant for unsupervised approaches (see tutorial).
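A minimal sketch of such a comparison, assuming the lifelines package is available; the follow-up data are synthetic:

# Kaplan-Meier curves for a treated vs. an untreated group.
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
# Synthetic survival times (months); event=True means the event was observed
t_treated, e_treated = rng.exponential(40, 100), rng.random(100) < 0.7
t_control, e_control = rng.exponential(25, 100), rng.random(100) < 0.7

kmf = KaplanMeierFitter()
ax = kmf.fit(t_treated, e_treated, label="treated").plot_survival_function()
kmf.fit(t_control, e_control, label="control").plot_survival_function(ax=ax)
ax.set_xlabel("months"); ax.set_ylabel("survival probability")
plt.show()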


Exercise 2
Traditional: mostly simple descriptors and demographic data. AI: in principle, features can also be learnt from more complex data such as imaging, ECG, electrophysiology, etc.

Exercise 3
Supervised: the learnt representation will be optimized for the task, but it is difficult and time-consuming to obtain large labelled datasets. Unsupervised: can exploit large unlabelled datasets and learn representations that are independent of the labels (which may be noisy).

Exercise 4
Up to the reader.

Chapter 7
Exercise 1
Breathing motion, cardiac cycle motion, mistriggering, magnetic field-related issues, etc. These can lead to errors in derived biomarkers such as volumes and EF.

Exercise 2
Poor view planning changes the appearance of the anatomy in the images, so ML view classification models can be used for QC. ML can automate view planning by identifying key landmarks that can be used to define the imaging planes.

Exercise 3
It can lead to errors in derived biomarkers, and these biomarkers are often used for disease diagnosis/characterisation (e.g. EF for heart failure) or for predicting prognosis. Direct and indirect QC of segmentation ...


Exercise 4
Does the database already feature quality control of the images? If not, artefact images will need to be excluded from the study to avoid corrupting the results. If the database is large, this will be laborious to do manually, so automated tools could be useful. The same goes for segmentations: the DL-based segmentation model could fail, and it is important to detect these failures for the same reason.

Exercise 5
It could be argued that in clinical workflows the outputs of AI models (e.g. segmentation) will always be inspected anyway, i.e. the cardiologist will perform the QC. (This is less feasible for large-scale population studies.) So it is a reasonable argument to make. But can we be sure that cardiologists under time pressure would spot all errors, even spatially/temporally localised ones?



Index

A Ablation, radiofrequency, 110, 116, 118, 184, 186 ACDC dataset, 67, 79, 94, 98, 150 Activation function, 43 Active contour model, 144 Adaboost, 90 Adversarial attacks, 162 AI, definition, 3 AI effect, 3 AI winter, 5 AlphaGo, Google, 5 American Heart Association (AHA), 16, 75, 112, 122 Annotation, 8 Apical rocking, 115 Area Under ROC Curve (AUC), 21, 173 Arrhythmia, 139, 184 Artefacts, cardiac MR, 139 Artificial neural network, definition, 3, 43 Atlas-based segmentation, 146 Atlas, statistical, 19 Atrial fibrillation, 26, 108, 109, 112, 116, 187 Atrous convolutions, 74 Attention map, 122, 192 Augmentation, data, 141, 176, 178 Autoencoder, 91, 92, 166, 187, 194

B Babylon Health, 3 Back propagation, 51, 123 Bagging, 90 Basis function, 50 Batch gradient descent, optimization, 42 Batch size, optimization, 42, 54 Bayesian Neural Network (BNN), 93, 97 Bias, fairness, 96, 126, 162, 175, 202 Bias, neural network, 39 Bias, validation, 20 Bias-variance tradeoff, 20, 175 Big data, definition, 8 Biomarkers, 58, 80, 86, 126, 147 Biomechanical model, 178 Black box, 96, 158, 161 Body surface potential map, 184, 186 Boosting, 90, 141 Breathing motion, cardiac MR, 139

C Calcium score, 59, 72, 106

CAMUS dataset, 67, 70 Canonical Correlation Analysis (CCA), 166 Cardiac MR, 25, 80, 86, 120, 136, 139 Cardiac Resynchronization Therapy (CRT), 59, 112, 167 Cardiac scintigraphy, 136 Cardioversion, 116 Categorical data, 15 Causal inference, 162, 178 Causality, 127 Challenge datasets, 178 Checkers, 5 Cine imaging, cardiac MR, 25 Circle Cardiovascular Imaging, 4 CLAIM guidelines, 177 Class Activation Mapping (CAM), 122 Classification, 14, 37, 89 Class imbalance, 161, 188 Clustering, 14, 15, 91, 126, 194 Collider bias, 176 Compactness, model, 21 Computed Tomography (CT), 26, 59, 80, 120, 121, 136 Computing in Cardiology, 27 Concept activation vectors, 94 Conditional Generative Adversarial Network (cGAN), 92 CONSORT guidelines, 176 Continuous data, 15 Convolutional Neural Network (CNN), 52, 62, 90, 136, 190 Coronary angiography, CT, 26, 59 Counterfactual inference, 127 Cox proportional hazard model, 109 Cross entropy, 46 Cross validation, 20, 42, 175 Cryoablation, 116, 184 Curation, data, 8

D Dark rim artefact, cardiac MR, 137 Decision support, 3, 86, 99, 114, 126, 157 Decision tree, 15, 89 Deep Blue, 5 Deep CCA, 166 Deep learning, definition, 3 Descriptor, data, 16 Diagnosis, 202 Diagnosis, ECG, 190 Dice coefficient, 21, 92, 143 Digital twin, 127, 178, 203


Dilated cardiomyopathy, 69, 94, 167 Dilated convolutions, 74 Dimensionality, 16 Dimensionality, intrinsic, 16 Dimensionality reduction, 14, 16, 19, 164, 191 Discrete data, 15 Disentanglement, 80 Domain adaptation, 92, 176 Domain, function, 37 Doppler, echocardiography, 23, 58 Downsampling, 63, 64, 75, 91

E Echocardiography, 23, 58, 80, 86, 120, 136, 138 EchoNet Dynamic dataset, 139 Ejection fraction, 59, 86, 87, 106, 112, 113, 118, 136 Electro-anatomical mapping, 186 Electrocardiogram (ECG), 27, 59, 62, 80, 110, 116, 137, 139, 152, 184, 186, 188 Electronic Health Record (EHR), 23, 27, 77, 107, 116, 118, 193 Electrophysiology, 118, 184 Eminence-based knowledge, 159 Encoder-Decoder Network (EDN), 63, 91 Endocarditis, 26 End-to-end training, 52, 97 Ensemble, 67, 90, 93, 176 Epoch, optimization, 42, 54 Ethical issues, 99, 126, 152 European Society of Cardiology (ESC), 112 Evidence-based knowledge, 159 Evidence Lower Bound (ELBO), 146 E-wave deceleration time, 58 Expectation maximization, 15 Expert system, 3 Explainability, 122, 163, 202 External validation, 20, 202

F Fairness, 96 Feature, data, 16 Feature map, CNN, 52, 75 Federated learning, 95, 99, 107, 126, 195, 203 Fidelity, data, 8 Filter, CNN, 52 Fractional area change, left ventricle, 58 Fully Connected Network (FCN), 61 Fully convolutional network, 63, 91, 146

G GE, 80

Generalization, 141 Generative Adversarial Network (GAN), 63, 91, 92, 142, 143, 152 Genetic algorithm, 15 German national cohort, 107 Google Research, 4 GradCAM, 122 Gradient descent, optimization, 40 Graph Convolutional Network (GCN), 63 Graphics Processing Unit (GPU), 72

H Hand-crafted features, 16, 90, 191 Hausdorff distance, 21, 144 Heart failure, 59, 62, 68, 87, 94, 96, 108, 112, 126, 166, 168, 193 HeartFlow, 7 Hidden layer, 51 Hyperparameter, 20, 42, 67, 92 Hypertrophic cardiomyopathy, 69, 94, 168

I Implantable Cardioverter-Defibrillator (ICD), 106, 118, 184 Independent Component Analysis (ICA), 15 Internal validation, 20 Inter-observer variability, 60, 72, 140 Interpretability, 16, 22, 90, 96, 106, 107, 158, 160, 163, 177, 192, 202 Intracardiac catheter, 184 Intra-observer variability, 60, 72, 140 Iteration, optimization, 42, 54

K Kaplan-Meier survival curve, 108 Kasparov, Garry, 5 Kernel CCA, 166 Kernel regression, 15 Kernel ridge regression, 186 Kernel SVM, 50 Kernel trick, 89 K-means clustering, 15

L Late Gadolinium Enhancement (LGE), cardiac MR, 25, 59, 120, 121 Leave-one-out cross validation, 20 Left bundle branch block, 112, 113, 120 Left ventricular mass, 64, 87 Left Ventricular Outflow Tract (LVOT), 142 Linear Discriminant Analysis (LDA), 15, 163


Linear regression, 15 Logistic regression, 15, 40, 44 Logistic regression, multi-class, 46 Long Short-Term Memory (LSTM), 90, 141, 190 Loss function, 40 Loss function, cross entropy, 49 Loss function, logistic regression network, 46 Loss function, multi-class logistic regression network, 49 Loss function, multi-class perceptron, 48 Loss function, perceptron, 44

M Machine learning, definition, 3 Manifold learning, 15, 19, 164, 194 Markov Chain Monte Carlo (MCMC), 93 Max pooling, 74, 75 Mean contour distance, 144 Mean pressure gradient, 58 Mean shift clustering, 15 MIMIC-III dataset, 193 Mini-batch, optimization, 42 Mini-batch stochastic gradient descent, optimization, 42 Minsky, Marvin, 5 Mistriggering, cardiac MR, 139 M-mode, echocardiography, 23 MNIST, 52 Monte Carlo dropout, 93 Multi-atlas-based segmentation, 146 Multi-layer perceptron, 51 Multiple kernel learning, 166 Multiview learning, 166 MYCIN expert system, 3

N Natural Language Processing (NLP), 193 Neural network, 15 Neural network, definition, 3 nnU-net, 67, 70 Nominal data, 15

O Occam's razor, 202 One hot encoding, 19 Ordinal data, 15 Outcome prediction, 105 Overfitting, 19, 20, 92, 115, 161, 175, 179, 188, 191

P Pacemaker, 112, 184 Partial least squares, 166 Peak inflow rate, 58 Perceptron, 40, 42, 61 Perceptron, multi-class, 46 Perfusion imaging, cardiac MR, 59 Phase contrast, cardiac MR, 59 Philips, 80 Physics-Informed Neural Network (PINN), 186 PhysioNet, 27, 95, 194 Picture Archiving and Communication System (PACS), 23, 28 Pix2Pix, 92 Pooling, 74, 75, 92, 123 Positron Emission Tomography (PET), 26, 59 Precision, 21 Precision cardiology, 185 Principal Component Analysis (PCA), 15, 19, 117, 164 Privacy, data, 92, 107, 195, 203 Prognosis, 202 Provenance, data, 174 Public engagement, 202 Pubmed, 6, 188 P wave, ECG, 116

Q QRS complex, ECG, 59, 116 QRS duration, 112, 113 QRS duration, ECG, 59 QRS wave, ECG, 116 Quality control, 25, 136 Quantification, 57, 152, 202 Quantification, ECG, 59, 62, 190

R Radiomics, 16, 80, 94, 120, 178 Random forest, 90, 110, 114, 117, 140 Range, function, 37 Ranked data, 15 Recall, 21 Receiver Operating Characteristic (ROC) curve, 21, 76, 110, 173 Receptive field, CNN, 73 Receptive field, convolution, 74 Reconstruction, cardiac MR, 25, 80 Recurrent Neural Network (RNN), 62, 90, 141, 190, 193 Registration, 146 Regression, 14, 89, 109 Regularization, 21


Regulatory issues, 99 Reinforcement learning, 13, 142 Representation learning, 166 Ridge regression, 15 Root-mean-square error, 21 R-wave, ECG, 137

S Samuel, Arthur, 3, 5 Sedol, Lee, 5 Segmentation, 8, 14, 25, 58, 64, 80, 91, 139, 143, 147, 152, 202 Semi-supervised learning, 13, 142 Sensitivity, 21 Separability, cluster, 21 Septal flash, 115 Siemens, 80 Sigmoid function, 44 Sign function, 43 Single Photon Emission Computed Tomography (SPECT), 26, 59, 75, 136 Snakes, 144 Softmax function, 48, 68 Specificity, 21 Speckle tracking, echocardiography, 23, 58 SPIRIT-AI guidelines, 177 Stochastic gradient descent, optimization, 42, 92 Strain, 58, 64, 120 Strain, Lagrangian, 17 Stride, convolution, 75 S-T segment elevation, ECG, 59 Sudden cardiac death, 106, 111, 118 Sum-of-squared differences, 21 Supervised learning, 13, 37, 88, 113, 126, 163, 190 Support Vector Machine (SVM), 15, 88, 110, 117, 141, 148, 186

T T1 map, cardiac MR, 25, 59, 146 T2 map, cardiac MR, 25, 59 Testing set, 20 Training set, 19, 20, 38 Trans-oesophageal echocardiography, 24 TRIPOD guidelines, 177 T wave, ECG, 116

U UK Biobank, 7, 8, 70, 95, 107, 142, 144, 176 Uncertainty, 22, 93, 114, 146 Underfitting, 20 U-net, 63, 67, 70, 78, 92, 147, 190 Unsupervised learning, 13, 114, 163 Upsampling, 63, 91

V Validation, 19, 202 Validation set, 20 Valve orifice area, 58 van Einthoven, Willem, 27 Variance, validation, 20 Variational autoencoder, 91, 93 Ventricular arrhythmia, 108, 118 Ventricular tachycardia, 187 VGG network, 76 View classification, 139 View planning, cardiac MR, 141

W Watson, IBM, 3, 5 Wearable devices, 126