Adaptive Radar Detection: Model-Based, Data-Driven and Hybrid Approaches
ISBN 163081900X, 9781630819002

This book shows you how to adopt data-driven techniques for the problem of radar detection, both per se and in combination with model-based approaches.


English · 234 [235] pages · 2022

Table of contents :
Adaptive Radar Detection Model-Based, Data-Driven, and Hybrid Approaches
Contents
Preface
Acknowledgments
1 Model-Based Adaptive Radar Detection
1.1 Introduction to Radar Processing
1.1.1 Generalities and Basic Terminology of Coherent Radars
1.1.2 Array Processing and Space-Time Adaptive Processing
1.1.3 Target Detection and Performance Metrics
1.2 Unstructured Signal in White Noise
1.2.1 Old but Gold: Basic Signal Detection and the Energy Detector
1.2.2 The Neyman–Pearson Approach
1.2.3 Adaptive CFAR Detection
1.2.4 Correlated Signal Model in White Noise
1.3 Structured Signal in White Noise
1.3.1 Detection of a Structured Signal in White Noise and Matched Filter
1.3.2 Generalized Likelihood Ratio Test
1.3.3 Detection of an Unknown Rank-One Signal in White Noise
1.3.4 Steering Vector Known up to a Parameter and Doppler Processing
1.4 Adaptive Detection in Colored Noise
1.4.1 One-Step, Two-Step, and Decoupled Processing
1.4.2 General Hypothesis Testing Problem via GLRT: A Comparison
1.4.3 Behavior under Mismatched Conditions: Robustness vs Selectivity
1.4.4 Model-Based Design of Adaptive Detectors
1.5 Summary
References
2 Classification Problems and Data-Driven Tools
2.1 General Decision Problems and Classification
2.1.1 M-ary Decision Problems
2.1.2 Classifiers and Decision Regions
2.1.3 Binary Classification vs Radar Detection
2.1.4 Signal Representation and Universal Approximation
2.2 Learning Approaches and Classification Algorithms
2.2.1 Statistical Learning
2.2.2 Bias-Variance Trade-Off
2.3 Data-Driven Classifiers
2.3.1 k-Nearest Neighbors
2.3.2 Linear Methods for Dimensionality Reduction and Classification
2.3.3 Support Vector Machine and Kernel Methods
2.3.4 Decision Trees and Random Forests
2.3.5 Other Machine Learning Tools
2.4 Neural Networks and Deep Learning
2.4.1 Multilayer Perceptron
2.4.2 Feature Engineering vs Feature Learning
2.4.3 Deep Learning
2.5 Summary
References
3 Radar Applications of Machine Learning
3.1 Data-Driven Radar Applications
3.2 Classification of Communication and Radar Signals
3.2.1 Automatic Modulation Recognition and Physical-Layer Applications
3.2.2 Datasets and Experimentation
3.2.3 Classification of Radar Signals and Radiation Sources
3.3 Detection Based on Supervised Machine Learning
3.3.1 SVM-Based Detection with Controlled PFA
3.3.2 Decision Tree-Based Detection with Controlled PFA
3.3.3 Revisiting the Neyman–Pearson Approach
3.3.4 SVM and NN for CFAR Processing
3.3.5 Feature Spaces with (Generalized) CFAR Property
3.3.6 Deep Learning Based Detection
3.4 Other Approaches
3.4.1 Unsupervised Learning and Anomaly Detection
3.4.2 Reinforcement Learning
3.5 Summary
References
4 Hybrid Model-Based and Data-Driven Detection
4.1 Concept Drift, Retraining, and Adaptiveness
4.2 Hybridization Approaches
4.2.1 Different Dimensions of Hybridization
4.2.2 Hybrid Model-Based and Data-Driven Ideas in Signal Processing and Communications
4.3 Feature Spaces Based on Well-Known Statistics or Raw Data
4.3.1 Nonparametric Learning: k-Nearest Neighbor
4.3.2 Quasi-Whitened Raw Data as Feature Vector
4.3.3 Well-Known CFAR Statistics as a Feature Vector
4.4 Rethinking Model-Based Detection in a CFAR Feature Space
4.4.1 Maximal Invariant Feature Space
4.4.2 Characterizing Model-Based Detectors in CFAR-FP
4.4.3 Design Strategies in the CFAR-FP
4.5 Summary
References
5 Theories, Interpretability, and Other Open Issues
5.1 Challenges in Machine Learning
5.2 Theories for (Deep) Neural Networks
5.2.1 Network Structures and Unrolling
5.2.2 Information Theory, Coding, and Sparse Representation
5.2.3 Universal Mapping, Expressiveness, and Generalization
5.2.4 Overparametrized Interpolation, Reproducing Kernel Hilbert Spaces, and Double Descent
5.2.5 Mathematics of Deep Learning, Statistical Mechanics, and Signal Processing
5.3 Open Issues
5.3.1 Adversarial Attacks
5.3.2 Stability, Efficiency, and Interpretability
5.3.3 Visualization
5.3.4 Sustainability, Marginal Return, and Patentability
5.4 Summary
References
List of Acronyms
List of Symbols
About the Author
Index


Adaptive Radar Detection Model-Based, Data-Driven, and Hybrid Approaches

Coluccia: “fm_v2” — 2022/10/7 — 13:05 — page i — #1

For a complete listing of titles in the Artech House Radar Series, turn to the back of this book.


Adaptive Radar Detection Model-Based, Data-Driven, and Hybrid Approaches Angelo Coluccia


Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the U.S. Library of Congress.

British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library.

Cover design by Andy Meaden
ISBN 13: 978-1-63081-900-2

© 2023 ARTECH HOUSE 685 Canton Street Norwood, MA 02062

All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

10 9 8 7 6 5 4 3 2 1


Preface

Radar systems have more than a century of history and have undergone a significant evolution over the decades, extending from the original military motivations to numerous applications in civil contexts, for instance in the automotive field. Although modern systems have multiple functionalities, the foundational goal of a radar certainly remains that of detecting the possible presence of a target by making an automatic decision based on the noisy signal observed back at the receiver. Despite its seeming simplicity, such a task bears a number of formidable challenges; in the end, telling apart useful signal from noise background is a radical ontological question that goes beyond practical engineering, and has connections with fundamental principles of information theory and theoretical physics. In the specific case of radar systems, controllable performance is required, especially in terms of the number of false alarms, but properties such as robustness are also desirable in practice. At the same time, the detector must adapt to a temporally and spatially varying disturbance background possibly made of noise, clutter, interference, and/or jamming.

In recent years, the role of data-driven techniques has grown enormously, moving out of the original computer science, pattern recognition, and artificial intelligence fields to reach many disciplines in engineering. This has been driven by significant advances in automatic classification and recognition tasks. Key factors in this revolution have been the huge increase in computational power due to hardware advances, as well as the availability of huge datasets and ready-to-use software libraries. Multimedia (speech, image) and online-collected data (from platforms and apps providing social network, messaging, shopping, streaming, file sharing, and other services) are in fact abundant and easy to gather today, and tech companies are pushing to develop applications of their own data-driven frameworks in more and more contexts.


In the signal processing and communications areas as well, tools from machine learning and (shallow or deep) neural networks are increasingly adopted to attack problems traditionally addressed by model-based approaches. This trend has also reached the radar field, where, however, peculiar challenges are found. Data-driven methods are in fact typically used in a black-box fashion, exploiting the availability of a large amount of data to feed the box (e.g., a neural network), tuning its parameters by trial and error. This approach is able to capture the richness of real data that cannot be incorporated in a parametric model, hence often yielding impressive results, but it also exhibits several limitations: besides the abovementioned need for large training datasets, it lacks interpretability and control on performance and properties, it is not robust to adversarial attacks, and it exhibits misspecification and bias toward the training dataset. All such issues are critical for artificial intelligence at large. In the case of radars, serious concerns arise due to their adoption in security and safety contexts.

Moving from these premises, the book explores the possibility of adopting data-driven techniques for the problem of target (signal) detection, alone or in combination with model-based approaches. In fact, as mentioned, purely data-driven solutions can provide remarkable results, but at the price of losing some desirable properties. At the same time, domain-specific modeling has been proving its effectiveness for decades, providing control and interpretability. An interesting possibility is thus the combination of both paradigms, a compromise that can be termed “hybrid.”

While several excellent books are already available on the topic of (space-time) adaptive detection—traditionally addressed by means of model-based statistical signal processing techniques only—and books reviewing data-driven techniques for different radar tasks and applications have also recently started to appear, a book unifying traditional adaptive radar detection with the modern data-driven algorithmic view was still missing. The distinguishing feature of this book is that it is written with the aim of truly bridging the two worlds (or two cultures) represented by the model-based and data-driven paradigms. This is accomplished by mixing rigorous mathematical modeling and theoretical discussion with a practical description of algorithmic tools and data-driven intuition, along the fil rouge of statistical learning. A by-product of the learning perspective is also to take a new look at old work, in the hope that this unifying effort can be inspiring for engineers trained in radar systems who seek to evolve their knowledge toward data-driven solutions, but also for researchers who want to enter the specific topic of adaptive radar detection in a modern and comprehensive way. The ultimate aim is to foster a critical attitude toward data-driven tools, and also to pave the way for their judicious integration into the well-established field of adaptive radar detection. Overall, this methodological discussion and critical spin represent a timely and stimulating treatment of a topic that is expected to grow significantly in the near future.


More specifically, the classical formulation of radar detection as a hypothesis testing problem, namely based on the Neyman-Pearson or generalized likelihood ratio test (GLRT) approaches, is revisited under the lens of the data-driven paradigm. Moreover, fundamental concepts, schemes, and algorithms from the statistical learning, classification, and neural network domains are reviewed, showing how they have been adapted in the literature to deal with the problem of radar detection. Emphasis is given to the quest for alternative design approaches that can guarantee the constant false alarm rate (CFAR) property exhibited by traditional model-based adaptive detection statistics. The design of detectors that are robust to signal mismatches, or can conversely reject unwanted signals, is also addressed.

The book considers a large number of sources spanning several communities (radar, statistical signal processing, machine learning, artificial intelligence), enriched with original contributions, diagrams, and figures based on experimental data. It aims at providing a broad yet neat coverage of the topic, hopefully helping the interested technical audience to:

• Have a handy, concise reference for many classic (model-based) adaptive radar detection schemes as well as the most popular machine learning techniques (including deep neural networks);

• Identify suitable data-driven approaches for radar detection and the main related issues, based on critical discussions of their pros and cons;

• Have a smooth introduction to advanced statistical techniques used in adaptive detection, assisted by intuitive visual interpretations and analogies to concepts from machine learning classification and data science;

• Learn which aspects need to be carefully considered in the design of hybrid solutions that can take advantage of both model-based and data-driven approaches.

As a prerequisite, it is assumed that the reader is familiar with the necessary mathematical, statistical, physical, and electrical engineering concepts. A basic understanding of signal theory and radar processing is also recommended. However, the book reviews several essential aspects and provides numerous references to other sources where the needed know-how can be acquired. It generally alternates between detailed explanations and concise reviews, to keep focus while still keeping an eye on the big picture. Short survey sections are also provided on certain aspects and on complementary topics or applications. Finally, the academic community interested in the new possibilities brought by data-driven and hybrid approaches to radar detection may find in this book an overview of existing research directions.

The book is organized into five chapters.

Chapter 1 concisely presents the fundamentals of radar and signal detection, then reviews the most important design ideas behind classical adaptive radar
detectors. Emphasis is on providing insights and drawing relationships between different model-based solutions, rather than on a detailed treatment. This is a distinctive angle compared to the many excellent sources already available in the literature, and ultimately provides a handy reference to popular detectors based on traditional model-based tools, highlighting their intrinsic interpretability and control over the achievable performance.

Chapter 2 introduces the necessary theoretical elements and presents the main algorithmic tools adopted in machine learning, in particular for classification problems (of which detection is a special case). Classical hypothesis testing theory, including the Neyman-Pearson rationale, is reviewed and put in connection with the use of loss functions in classification problems, following the formulation of statistical learning. Essential principles, language, and concepts of the data-driven paradigm found in machine learning and data science are introduced and adapted to the typical background, jargon, and needs of engineers in the radar community. Several popular machine learning algorithms are compendiously described to provide a self-consistent reference.

Chapter 3 details the applications of the tools introduced in Chapter 2 to the radar field. The chapter specifically discusses how machine learning and, more generally, data-driven tools have been applied to the problem of target detection in the literature, but other applications of machine learning techniques, including signal classification, are also reviewed.

Chapter 4 is devoted to various forms of hybridization between model-based and data-driven concepts. Techniques for data-driven CFAR detection are discussed, and a restricted network architecture for the analysis and design of CFAR detectors is illustrated, trying to bridge the gap between the classical model-based approach and data-driven tools. In particular, the problem of designing data-driven and hybrid detectors that are robust to signal mismatches or can conversely reject unwanted signals is considered.

Chapter 5 provides an articulate discussion of some important open issues of data-driven techniques. Cutting-edge studies aimed at understanding the generalization capabilities and, more generally, at developing foundational theories of deep learning are reviewed, highlighting open problems regarding explainability and interpretation. Other relevant aspects, such as adversarial attacks, stability, sustainability, marginal return, and patentability, are also outlined.

This book includes a carefully selected reference list for each chapter, so that the reader is easily steered toward more detailed or complementary information on each aspect. Lists of acronyms and recurrent symbols are also provided.

The potential unleashed by data-driven methods, especially in combination with more principled (model-based) approaches, is likely going to produce significant advances in the radar field at large in the near future, provided that control is kept on the performance and interpretability of the algorithms. At
the same time, all the issues and limitations need to be carefully considered in making design choices for mission-critical tasks such as radar detection. In this respect, the viewpoint of the book is that the correct attitude is neither blind faith in data-driven methods, throwing away decades of brilliant signal processing research, nor an a priori refusal of alternatives to classical model-based solutions—in this way progressively advancing, ideally, toward an ultimate paradigm that provides the benefits of both worlds.


Acknowledgments

Scientific research stands on the shoulders of giants; hence, the list of people I should acknowledge would properly need to go back to the onset of the Homo sapiens lineage, if not earlier. I would at least like to thank those who produced the theoretical and practical knowledge that made its way into this book; they are many as well. I especially acknowledge the coauthors of my papers, both on and off the book’s topic: from our joint work I learned a lot. I would also like to extend my thanks to all the scholars and community members who kindly spent time in discussion with me at conferences or in private communications, for unconditionally sharing their know-how and vision.

In truth, this book is a sudden outcome of my humble research journey, not a deliberate task I was planning for. So, sincere thanks must go to Artech House, in particular Dr. Merlin Fox (senior commissioning editor), for inviting me to develop this project, and for encouraging me to carry it out over a short time frame. I also wish to thank Ms. Casey Gerard (assistant editor) for her continuous assistance and gentle pushing, and Dr. Joseph R. Guerci for his consideration and valuable feedback. I am indebted to Prof. Giuseppe Ricci, who first introduced me to the problem of radar detection and mentored me on it over a number of years, and who, during my time as a student at the University of Salento, Italy, gifted me with an indelible imprinting through his teaching. I also wish to thank my colleague Dr. Alessio Fascista for his help in preparing some of the figures, and for the nice discussions.

Last but not least, I sincerely wish to thank my wife Serena and my two little boys Paride and Matteo, for making my life so rich and joyful: this book could not have been written without you.


1 Model-Based Adaptive Radar Detection

This chapter concisely presents the fundamental principles of radar, the basics of signal detection, and the specific problem of (space-time) adaptive radar detection. It is meant to introduce the necessary basic concepts, formalism, and performance metrics (namely, probability of false alarm, probability of detection, and the constant false alarm rate property) while providing a handy reference to popular detectors based on classical model-based tools. Emphasis is put on insights and relationships among different detection strategies, with a critical comparative discussion, rather than on a detailed and comprehensive treatment of all radar aspects. The discussion is kept as lightweight as possible, informal at times, but the essential formalism and fundamental elements of mathematical rigor are nonetheless progressively introduced in order to grasp the inner logic of model-based radar detection. Specifically, the unstructured signal case is addressed first, including the ubiquitous tool of energy detection, the Neyman–Pearson approach, and eigenvalue-based detectors. Then, structured signal models are reviewed, and one-step and two-step generalized likelihood ratio test approaches are compared, from the pioneering Kelly’s detector and the adaptive matched filter to more recent detectors. The ultimate goal is to uncover the rationale of the model-based approach to adaptive radar detection, as it has consolidated over several decades, highlighting its intrinsic interpretability and control over the achievable performance. Such aspects will be progressively contrasted with the different characteristics of the data-driven approach in the remaining chapters of the book.

1.1 Introduction to Radar Processing

Radar systems have more than a century of history and have undergone a significant evolution over the past decades. Originally developed in the military field
for long-range surveillance and early detection of air and naval targets, the applications of radar today are also very numerous in civil contexts, for instance in the automotive field. Radar detection involves multiple stages of processing and is itself only one (though very fundamental) part of a modern radar system, which generally includes several blocks and functionalities. For example, an automotive radar needs to detect range, velocity, and direction of an object, and estimation of such parameters clearly requires a detection algorithm but may also involve clustering and other algorithms. Due to technological advances, in particular in electronics and digital processing capability, high-resolution radars are in fact becoming widespread, and detection provides point clouds that need further processing (e.g., clustering to estimate the object contour).1 The focus of this book is mainly on point target detection; therefore, all the methods described may correspond to the first stages of a more advanced radar detection processing chain.

Traditional (adaptive) radar detection is a beautiful mix of engineering attitude and flourishing mathematical creation. This is due to the importance that models have in the processing of radar data for detecting the possible presence of a target, under different types of noise and disturbances that may mask it. In the following, the essential prerequisite elements of this topic are introduced, but after that the emphasis will be on providing insights and drawing relationships between different model-based solutions, rather than on a comprehensive treatment. This is a distinctive angle compared to the many excellent sources already available in the literature. The reader is indeed referred to existing books for all the details about the introduced concepts, as well as for the many aspects of radar systems that are not covered in this book. Classical, indispensable references are [1–3], while more modern books are [4–6]; see also [7], which includes additional topics such as multitarget tracking, data association, passive radar, radar networks, applications, and research directions. References more centered on adaptive radar detection are [8, 9], while cognitive radar, waveform methods, optimum resource allocation, and radar scheduling under the knowledge-based (or knowledge-aided) approach are discussed in [10, 11]. Specific references for space-time adaptive processing (STAP) are [12, 13], while for a high-level introduction the reader may refer to [14, 15].

1.1.1 Generalities and Basic Terminology of Coherent Radars

A radar is basically an engineered system employing an electromagnetic wave to determine the presence or absence of a target within a certain surveillance area, and to determine its (nonambiguous) range, essentially based on the relationship


between the round-trip time-of-flight and the distance between the radar and the target. Additionally, most modern radars can determine velocity and direction of arrival (DOA), and in some cases even perform imaging of the target. All such information is extracted from the echo of the transmitted waveform received by the radar. Specifically, considering a colocated transmitter and receiver (monostatic radar), the transmitted radiofrequency (RF) waveform propagates in the environment and may be partially reflected back to the radar by the target, as well as by other unwanted scatterers or reflectors, collectively referred to as clutter. The portion of reflected energy, called target echo or return, depends on the radar cross section of the target and channel propagation factors, and is typically weak. As a consequence, several strategies are adopted to improve the signal-to-noise ratio (SNR), including coherent integration of several pulses, multiple-antenna (array) processing, and optimal filtering in both space and time (STAP).

In a typical setting, a pulsed wave transmitter emits a sequence of pulses of finite duration—which can be either simple uncoded pulses or include intrapulse (and even interpulse) modulations to gain desired waveform properties—separated by an interval called the pulse repetition time (PRT), in which the transmitter is switched off. During this temporal period the received signal is amplified, downconverted, and sampled at a suitable frequency after waveform-matched filtering.2 The resulting samples are then processed for target detection and estimation of useful information. Specifically, considering a spherical coordinate system with the radar located at the origin, the radar transmits a beam in some azimuth and elevation angular direction and aims at determining the presence of a useful target in range along that line.
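To make the SNR benefit of coherent integration concrete, here is a minimal back-of-the-envelope sketch (not from the book; the pulse count, echo amplitude, and noise variance are illustrative assumptions). Adding N_T pulses coherently makes the deterministic echo amplitudes add in voltage, while independent noise adds in power, so the output SNR is N_T times the single-pulse SNR.

```python
# Coherent integration gain: a sketch with assumed, illustrative values
# (not taken from the book).
N_T = 16        # number of pulses integrated in a burst (assumed)
A = 1.0         # per-pulse target echo amplitude (assumed)
sigma2 = 0.5    # per-pulse noise variance (assumed)

snr_in = A**2 / sigma2                 # single-pulse SNR

# Echo amplitudes add coherently (voltage sum), while noise powers add
# because the noise samples are independent from pulse to pulse.
signal_power_out = (N_T * A) ** 2
noise_power_out = N_T * sigma2
snr_out = signal_power_out / noise_power_out

assert snr_out == N_T * snr_in         # integration gain of exactly N_T
```

This factor-of-N_T gain is the basic reason why a CPI processes many pulses jointly rather than deciding on a single weak echo.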
Modern radars sample the signal during the coherent processing interval (CPI) at discrete times that represent possible increments in range, hence the name range bins or fast-time samples. In a coherent radar, the stability of the frequency generation is sufficient to deterministically relate the phase of the received signals from pulse to pulse. During a CPI, $N_T$ pulses are typically transmitted in a burst (pulse train), and the corresponding set of range bins for each pulse is stored in memory as a matrix of in-phase and quadrature (I/Q) voltage samples (baseband equivalent). The sampling rate in this dimension is the pulse repetition frequency $\mathrm{PRF} = 1/\mathrm{PRT}$ (number of transmit/receive cycles per second, measured in hertz), which is much lower than the sampling rate in range; this dimension is called, accordingly, slow time. The CPI is the total amount of time $N_T \cdot \mathrm{PRT}$ represented by the data matrix, as shown in Figure 1.1.

2. Other architectures are possible, depending on whether pulse compression is performed before Doppler processing or different approaches are used. See the discussion later in this section and also [8, Chapter 2] and [9, Chapter 2] for analytical derivations of radar signal models. An illustrative example with some details on radar waveforms and matched filtering for pulse compression is presented in Section 2.1.1.
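As a toy illustration of the bookkeeping above (all numbers are hypothetical), the fast-time/slow-time data matrix of Figure 1.1 can be set up as:

```python
import numpy as np

# Hypothetical radar timing parameters
PRT = 1e-3                 # pulse repetition time [s]
N_T = 64                   # pulses per burst (slow time)
PRF = 1.0 / PRT            # pulse repetition frequency [Hz]
CPI = N_T * PRT            # coherent processing interval [s]

# One complex I/Q sample per range bin and per pulse: slow time x fast time
R = 200                    # number of range bins (fast-time samples)
data_matrix = np.zeros((N_T, R), dtype=complex)

print(PRF, CPI, data_matrix.shape)   # 1000.0 0.064 (64, 200)
```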


Figure 1.1 Fast-time/slow-time CPI data matrix.

Considering a single pulse, the vector $\mathbf{r}$ containing $R$ range bins (fast-time samples) obtained by sampling at time instants $\tau_1, \ldots, \tau_R$ can be modeled as
$$
\mathbf{r} = \begin{bmatrix} a\,\chi(\tau_1 - \tau, f_D) \\ a\,\chi(\tau_2 - \tau, f_D) \\ \vdots \\ a\,\chi(\tau_R - \tau, f_D) \end{bmatrix} + \begin{bmatrix} n^{(1)} \\ n^{(2)} \\ \vdots \\ n^{(R)} \end{bmatrix} \tag{1.1}
$$

where $a$ is a complex factor depending on transmit power, antenna gain and radiation pattern, path loss, and radar cross section of the target; $\tau$ is the round-trip time between radar and target, providing the time-of-flight information from which the range can be determined as $c\tau/2$ ($c$ denoting the speed of light); $f_D$ is the Doppler frequency shift; $\chi(\cdot,\cdot)$ is the ambiguity function; and $n^{(1)}, \ldots, n^{(R)}$ denote noise terms (for more details, see [9, Chapter 2] and [8]).

If the radar and the target are in relative motion, in fact, the so-called Doppler effect also occurs: the received wave reflected by the target will thus be shifted in frequency compared to the wave transmitted by the radar. Denoting by $v$ the radial component of the velocity vector of the target towards the radar and by $\lambda$ the wavelength of the transmitted signal, the Doppler frequency shift $f_D$ is approximately given by³
$$
f_D \approx \frac{2v}{\lambda} \tag{1.2}
$$
and is clearly zero hertz for a stationary target.

Target detection in the range domain requires determining whether any of the range bins $r_k$ in $\mathbf{r}$, $k = 1, \ldots, R$, contains a significant return compared to noise, which usually occurs for the $\tau_k$ closest to the true round-trip delay $\tau$, where the ambiguity function takes its maximum. For better performance, however, $N_T \gg 1$ pulses are typically processed jointly; this also

3. Note that the Doppler frequency shift may be coupled to range, but in most cases such a coupling is negligible.


enables estimation of the Doppler shift, from which the (radial) velocity of the target can be determined (based on (1.2)). Indeed, for a given range bin $r_k$, the signal considering $N_T$ pulses is expressed by an $N_T$-dimensional complex vector⁴
$$
\mathbf{z} = \alpha \mathbf{v}_T + \mathbf{n} \tag{1.3}
$$
where
$$
\mathbf{v}_T = \begin{bmatrix} 1 \\ e^{j2\pi\nu_T} \\ \vdots \\ e^{j2\pi(N_T-1)\nu_T} \end{bmatrix} \in \mathbb{C}^{N_T} \tag{1.4}
$$

is called the temporal steering vector, $\nu_T = f_D T$ is the normalized Doppler frequency of the target (with $1/T$ the sampling frequency in slow time), $\alpha = A e^{j\phi} \in \mathbb{C}$ accounts for the amplitude $A$ and phase $\phi$ of the received target echo (including the ambiguity function and possible other factors), and $\mathbf{n} \in \mathbb{C}^{N_T}$ denotes the complex random vector representing the overall noise. The latter may include thermal (white) noise, correlated clutter returns, and other possible disturbances such as interference from spectrum coexistence or jamming signals.

Model (1.3)–(1.4) indicates that detection can be performed in the Doppler domain by searching for a peak in the Fourier spectrum (estimation of a complex sinusoid in noise), computed through the fast Fourier transform (FFT) algorithm. However, we will see later that a better and more general approach is to recast the problem in the framework of hypothesis testing. In our notation, the subscript T stands for "time," to distinguish the related variables from those in "space"; that is, associated with the several antennas that may be present in the radar, for which the subscript S will be adopted. Indeed, the term "steering" is more properly used in DOA processing, where it intuitively refers to a physical or digital steering of the antenna beam, as explained in the next section.

1.1.2 Array Processing and Space-Time Adaptive Processing

In advanced radar systems, the detection process can also benefit from the presence of multiple antenna elements that coherently process one or multiple pulses, as illustrated in Figure 1.2. The term array processing generally refers to the processing of signals acquired through a set of spatially separated sensors for a great variety of problems and applications [16, 17]. In the radar context, an adaptive array of $N_S$ sensors (antennas) can overcome the directivity and resolution limitations of a single sensor [15].
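A minimal sketch of the Doppler-domain processing described above (all parameter values are hypothetical; $T$ is the PRT, and the FFT grid resolution is $1/N_T$):

```python
import numpy as np

rng = np.random.default_rng(0)
N_T = 64
PRT = 1e-3                                  # slow-time sampling interval T [s] (assumed)
lam = 0.03                                  # wavelength [m] (assumed)
nu_T = 0.2                                  # true normalized Doppler frequency

v_T = np.exp(2j * np.pi * nu_T * np.arange(N_T))   # temporal steering vector (1.4)
noise = (rng.standard_normal(N_T) + 1j * rng.standard_normal(N_T)) / np.sqrt(2)
z = 3.0 * v_T + noise                       # model (1.3) with alpha = 3

nu_hat = np.argmax(np.abs(np.fft.fft(z))) / N_T    # FFT peak search
f_D_hat = nu_hat / PRT                      # estimated Doppler shift [Hz]
v_hat = f_D_hat * lam / 2                   # radial velocity from (1.2) [m/s]
```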
A model of the same type as (1.3) is still valid for the coherent signal obtained by taking complex samples from multiple channels, namely an

4. In the remainder of the chapter we omit the range bin index k, since models and tests refer to a given (generic) radar cell at a time, as better explained in Section 1.1.3.


Figure 1.2 Visual representation of a radar with multiple antennas that processes multiple pulses (slow time); sampling instants relative to the different range bins (fast-time samples) are also depicted, with the same shade of gray indicating a given range bin along the slow time (within the CPI).

antenna array, in the case of a single pulse. The only difference is the definition of the steering vector, wherein $\nu_T$ needs to be replaced by the spatial frequency of the target $\nu_S$, so obtaining an $N_S$-dimensional spatial steering vector:
$$
\mathbf{v}_S = \begin{bmatrix} 1 \\ e^{j2\pi\nu_S} \\ \vdots \\ e^{j2\pi(N_S-1)\nu_S} \end{bmatrix} \in \mathbb{C}^{N_S} \tag{1.5}
$$

The spatial steering vector depends on the array geometry, carrier frequency of the radar, and DOA of the target echo.⁵ Improved performance is finally obtained by processing several samples from each sensor of the antenna array, in a joint multidimensional adaptive filtering of clutter and jamming [18, 19]. In fact, traditional decoupled approaches of beamforming followed by Doppler filtering (or vice versa) are not optimal [18], since clutter returns may generally manifest themselves as fully two-dimensional (2-D) (nonfactorable) structures [13]. Instead, STAP specifically refers to the

5. For example, $\nu_S = \frac{d}{\lambda}\sin\theta$ for a uniform linear array (ULA) with interelement spacing $d$ (typically $d = \lambda/2$) and radar carrier frequency $f_c = c/\lambda$, with $c$ denoting the speed of light. The use of either the sin or cos function depends on the convention adopted for measuring the DOA $\theta$.


extension of adaptive antenna techniques to simultaneously combine, in a data-adaptive way, the signals received on multiple elements of an antenna array (the spatial domain) and from multiple pulse repetition periods (the temporal domain) of a CPI [18]; that is, to perform multidimensional filtering that accounts for angle-Doppler coupling [13]. In that case, the signal model at the $h$th antenna, $h = 1, \ldots, N_S$, can be written as
$$
\mathbf{z}_h = \alpha \mathbf{v}_T\, e^{j2\pi(h-1)\nu_S} + \mathbf{n}_h. \tag{1.6}
$$
By stacking such $N_S$ vectors (each of them $N_T$-dimensional), the resulting signal model is formally of the same type as (1.3), with $\mathbf{v}_T$ replaced by the space-time steering vector
$$
\mathbf{v} = \mathbf{v}_T \otimes \mathbf{v}_S \in \mathbb{C}^{N} \tag{1.7}
$$
where $\otimes$ is the Kronecker product and $N = N_T N_S$. Notice that STAP is intrinsically a linear filtering technique for clutter and jamming cancellation through optimal weighting, conceptually not very different from minimum-variance distortionless (Capon's) beamforming [16]. Originally introduced in the context of airborne radar, STAP has shown its effectiveness in maximizing the output SNR and hence greatly improving the visibility of the target [5, 12, 13, 18]. Still, for the detection task itself, a detection statistic is needed. Such aspects will be discussed in more detail in Section 1.4 (in particular, Section 1.4.1). In the following, unless differently specified, we will generally adopt the notation $\mathbf{v}$, dropping any subscript, since the development applies irrespective of the temporal, spatial, or space-time nature of the steering vector. Figure 1.3 depicts the so-called radar datacube, also highlighting the aforementioned different types

Figure 1.3 Radar datacube with the main types of 1-D and 2-D processing highlighted.


of processing that are within the scope of this book, in particular for the problem of adaptive target detection. For further important radar topics, including 2-D imaging in slow and fast time (synthetic aperture radar (SAR)), the reader is referred to the already cited general books (in particular [4–8]) in addition to [20–27].

1.1.3 Target Detection and Performance Metrics

The focus of this book is on adaptive target detection against a background of disturbances consisting of clutter, interference, possible noise-like jammers, and (thermal) noise. Several different models for both the target echo and the background disturbance are discussed; such a variety of models is motivated by the different types of radars as well as operational contexts, which can be as different as fixed long-range surveillance radars, airborne radars for aerial and/or terrestrial target detection, naval radars for different applications, and short-to-medium-range radars for automotive applications, among others. Within this context, the theoretical framework for optimal detection is given by the powerful tool of statistical hypothesis testing, which is briefly recalled in the following.

Consider a coherent radar and a given cell under test (CUT) in range-Doppler, range-azimuth (DOA), or range-Doppler-azimuth, according to the considered case of temporal, spatial, or space-time steering vector. Radar detection is the task of deciding whether a target return (useful signal) is present against a background of disturbance (noise signal). This can be formulated as a binary hypothesis test, where the null hypothesis $H_0$ means signal absence (noise only) while the alternative hypothesis $H_1$ means signal presence (plus noise); that is:
$$
\begin{cases} H_0: \mathbf{z} = \mathbf{n} \\ H_1: \mathbf{z} = \alpha \mathbf{v} + \mathbf{n} \end{cases} \tag{1.8}
$$
where $\mathbf{v}$ is known while $\alpha$ is unknown, the latter being generally dependent on the transmit antenna gain, radiation pattern of the array, two-way path loss, and radar cross section of the target (assumed to be slowly fluctuating).
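As a minimal sketch (dimensions, frequencies, and the half-wavelength ULA assumption are illustrative), the temporal, spatial, and space-time steering vectors of (1.4), (1.5), and (1.7) can be built as:

```python
import numpy as np

def steering(nu, n):
    """Uniform steering vector [1, e^{j2*pi*nu}, ..., e^{j2*pi*(n-1)*nu}]."""
    return np.exp(2j * np.pi * nu * np.arange(n))

N_T, N_S = 16, 8
nu_T = 0.15                                      # normalized Doppler (assumed)
nu_S = 0.5 * np.sin(np.deg2rad(20))              # d = lambda/2 ULA, DOA 20 deg (assumed)

v_T = steering(nu_T, N_T)        # temporal steering vector (1.4)
v_S = steering(nu_S, N_S)        # spatial steering vector (1.5)
v = np.kron(v_T, v_S)            # space-time steering vector (1.7), N = N_T * N_S
print(v.shape)                   # (128,)
```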
The hypothesis testing problem (1.8) can be solved in different ways, thus obtaining a decision statistic $t(\mathbf{z})$, which is a function of the data (also called observations) $\mathbf{z}$ and typically produces a decision according to the detection rule
$$
t(\mathbf{z}) \underset{H_0}{\overset{H_1}{\gtrless}} \eta. \tag{1.9}
$$
The resulting detector, whenever $t(\mathbf{z})$ exceeds the threshold $\eta$, will produce either a correct detection (true positive) or a false alarm (false positive) according to the hypothesis actually in force (either $H_1$ or $H_0$, respectively); likewise, whenever $t(\mathbf{z})$ is below $\eta$, the decision will be a miss (false negative) or conversely a true negative, depending on the actual hypothesis. Natural metrics for assessing the performance of a


detector are therefore the probability of detection $P_D$ and the probability of false alarm $P_{FA}$. In practice it is of utmost importance to keep the false alarm rate under an acceptable level, which means guaranteeing a prefixed value of $P_{FA}$ during the radar operation. The threshold $\eta$ is therefore set accordingly, but since the noise level and characteristics may change over time, it is also important to ensure that $P_{FA}$ remains constant irrespective of such changes. This leads to an additional performance metric, which is the sensitivity of a detector's $P_{FA}$ to variations of the noise statistics. More specifically, a desirable property is the constant false alarm rate (CFAR), which is obtained when the statistical distribution of the test statistic $t(\mathbf{z})$ under $H_0$ is independent of any unknown parameter, as further discussed below.

1.2 Unstructured Signal in White Noise

In order to grasp, in a progressive manner, the subtleties by which theoretical hypothesis testing formulations can ultimately end up in reasonable, implementable, and effective adaptive detectors, with specific characteristics that reflect the adopted model-based formalization, it is instructive to first focus on a simpler problem than (1.8): the detection of an unstructured signal—that is, a fully generic signal for which no information is available—in white noise.⁶ Along the path, we will review the general-purpose tool of energy detection and highlight its relationship with the fundamental Neyman–Pearson approach.

1.2.1 Old but Gold: Basic Signal Detection and the Energy Detector

Detecting the possible presence of a signal in noise is not unique to the radar domain, being rather a recurrent problem in engineering as well as in biology, finance, seismology, and other fields. Intuitively, one may expect that the presence of a useful signal $s(t)$ tends to increase the magnitude of the measured signal $z(t)$ compared to the noise $n(t)$, assumed to be wide-sense stationary and white; therefore, if the overall signal level exceeds a certain threshold, this might likely indicate a deviation from the noise-only condition. It is thus rather natural to select as a possible decision statistic the energy of the signal, which means obtaining the most basic detection tool, the energy detector (ED). The ED measures the received energy $E$ in a time interval and compares it to a threshold. By the independence between $s(t)$ and $n(t)$, this informally means exploiting the relationships
$$
E(z) = \begin{cases} E(n) & \text{if the signal is absent } (H_0) \\ E(s) + E(n) & \text{if the signal is present } (H_1) \end{cases} \tag{1.10}
$$

6. The most relevant case of structured signal in colored noise will be addressed in Section 1.4, passing through intermediate formulations of structured signal in white noise in Section 1.3.


where the general definition of energy for an analog (continuous-time) signal $z(t)$ is given by $E(z) = \int_{-\infty}^{+\infty} |z(t)|^2\, dt$. In particular, assuming that $N$ samples are collected with sampling time $T$, that is $z(iT)$, $i = 1, \ldots, N$, we have that
$$
z(iT) = \begin{cases} n(iT) & \text{if the signal is absent } (H_0) \\ s(iT) + n(iT) & \text{if the signal is present } (H_1) \end{cases} \quad i = 1, \ldots, N \tag{1.11}
$$
which can be restated in vector form as (cf. (1.8))
$$
\mathbf{z} = \begin{cases} \mathbf{n} & \text{if the signal is absent } (H_0) \\ \mathbf{s} + \mathbf{n} & \text{if the signal is present } (H_1). \end{cases} \tag{1.12}
$$

Thus, the ED considers as statistic the energy of the discrete-time version of the signal (i.e., $\sum_{i=1}^{N} |z_i|^2 = \|\mathbf{z}\|^2$), according to the decision rule
$$
\|\mathbf{z}\|^2 = \mathbf{z}^\dagger \mathbf{z} \underset{H_0}{\overset{H_1}{\gtrless}} \eta. \tag{1.13}
$$

Notice that, while (1.10) is valid irrespective of the assumptions about $\mathbf{s}$ and $\mathbf{n}$, setting $\eta$ to guarantee a desired $P_{FA}$ in (1.13) requires the statistical characterization of $\|\mathbf{z}\|^2$ under $H_0$. This, in turn, entails the specification of a probabilistic model for the noise, as well as the introduction of some assumptions on the useful signal.⁷ The zero-mean Gaussian assumption for the noise is a widely adopted one, and can be justified on the basis of the central limit theorem.⁸ In addition, in some contexts $\mathbf{n}$ can be assumed to have uncorrelated entries, that is, with a white covariance matrix $E[\mathbf{n}\mathbf{n}^\dagger] = \sigma^2 \mathbf{I}$, where $\sigma^2$ is the noise power (equal to the level of the power spectral density of $n(t)$). The correlated (colored) noise model is however of greatest interest for many radar scenarios. Similarly, while in coherent radar processing the signal $\mathbf{s}$ is known up to a complex factor, as highlighted in (1.8), in other scenarios an unstructured model needs to be adopted. Putting things together, we can identify the minimal (simplest) setup for a statistical signal detection problem: deciding about the possible presence of a noncoherent signal in white complex Gaussian noise; that is, $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I})$, in particular by assuming also for $\mathbf{s}$ a white complex Gaussian

7. Curiously, the squelch circuit used, for example, in devices such as old-fashioned citizens band (CB) radio and radio microphone equipment adopts an energy detection scheme to mute the output speaker during time intervals with too low audio input, to avoid listening to uncomfortable white noise only. In those cases the threshold level (i.e., $\eta$) is typically adjusted by tuning the squelch knob. Most radar contexts require a guaranteed false alarm rate instead, due to their mission-critical nature.

8. However, situations exist where the Gaussian model is not a good fit; for instance, sea clutter in high-resolution radars at low grazing angles. We will come back to this point in Chapters 3 and 4.


model (i.e., $\mathbf{s} \sim \mathcal{CN}(\mathbf{0}, \sigma_s^2 \mathbf{I})$, with $\sigma_s^2$ the power of the useful signal). Under such assumptions, (1.12) is tantamount to testing
$$
\begin{cases} H_0: \mathbf{z} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}) \\ H_1: \mathbf{z} \sim \mathcal{CN}(\mathbf{0}, (\sigma^2 + \sigma_s^2) \mathbf{I}). \end{cases} \tag{1.14}
$$

1.2.2 The Neyman–Pearson Approach

The optimal test for the minimal model (1.14) can be derived by following the Neyman–Pearson (NP) approach [28–30], which guarantees that $P_D$ is maximized under a constraint on $P_{FA}$. Specifically, the NP lemma states that the uniformly most powerful (UMP) detector, if it exists, is given by the ratio between the probability density function (PDF) of the data under $H_1$, denoted by $f(\mathbf{z}|H_1)$, and the corresponding PDF under $H_0$, denoted by $f(\mathbf{z}|H_0)$. Such a decision statistic, referred to as the likelihood ratio (LR), for the case at hand can be easily computed as
$$
\frac{f(\mathbf{z}|H_1)}{f(\mathbf{z}|H_0)} = \frac{\frac{1}{\pi^N \det((\sigma_s^2+\sigma^2)\mathbf{I})}\, e^{-\mathbf{z}^\dagger ((\sigma_s^2+\sigma^2)\mathbf{I})^{-1} \mathbf{z}}}{\frac{1}{\pi^N \det(\sigma^2 \mathbf{I})}\, e^{-\mathbf{z}^\dagger (\sigma^2 \mathbf{I})^{-1} \mathbf{z}}} = \left(\frac{\sigma^2}{\sigma_s^2+\sigma^2}\right)^{N} e^{\frac{\sigma_s^2}{\sigma^2(\sigma_s^2+\sigma^2)}\, \mathbf{z}^\dagger \mathbf{z}} \tag{1.15}
$$
provided that $\sigma^2$ and $\sigma_s^2$ are known. Notably, testing (1.15) against a threshold is equivalent, up to a monotonic transformation and absorbing known quantities into the threshold $\eta$, to the ED test given in (1.13). Thus, interestingly, the ED rule we heuristically introduced in Section 1.2.1 actually coincides, under the suitable assumptions above, with the likelihood ratio test (LRT); that is, the optimal NP test. A first takeaway is therefore that intuition and rigor can fruitfully complement each other.⁹

Under the assumptions above, the ED, also known as the square-law detector (coupled with a linear integrator, so overall producing a noncoherent integration of the signal), admits a simple characterization. Specifically, it can be equivalently rewritten as
$$
\frac{\|\mathbf{z}\|^2}{\sigma^2} \underset{H_0}{\overset{H_1}{\gtrless}} \eta \tag{1.16}
$$
where for simplicity (and with a slight abuse of notation, hereafter) $\eta$ is still used to denote a modification of the original threshold.
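As a quick numerical check (variance values hypothetical), the logarithm of the LR in (1.15) is an increasing function of the energy $\mathbf{z}^\dagger \mathbf{z}$, which is the claimed equivalence with the ED:

```python
import numpy as np

N, sigma2, sigma_s2 = 8, 1.0, 2.0   # illustrative dimensions and powers

def log_lr(z):
    """Log of the likelihood ratio (1.15) for white Gaussian signal and noise."""
    energy = np.vdot(z, z).real      # z^dagger z
    return N * np.log(sigma2 / (sigma2 + sigma_s2)) \
        + energy * sigma_s2 / (sigma2 * (sigma2 + sigma_s2))

rng = np.random.default_rng(0)
z = rng.standard_normal(N) + 1j * rng.standard_normal(N)
# Scaling z up increases its energy, hence the LR: same ordering as the ED.
assert log_lr(2 * z) > log_lr(z)
```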
It then turns out that the left-hand side of (1.16) is the sum of the squares of $2N$ standard (i.e., with zero mean and unit variance) independent and identically distributed Gaussian random variables, due to the independence and equivariance of real and imaginary parts in (circularly symmetric) complex Gaussian variables. This implies that, under $H_0$, a central

9. This mantra is in fact one of the guiding principles of this book.


chi-square distribution with $2N$ degrees of freedom is obtained (i.e., $\frac{\|\mathbf{z}\|^2}{\sigma^2} \sim \chi^2_{2N}$), hence
$$
P_{FA} = P\left( \frac{\|\mathbf{z}\|^2}{\sigma^2} > \eta \,\Big|\, H_0 \right) = 1 - F_{\chi^2_{2N}}(\eta). \tag{1.17}
$$
From (1.17) the threshold that guarantees a chosen value of $P_{FA}$ can be obtained, by means of the inverse cumulative distribution function (CDF), as
$$
\eta = F^{-1}_{\chi^2_{2N}}(1 - P_{FA}). \tag{1.18}
$$
Similarly, a characterization of the ED under $H_1$ can be obtained by multiplying and dividing (1.16) by $\sigma^2 + \sigma_s^2$, so obtaining a statistic that is still central chi-square distributed with $2N$ degrees of freedom, compared against the scaled threshold $\frac{\sigma^2 \eta}{\sigma^2 + \sigma_s^2}$. As a consequence,
$$
P_D = P\left( \frac{\|\mathbf{z}\|^2}{\sigma^2} > \eta \,\Big|\, H_1 \right) = 1 - F_{\chi^2_{2N}}\left( \frac{\sigma^2 \eta}{\sigma^2 + \sigma_s^2} \right) = 1 - F_{\chi^2_{2N}}\left( \frac{\eta}{1 + \mathrm{SNR}} \right) \tag{1.19}
$$
where $\mathrm{SNR} = \sigma_s^2/\sigma^2$ is indeed interpretable as a signal-to-noise ratio.

The ED example is very instructive, since despite its simplicity it nonetheless touches many important aspects that are encountered, mutatis mutandis, in more advanced detectors. They are summarized as follows:

• Equations (1.17) and (1.19) show that both $P_{FA}$ and $P_D$ are directly related to the threshold $\eta$, so improving the performance on one of such metrics is a tradeoff with the performance on the other metric.

• The distribution of the test under $H_0$, hence, in turn, $P_{FA}$, is independent of any unknown parameter; therefore the ED exhibits the CFAR property.

• The detection performance ultimately depends, besides obviously upon the threshold $\eta$, only upon one parameter: specifically, (1.19) reveals that $P_D$ nonlinearly increases with the SNR, and does not depend on the signal and noise power individually. In other cases, $P_D$ may also depend on additional parameters. Generally speaking, knowing which fundamental parameters the performance depends on is very important to optimize the system design.

• Some of the parameters required for the implementation of the NP detector may be unknown in practice. For the ED, this is typically the case of $\sigma^2$ in (1.16), which is required to make the detector CFAR; by comparison, detector (1.13) is not CFAR, since its distribution under $H_0$ depends on $\sigma^2$. Fortunately, the detector can still be made adaptive by replacing the unknown $\sigma^2$ with an estimate $\hat{\sigma}^2$, as illustrated in the next section.
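Equations (1.17)–(1.19) translate directly into code; the following sketch (the values of N, P_FA, and SNR are illustrative) also verifies (1.17) by Monte Carlo:

```python
import numpy as np
from scipy.stats import chi2

N = 16                                     # number of complex samples (illustrative)
PFA = 1e-4
eta = chi2.ppf(1 - PFA, df=2 * N)          # threshold from (1.18)

snr_db = 10.0
snr = 10 ** (snr_db / 10)
PD = 1 - chi2.cdf(eta / (1 + snr), df=2 * N)   # detection probability (1.19)

# Monte Carlo check of (1.17): the H0 statistic is a sum of squares of
# 2N standard Gaussians (real and imaginary parts), i.e., chi-square with 2N dof.
rng = np.random.default_rng(0)
trials = 200_000
stat = np.sum(rng.standard_normal((trials, 2 * N)) ** 2, axis=1)
pfa_emp = np.mean(stat > eta)              # close to the design PFA
```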


1.2.3 Adaptive CFAR Detection

The last point in the list in Section 1.2.2 is particularly important and deserves further development. Figure 1.4 depicts the conceptual scheme of an adaptive ED that uses an estimate $\hat{\sigma}^2$ in place of the unknown $\sigma^2$. Specifically, $\hat{\sigma}^2$ can be obtained as the sample variance (or another estimator, as discussed in Section 1.2.4) from a set of $K$ independent and identically distributed secondary data $z_i$, $i = 1, \ldots, K$, representative of the noise-only condition (i.e., sharing the same variance $\sigma^2$ as the CUT under $H_0$). Such data can be obtained from selected range bins in the neighborhood of the CUT, possibly leaving some guard cells, as better discussed later in this section. Setting the threshold for the resulting adaptive detector¹⁰
$$
\frac{\|\mathbf{z}\|^2}{\hat{\sigma}^2} \equiv \frac{\|\mathbf{z}\|^2}{\sum_{i=1}^{K} |z_i|^2} \underset{H_0}{\overset{H_1}{\gtrless}} \eta \tag{1.20}
$$
requires accounting for the additional statistical variability introduced by $\hat{\sigma}^2$. By multiplying and dividing by $\sigma^2$, the left-hand side of (1.20) turns out to be the ratio of two sums of squared independent standard Gaussian variables. In particular, under $H_0$, it is distributed as the ratio of two independent central chi-square distributions with $2N$ and $2K$ degrees of freedom, respectively, so it has a central Fisher–Snedecor F distribution. Again, this means that the adaptive ED exhibits the CFAR property; that is, setting a value for $\eta$ yields the same $P_{FA}$ irrespective of the actual value of $\sigma^2$. More in general, all the considerations made in Section 1.2.2 still apply. In particular, the distribution of the adaptive ED under $H_1$ (which is equivalent to a noncentral F distribution) depends only upon the SNR (besides obviously $\eta$), hence so does the detection performance in terms of $P_D$.

The NP-based approach made adaptive by the use of sample estimates in place of unknown parameters paves the way for the generalized likelihood ratio test (GLRT) approach that will be discussed in Section 1.3. Beforehand, we will exploit once more the adaptive ED example to show how a theoretical derivation

Figure 1.4 Conceptual scheme of an adaptive radar receiver based on the energy detector.

10. The factor $\frac{1}{K}$ appearing in $\hat{\sigma}^2$, being a constant, has been absorbed into the threshold $\eta$.


can be used to justify a practical scheme. We already noticed that the ED scheme, in spite of its intuitive origin, is actually the optimal NP test under suitable assumptions. Likewise, the adaptive ED in (1.20) can be recast as
$$
\underbrace{\|\mathbf{z}\|^2}_{\text{nonadaptive statistic}} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \underbrace{\eta\, \hat{\sigma}^2}_{\text{adaptive threshold}} \tag{1.21}
$$
that is, a nonadaptive ED followed by a data-adaptive threshold aimed at recovering the CFAR property. The latter is usually referred to as a CFAR processor and represents a traditional processing technique in many practical radar systems. It will be briefly discussed below.

The process of estimating the noise power $\sigma^2$—more generally, the whole covariance matrix in the case of colored noise—is usually performed by processing selected radar cells in the neighborhood of the CUT, namely by averaging output samples of a square-law detector in lagging and leading fast-time windows, as shown in Figure 1.5. The assumption is that the latter contain only independent realizations of the noise process, sharing the same statistical properties of the CUT (homogeneous environment). This processing, known as cell-averaging CFAR (CA-CFAR), also admits multiple robust variants, namely based on ordered statistics, such as the OS-CFAR, GO-CFAR, and several other schemes [4], which try to limit the impact of clutter edges, possible target return contamination, and other deviations from the homogeneity assumption (nonhomogeneous environment).¹¹ CFAR processors are very popular in conventional radar receivers,

Figure 1.5 Cell-averaging CFAR scheme.

11. Additional discussion on CFAR processors will be provided in Section 3.3.4.
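The cell-averaging scheme of Figure 1.5 can be sketched as follows (window sizes, P_FA, and the exponentially distributed square-law samples are illustrative assumptions; the scale factor used is the classical CA-CFAR result for exponential noise, alpha = K*(PFA^(-1/K) - 1)):

```python
import numpy as np

def ca_cfar(x, num_ref=16, num_guard=2, pfa=1e-3):
    """Cell-averaging CFAR on square-law samples x (one value per range cell)."""
    K = num_ref                                   # total reference cells (both windows)
    alpha = K * (pfa ** (-1.0 / K) - 1.0)         # scale factor for exponential noise
    half = K // 2
    detections = []
    for cut in range(half + num_guard, len(x) - half - num_guard):
        lead = x[cut - num_guard - half : cut - num_guard]   # leading window
        lag = x[cut + num_guard + 1 : cut + num_guard + 1 + half]  # lagging window
        noise_level = (lead.sum() + lag.sum()) / K           # noise power estimate
        if x[cut] > alpha * noise_level:                     # adaptive threshold, cf. (1.21)
            detections.append(cut)
    return detections

rng = np.random.default_rng(0)
power = rng.exponential(scale=1.0, size=200)   # noise-only square-law output
power[100] += 500.0                            # strong synthetic target
print(ca_cfar(power))                          # the target cell at index 100 is flagged
```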


as they are able to adjust the threshold in a dynamic way while preserving good detection capabilities. An illustration of the typical behavior, also in comparison to the NP-based threshold setting (i.e., with perfect knowledge of the noise variance $\sigma^2$), is given in Figure 1.6. However, the equivalence of the two processing approaches—that is, adaptive detection followed by nonadaptive thresholding as in (1.20), and nonadaptive detection followed by adaptive thresholding as in (1.21)—does not hold true in general. The ED case is rather an exception, due to the lack of structure in the signal $\mathbf{s}$ and the white noise assumption. Adaptive detectors can better exploit secondary data in the presence of clutter and/or interference, as will be discussed in Section 1.4. Before addressing such a case, we discuss the (somewhat intermediate) case of a correlated signal in white noise.

1.2.4 Correlated Signal Model in White Noise

Detecting a correlated signal in white noise is a generalization of the noncoherent problem (in Section 1.2.1), where instead the signal $\mathbf{s}$ was modeled as random and white (uncorrelated). This intermediate case has found application in array processing (e.g., DOA estimation and adaptive beamforming) as well as in wireless communications. In the latter context, spectrum sensing is becoming increasingly important to overcome spectrum scarcity as part of the cognitive radio

Figure 1.6 Illustration of the adaptive behavior of the threshold in CA-CFAR.


paradigm: this requires uncooperative detection, through overhearing (sensing) of the channel, of transmissions that can be originated by multiple sources (users). The similarity of this problem with that of noncoherent radar detection is striking, and in fact the ED is a basic spectrum sensing tool as well. However, the ED does not exploit the fact that communication waveforms have a certain statistical correlation over time (autocorrelation). Indeed, the assumption of a white signal for $\mathbf{s}$ introduced in the NP-based derivation of the ED is a special case of a more general model in which the signal is correlated (colored) instead of white, which leads to eigenvalue (EV)-based detectors. Specifically, problem (1.12) can be restated by exploiting the structure of the covariance matrix $\mathbf{R}_z$ of $\mathbf{z}$, given by
$$
\mathbf{R}_z = \begin{cases} \sigma^2 \mathbf{I} & \text{under } H_0 \\ \mathbf{R}_s + \sigma^2 \mathbf{I} & \text{under } H_1 \end{cases} \tag{1.22}
$$
where $\mathbf{R}_s$ is the covariance matrix of the signal $\mathbf{s}$ and $\sigma^2 \mathbf{I}$ is the covariance matrix of the white noise (as before). Model (1.22) suggests interesting strategies for signal detection: by noticing the relationship between the maximum and minimum eigenvalues of the different covariance matrices, that is,
$$
(\lambda_{\max}(\mathbf{R}_z), \lambda_{\min}(\mathbf{R}_z)) = \begin{cases} (\sigma^2, \sigma^2) & \text{under } H_0 \\ (\lambda_{\max}(\mathbf{R}_s) + \sigma^2, \lambda_{\min}(\mathbf{R}_s) + \sigma^2) & \text{under } H_1 \end{cases} \tag{1.23}
$$
a detection statistic can be intuitively identified in their ratio, since $\frac{\lambda_{\max}(\mathbf{R}_z)}{\lambda_{\min}(\mathbf{R}_z)} = 1$ under $H_0$ while $\frac{\lambda_{\max}(\mathbf{R}_z)}{\lambda_{\min}(\mathbf{R}_z)} > 1$ under $H_1$ [31]. This leads to the maximum-minimum eigenvalue detector (MMED):
$$
\frac{\lambda_{\max}(\mathbf{R}_z)}{\lambda_{\min}(\mathbf{R}_z)} \underset{H_0}{\overset{H_1}{\gtrless}} \eta. \tag{1.24}
$$
A simpler, alternative solution is to use as a detection statistic the ratio between the energy of the signal and the minimum eigenvalue, thus obtaining the minimum eigenvalue detector (MED):
$$
\frac{\|\mathbf{z}\|^2}{\lambda_{\min}(\mathbf{R}_z)} \underset{H_0}{\overset{H_1}{\gtrless}} \eta. \tag{1.25}
$$
Other possibilities also exist, for instance the ratio between the sums of the amplitudes (moduli) of the diagonal and off-diagonal entries of $\mathbf{R}_z$, which, as for the MMED, is equal to 1 under $H_0$ and greater than 1 under $H_1$, due to the fact that the off-diagonal entries are expected to be non-zero only under $H_1$ [32].
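A sketch of the adaptive MMED of (1.24), with the unknown $\mathbf{R}_z$ replaced by a sample covariance estimate (dimensions, snapshot count, and the rank-one signal model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, num_snapshots = 8, 500

def sample_cov(Z):
    """Sample covariance estimate from the columns of Z (N x num_snapshots)."""
    return (Z @ Z.conj().T) / Z.shape[1]

def mmed_stat(R_hat):
    ev = np.linalg.eigvalsh(R_hat)       # ascending real eigenvalues (Hermitian input)
    return ev[-1] / ev[0]                # lambda_max / lambda_min, cf. (1.24)

noise = (rng.standard_normal((N, num_snapshots))
         + 1j * rng.standard_normal((N, num_snapshots))) / np.sqrt(2)
t_h0 = mmed_stat(sample_cov(noise))      # close to 1, up to finite-sample spread

# Correlated signal: a common random waveform seen on all channels (rank-one R_s)
wave = (rng.standard_normal((1, num_snapshots))
        + 1j * rng.standard_normal((1, num_snapshots))) / np.sqrt(2)
s = 2.0 * np.ones((N, 1)) * wave
t_h1 = mmed_stat(sample_cov(noise + s))  # well above the H0 value
```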


In practice, $\mathbf{R}_z$ is unknown, hence such detectors are made adaptive by replacing it with a covariance matrix estimate based on a sufficiently large number of samples. In any case, the detector can be implemented without requiring knowledge of the noise power $\sigma^2$, and for such a reason it is referred to as blind detection in spectrum sensing jargon. This corresponds to having the CFAR property, since the distribution under $H_0$ is independent of unknown parameters, and hence the detection threshold $\eta$ can be set offline irrespective of the noise power (and signal covariance).

The reason for discussing the correlated signal model in white noise is that it lends itself well to bridging the fully unstructured, uncorrelated case of the ED to the structured, correlated case of interest in radar detection. Indeed, the MED (1.25) is remarkably similar to the ED (1.16), the only difference being the way the normalization of the energy is performed to gain the CFAR property (detector (1.13), in fact, has a distribution under $H_0$ that depends on $\sigma^2$). In the adaptive versions of the MMED/MED, this translates into having $\lambda_{\min}(\mathbf{R}_z)$ replaced by $\lambda_{\min}(\hat{\mathbf{R}}_z)$, paralleling the substitution of $\sigma^2$ with $\hat{\sigma}^2$ in (1.20), as a different estimator of the noise power. Thus, the same scheme of Figure 1.4 applies, although with a different definition of $\hat{\sigma}^2$.

A natural question thus ultimately arises: What is the best estimator to make the ED adaptive, in particular in the radar context? This issue is actually much broader, as the two-step approach followed here (i.e., derivation of the test statistic based on known noise statistics, then replacement of the unknown parameters with estimates) is generally suboptimal. Moreover, this should be paired with the discussion at the end of Section 1.2.3, regarding the general suboptimality of a nonadaptive detector followed by a CFAR processor (adaptive thresholding) compared to fully adaptive detection (which uses a nonadaptive threshold).
This important difference and its remarkable implications will become evident in Section 1.4. To sum up Section 1.2, a few observations are in order. Although the unstructured model brings less information about the target than the structured model that will be discussed in Section 1.3, one key advantage of model-based detectors can already be observed from the development above: they typically provide understanding and performance guarantees. In particular, knowing that a detector is the UMP test for the considered assumptions (thanks to its NP-based derivation) means that there is no need to look for a better detector for such a case, provided that the noise statistics are known. But the latter does not hold true in practice, hence an intuitive (and quite effective) adaptation strategy is to plug estimates of the unknown parameters into the detection statistic. This is nonetheless suboptimal, hence there is theoretically backed room for performance improvement through more sophisticated approaches. Still, the CFAR property of a detector properly made adaptive ensures that the threshold can be set offline to guarantee a chosen PFA, irrespective of the noise statistics actually encountered

Coluccia: “chapter1_v2” — 2022/10/7 — 13:05 — page 17 — #17


Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

online (during the radar operation). As a matter of fact, however, extending this intuition to more complex cases requires more sophisticated analytical tools. Such a merging of intuitive reasoning and insights with an analytical, rigorous development will be further pursued in Chapter 4, in particular by discussing a feature-space interpretation of CFAR adaptive detection inspired by a data-driven approach.
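The offline threshold-setting argument can be made concrete with a small Monte Carlo sketch. The statistic below is in the spirit of the adaptive ED of (1.20); all sizes and the 10x noise-power change are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 32          # CUT size and number of secondary (noise-only) samples (assumed)
PFA = 1e-2            # desired false-alarm probability
trials = 50_000

def adaptive_ed(z, sec):
    # Adaptive ED in the spirit of (1.20): CUT energy normalized by a
    # noise-power estimate from secondary data
    return np.sum(np.abs(z) ** 2, axis=-1) / np.mean(np.abs(sec) ** 2, axis=-1)

def draw(m, n, sigma2):
    # Complex Gaussian noise with per-sample power sigma2
    return np.sqrt(sigma2 / 2) * (rng.standard_normal((m, n))
                                  + 1j * rng.standard_normal((m, n)))

# Offline calibration with unit noise power...
eta = np.quantile(adaptive_ed(draw(trials, N, 1.0), draw(trials, K, 1.0)), 1 - PFA)

# ...still yields (about) the desired PFA when the actual noise power is 10x
# larger: the CFAR property in action.
pfa_hat = np.mean(adaptive_ed(draw(trials, N, 10.0), draw(trials, K, 10.0)) > eta)
print(pfa_hat)   # close to PFA
```

The same calibrated η keeps the false-alarm rate at the design value because the H₀ distribution of the normalized statistic does not depend on the noise power.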

1.3 Structured Signal in White Noise

The aim of this section is to discuss several approaches to solve the detection problem in which the signal of interest in (1.12) has the structure s = αv, with v a known steering vector as introduced in (1.8). This problem is of greatest interest in the context of coherent radars, since it makes better use of the available information by coherent integration of the data, reflected in the presence of a known (temporal, spatial, or space-time) steering vector v. The literature has been enriched over several decades with a number of ideas and approaches that rely on different modeling assumptions and background hypotheses. A number of the most important ones will be reviewed throughout the rest of this chapter, in particular based on the one-step and two-step GLRT methodologies, from the pioneering Kelly's detector and adaptive matched filter (AMF) to more recent detectors. As a preparatory step, the structured signal model is addressed below under multiple facets in the preliminary, simpler case of white noise.

1.3.1 Detection of a Structured Signal in White Noise and Matched Filter

Let us first consider the solution to problem (1.8) for white noise n ∼ CN(0, σ²I) with known σ². In general, according to the assumptions on the complex factor α = Ae^{jφ}, different derivations of the NP test are obtained. For known signal phase φ and unknown signal amplitude A = |α|, the hypothesis testing problem reduces to testing A = 0 vs A > 0. The LRT is thus given by

$$\frac{\frac{1}{(\pi\sigma^2)^N}\, e^{-\frac{\|z - A e^{j\phi} v\|^2}{\sigma^2}}}{\frac{1}{(\pi\sigma^2)^N}\, e^{-\frac{\|z\|^2}{\sigma^2}}} \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.26}$$

which after a monotonic transformation can be rewritten as

$$\Re\{e^{-j\phi}\, v^\dagger z\} \overset{H_1}{\underset{H_0}{\gtrless}} \eta. \tag{1.27}$$
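As a minimal numerical sketch (with assumed toy parameters: N = 16 pulses, normalized Doppler 0.2, not values from the text), the coherent statistic (1.27), its phase-blind counterpart (1.29), and the noncoherent ED of Section 1.2 can be contrasted on the same data:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16
nu = 0.2                                        # assumed normalized Doppler
v = np.exp(2j * np.pi * nu * np.arange(N))      # temporal steering vector, ||v||^2 = N
phi, A = 0.7, 0.8                               # assumed target phase and amplitude
z = A * np.exp(1j * phi) * v \
    + (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

# (1.27): coherent matched filter with compensation of the known phase
t_mf_known_phase = np.real(np.exp(-1j * phi) * np.vdot(v, z))
# (1.29)-style statistic, usable when the phase is unknown
t_mf_unknown_phase = np.abs(np.vdot(v, z)) ** 2
# Energy detector: noncoherent, ignores the structure captured by v
t_ed = np.sum(np.abs(z) ** 2)

print(t_mf_known_phase, t_mf_unknown_phase, t_ed)
```

Under H₁ the matched-filter output grows coherently (mean of order N²|α|² for the squared statistic), while the ED output grows only like N|α|² + Nσ², which is the essence of coherent integration gain.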

The interpretation of this detector includes a filter matched to v, followed by a phase compensation by the value φ. In practice the phase of the target is hardly known, hence a more reasonable option is to model it as a random variable


uniformly distributed in [0, 2π). To obtain the NP test for this case, which is also known as the average likelihood ratio (ALR), the PDF of z (given φ) must be multiplied by the PDF of φ and integrated, so as to marginalize the phase according to the law of total probability and obtain the (unconditional) PDF of z, as follows:

$$\frac{\frac{1}{(\pi\sigma^2)^N}\,\frac{1}{2\pi}\int_0^{2\pi} e^{-\frac{\|z - A e^{j\phi} v\|^2}{\sigma^2}}\, d\phi}{\frac{1}{(\pi\sigma^2)^N}\,\frac{1}{2\pi}\int_0^{2\pi} e^{-\frac{\|z\|^2}{\sigma^2}}\, d\phi} \;\propto\; I_0\!\left(\frac{2A\,|v^\dagger z|}{\sigma^2}\right) \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.28}$$

where I₀(·) is the modified Bessel function of order zero. Since the latter is a strictly monotonically increasing function, an equivalent test is given by

$$|v^\dagger z|^2 \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.29}$$

which can be interpreted as the (squared) modulus of the output of the same matched filter appearing in (1.27). This is reasonable, since the phase is unknown and hence cannot be exploited by the detector. Notice that (1.29) (but also (1.27)) makes apparent the coherent integration of the data through the relationship captured by the steering vector v, while the ED (1.13) performs noncoherent processing. In particular, for v = 1 (which corresponds to zero normalized frequency, namely 0-Hz Doppler shift or 0° DOA) the ED in (1.13) computes the sum of the squared moduli of the components of z, while detector (1.29) computes the squared modulus after complex-valued summation of the I/Q samples. More generally, |v†z|² can be interpreted as the energy of the output of a linear filter with weights v: indeed a matched filter in the temporal case, a beamformer in the spatial case, or a vectorized 2-D filter in the space-time (STAP) case.12 Notice though that the MF (1.29) is not CFAR; again, it may be made adaptive by dividing it by an estimate of the noise power based on secondary data, as done for the ED to obtain (1.20), so producing

$$\frac{|v^\dagger z|^2}{\hat\sigma^2} \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.30}$$

which instead possesses the CFAR property and may be implemented through a CFAR processor, as generally discussed in Section 1.2.3. Clearly, the approach of introducing random models is a general one, and several traditional radar detection schemes (sometimes based on a single pulse) also consider a PDF for the target amplitude [4]. The drawback of this approach is that quite strong assumptions on the target are made, with no guarantee that they will be met under operational conditions.

12. This interpretation will be further discussed in Section 1.4.2.

Although one might partially


extend this criticism to all model-based approaches, assumptions on other aspects (namely, the noise distribution) can be checked in a calibration phase and at runtime based on secondary data, while by their nature target characteristics continuously change and no additional data is typically assumed to be available for them.13 Moreover, the integrals involved in the ALR are often analytically intractable. Such motivations support the choice, in several advanced adaptive schemes, of modeling unknown parameters as deterministic rather than random.14

1.3.2 Generalized Likelihood Ratio Test

Unfortunately, UMP tests often do not exist because in practice more than one parameter is unknown [33]. In other words, the hypotheses to be tested become composite (instead of simple), making the NP approach inapplicable. This is already the case of unknown α ∈ ℂ in white noise. Still, following the same rationale adopted in Section 1.2.3 for making NP-based detectors adaptive with respect to the noise power, one may generally consider replacing other unknown parameters with proper estimates; in general, the nonexistence of UMP tests explains the need to keep developing new detectors with improved performance in different situations. A difficulty, however, is that, as mentioned at the end of the previous section (Section 1.3.1), secondary data bring information only about the noise, not the useful signal, since the presence of the target and its characteristics are unknown. Still, the logic of the argument remains, and in fact one may consider a generalization of the LRT where unknown parameters are conceptually replaced by estimates, as if the corresponding hypothesis were true. Indeed, a further rewriting of the ED is

$$\frac{\frac{1}{N}\|z\|^2}{\sigma^2} \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.31}$$

which admits an intuitive interpretation as the ratio between the estimated signal power and the true noise power (or its estimate based on secondary data, in the case of the adaptive ED), aimed at solving a hypothesis test on the signal variance. In particular, (1/N)‖z‖² is the sample variance of the data, which is a consistent and unbiased estimator of the overall signal variance; more precisely, it is the maximum likelihood (ML) estimator under the Gaussian assumption.

13. This will indeed be one of the discriminating points for the possible adoption of data-driven techniques, as discussed in Chapter 2 and, more specifically, in Chapter 3.
14. See also [8, Chapter 2] for additional NP detectors under different hypotheses, in the more general colored noise scenario (discussed later in Section 1.4), and for the GLRT framework discussed next in Section 1.3.2.

Thus, the numerator in the left-hand side of (1.31) accounts for the signal-plus-noise variance (under


H₁) or the noise-only variance (under H₀). As a consequence, the ratio in (1.31) is reminiscent of the rationale underlying EV-based detectors in Section 1.2.4, and can be adopted to test the estimated overall variance (noise plus signal, if any) against the noise-only variance.15 For the problem at hand (1.8), lacking any prior knowledge on α, its best estimate can indeed be found by resorting to the ML approach, which guarantees asymptotic unbiasedness and consistency in the size N (sample size) of the observation vector z [28]. The ML estimator α̂ is obtained by maximizing the likelihood function with respect to α, so obtaining an estimate of the target amplitude

$$\hat\alpha = \arg\max_{\alpha}\, \frac{1}{(\pi\sigma^2)^N}\, e^{-\frac{\|z - \alpha v\|^2}{\sigma^2}} = \arg\min_{\alpha}\, \|z - \alpha v\|^2 = \frac{v^\dagger z}{v^\dagger v} \tag{1.32}$$

to be plugged into the likelihood ratio test (1.26). The resulting detector, the GLRT, is given by

$$\frac{e^{-\frac{\|z - \hat\alpha v\|^2}{\sigma^2}}}{e^{-\frac{\|z\|^2}{\sigma^2}}} = e^{-\frac{\|P_v^\perp z\|^2 - \|z\|^2}{\sigma^2}} \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.33}$$

where $P_M = M(M^\dagger M)^{-1} M^\dagger$ is the projector onto the subspace spanned by the columns of the matrix M, and $P_M^\perp = I - P_M$ the projector onto the corresponding orthogonal complement. It follows that $P_v^\perp z = \left(I - \frac{v v^\dagger}{\|v\|^2}\right) z = z - \frac{v^\dagger z}{\|v\|^2}\, v$ and $\|P_v^\perp z\|^2 = z^\dagger P_v^\perp z = \|z\|^2 - |v^\dagger z|^2/\|v\|^2$, finally leading to

$$\frac{|v^\dagger z|^2}{\|v\|^2} \overset{H_1}{\underset{H_0}{\gtrless}} \eta. \tag{1.34}$$

Since the constant ‖v‖² is irrelevant,16 test (1.34) is equivalent to the matched filter (1.29), which was derived via the NP procedure assuming a random phase. This intuitively confirms the soundness of the GLRT procedure, which has no general optimality guarantees but can nonetheless lead to meaningful detection schemes in cases where the NP approach is inapplicable. More specifically, the UMP test might not exist, but several reasonable schemes can be found through the GLRT approach, as discussed in detail in Section 1.4 for the general colored noise scenario.

15. In fact, in the limit N → ∞, $\frac{1}{N}\frac{\|z\|^2}{\sigma^2}$ goes to 1 under H₀ or to 1 + SNR under H₁, similar to the MED and related EV-based detectors.
16. It can be absorbed into η as usual; alternatively, the steering vector can be normalized to unit norm by including a factor $1/\sqrt{N}$ in its definition.
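The projector identities used in the derivation of (1.33)-(1.34) are easy to verify numerically; a short sketch with an arbitrary (assumed) steering vector and random data:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
v = np.exp(2j * np.pi * 0.1 * np.arange(N))   # assumed steering vector
z = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# ML amplitude estimate (1.32)
alpha_hat = np.vdot(v, z) / np.vdot(v, v)

# Projectors onto span{v} and its orthogonal complement
Pv = np.outer(v, v.conj()) / np.linalg.norm(v) ** 2
Pv_perp = np.eye(N) - Pv

# GLRT statistic (1.34): |v^H z|^2 / ||v||^2, i.e., the energy ||P_v z||^2
t_glrt = np.abs(np.vdot(v, z)) ** 2 / np.linalg.norm(v) ** 2
print(abs(alpha_hat), t_glrt)
```

The residual of the least-squares fit equals the energy in the orthogonal complement, ‖z − α̂v‖² = ‖P_v^⊥z‖², which is exactly the substitution that turns the LRT into (1.33).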


The GLRT approach can be applied analogously in the presence of several unknown parameters; thus, in general, it can be written as

$$\frac{\max_{\theta \in \Theta_1} f(z \mid \theta, H_1)}{\max_{\theta \in \Theta_0} f(z \mid \theta, H_0)} \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.35}$$

where f(z | θ, H₀) and f(z | θ, H₁) denote the PDF of z under hypotheses H₀ and H₁, respectively, given a vector of parameters θ, and Θᵢ denotes the corresponding set of admissible values. Since the two maximizations are performed independently for the numerator and denominator, this exactly corresponds to replacing θ with different ML estimates θ̂₁ and θ̂₀ under the two hypotheses H₁ and H₀, respectively. As a final remark, it is worth anticipating that additional data might be needed for the two maximizations to be well-defined, in particular under colored noise with unknown statistics. For the white noise case, instead, the detection problem can be solved in a fully adaptive way by also estimating σ² under the GLRT procedure; that is:

$$\frac{\max_{\alpha,\sigma^2}\, \frac{1}{(\pi\sigma^2)^N}\, e^{-\frac{\|z - \alpha v\|^2}{\sigma^2}}}{\max_{\sigma^2}\, \frac{1}{(\pi\sigma^2)^N}\, e^{-\frac{\|z\|^2}{\sigma^2}}} \overset{H_1}{\underset{H_0}{\gtrless}} \eta. \tag{1.36}$$

Derivative-based maximization with respect to σ² returns the ML estimators σ̂₁² = (1/N)‖z − αv‖² and σ̂₀² = (1/N)‖z‖² under H₁ and H₀, respectively, which substituted back into (1.36) yield the equivalent test

$$\max_{\alpha}\, \frac{\hat\sigma_0^2}{\hat\sigma_1^2} = \frac{\|z\|^2}{\min_{\alpha}\, \|z - \alpha v\|^2} = \frac{\|z\|^2}{\|P_v^\perp z\|^2} \overset{H_1}{\underset{H_0}{\gtrless}} \eta. \tag{1.37}$$

This should be immediately contrasted with (1.33); in particular, the latter computes the difference of the same quantities appearing as a ratio in (1.37), resulting in a different behavior of the resulting statistics. Indeed, noticing that ‖z‖² = ‖(P_v + P_v^⊥)z‖² = ‖P_v z‖² + ‖P_v^⊥ z‖², it turns out that for the test (1.37) the following chain of equivalences holds true:

$$\frac{\|z\|^2}{\|P_v^\perp z\|^2} \equiv \frac{\|P_v z\|^2}{\|P_v^\perp z\|^2} \equiv \frac{|v^\dagger z|^2}{\frac{1}{N-1}\|P_v^\perp z\|^2} \tag{1.38}$$

since ‖P_v z‖² = z†P_v z = |v†z|²/‖v‖². The denominator of the right-most expression in (1.38) is an unbiased estimator of the noise power, since P_v^⊥z does not contain any coherent signal anymore, as only noise components are


left after the orthogonal projection; indeed, simple calculations show that17 E[(1/(N−1))‖P_v^⊥z‖²] = (1/(N−1))E[‖P_v^⊥n‖²] = σ². As a result, (1.37) can be shown to be CFAR thanks to the denominator, which provides adaptivity to the noise power, similarly to what was observed in Sections 1.2.3 and 1.2.4 (particularly regarding the adaptive ED (1.20) and the MED (1.25)). Besides the different way the noise power is estimated, the main difference between (1.37) and (1.20) is that the former does not use secondary data, as it estimates σ² from the noise subspace (orthogonal to the signal subspace spanned by v); such a decomposition is depicted in Figure 1.7. Indeed, v†z is the scalar product between the data vector z and the steering vector v, which after normalization is equal to the estimated target amplitude (1.32). Thus, the second expression in (1.38) can be geometrically interpreted as the ratio of the (squared) lengths of the projections of the data vector onto the signal and noise subspaces, respectively, which in fact increases as a stronger target echo component is present in z. The colored noise case, addressed in Section 1.4, will require instead a set of K ≥ N secondary data to obtain the GLRT under both the one-step and two-step approaches.

Figure 1.7  Geometrical representation of the signal subspace (spanned by the steering vector v) and noise subspace (orthogonal complement), and examples of decomposition of the data vector z under the two hypotheses (left: H₀, right: H₁).

17. In fact:
$$\frac{1}{N-1}E[\|P_v^\perp n\|^2] = \frac{1}{N-1}E[\|n\|^2] - \frac{1}{N(N-1)}E[|n^\dagger v|^2] = \frac{N}{N-1}\sigma^2 - \frac{1}{N(N-1)}\, v^\dagger E[n n^\dagger] v = \left(\frac{N}{N-1} - \frac{\|v\|^2}{N(N-1)}\right)\sigma^2 = \sigma^2$$
since ‖v‖² = N.


Finally, the left-most form of the detector in (1.37) may also be interpreted, in light of the insights from (1.38), as the adaptive version of detector (1.31), thus confirming the capability of the GLRT procedure to deliver meaningful detection schemes despite its lack of theoretical optimality.

1.3.3 Detection of an Unknown Rank-One Signal in White Noise

Still considering white noise, a variation of the detection problem addressed above is worth a brief discussion. If one relaxes the structure of the steering vector, a situation in between the unstructured random model s (white or correlated) and the structured model αv is obtained. In particular, the data z can be arranged in an N_b × N matrix X, with N_b denoting the number of beams used to cover a spatial sector or the number of array sensors, so obtaining the hypothesis testing problem

$$\begin{cases} H_0: X = N \\ H_1: X = a_S a_T^\dagger + N \end{cases} \tag{1.39}$$

where a_S and a_T are unknown arbitrary spatial and temporal steering vectors, respectively, and N is the corresponding white noise matrix [34]. The GLRT in this case is given by

$$\frac{\lambda_{\max}(X X^\dagger)}{\mathrm{Tr}\{X X^\dagger\}} \overset{H_1}{\underset{H_0}{\gtrless}} \eta. \tag{1.40}$$

This decision statistic shows a similarity with that of the MMED in (1.24), XX† being proportional to the sample covariance estimate of R_z. In particular, under H₀, Tr{XX†} is proportional to the sample variance estimate of σ². Moreover, under both hypotheses λ_max(XX†) accounts for the energy of the useful signal. As a consequence, the ratio in the left-hand side of (1.40) is independent of any unknown parameter under H₀, hence the detector is CFAR. Moreover, its detection performance (equivalently, its distribution under H₁) depends only on the SNR = ‖a_S‖²‖a_T‖²/σ² [34]. Notice that treating the steering vectors as unknown and arbitrary quantities may be meaningful when uncertainties exist, for instance because of miscalibration or other mismatches.
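The rank-one statistic (1.40) can be sketched in a few lines; the scenario parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
Nb, N = 4, 16      # assumed number of beams and pulses
sigma2 = 1.0

def t_rank1(X):
    # (1.40): dominant-eigenvalue-to-trace ratio of X X^H; CFAR because both
    # numerator and denominator scale with the (unknown) noise power
    G = X @ X.conj().T
    w = np.linalg.eigvalsh(G)          # eigenvalues in ascending order
    return w[-1] / np.trace(G).real

noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((Nb, N))
                               + 1j * rng.standard_normal((Nb, N)))
aS = rng.standard_normal(Nb) + 1j * rng.standard_normal(Nb)
aT = rng.standard_normal(N) + 1j * rng.standard_normal(N)

X0 = noise                               # H0: noise only
X1 = np.outer(aS, aT.conj()) + noise     # H1: unknown rank-one component present
print(t_rank1(X0), t_rank1(X1))
```

Note that scaling X by any constant leaves the statistic unchanged, which is the CFARness with respect to the noise power observed in the text.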
In such cases, in fact, looking for a specific known v might counterproductively result in signal rejection, as will be better discussed in Section 1.4.3. The detection scheme (1.40) may be viewed as a fast and simple preprocessing step, which upon detection of a possible target is followed by a more sophisticated detector to confirm or, conversely, invalidate the decision.18

18. It has been observed in [34] that (1.39) is also relevant when a monopulse radar is used, with N_b = 2 and the rows of X corresponding to the outputs of the sum and difference channels, respectively.

Remarkably, if a_S is modeled as a random vector according to the same model CN(0, σ_s²I) adopted in Section 1.2 (instead of as an unknown vector),


the GLRT is still given by (1.40). Finally, for N_b = 2 this is the UMP invariant (UMPI) test (i.e., the UMP among a family of detectors that are invariant under a suitable group of transformations). The importance of maximal invariant theory will be discussed later in Chapter 4. Nonetheless, in many practical radar detection schemes the steering vector is considered known in structure, up to one parameter, typically related to the Doppler or DOA; the analytical justification for this approach is discussed below.

1.3.4 Steering Vector Known up to a Parameter and Doppler Processing

Testing a CUT actually means that the surveillance region is scanned in range and azimuth according to a certain policy, while different Doppler shifts are tested. The Doppler domain is in fact quantized into Q values, and the resulting specific steering vectors are individually used for target detection. The same applies to azimuth scanning through an active electronically scanned array (AESA). This intuitive approach, which is usually interpreted as a pulse-Doppler matched filterbank, can actually be formalized by saying that v is known up to one parameter, namely the normalized Doppler frequency19 ν_T. For such a problem, the GLRT derivation would simply include a further, outer maximization with respect to ν_T, so justifying the aforementioned matched filterbank detector

$$\max_{q=1,\ldots,Q}\, \frac{|v^\dagger(\nu_T^{(q)})\, z|^2}{\|v(\nu_T^{(q)})\|^2} \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.41}$$

as depicted in Figure 1.8. As a by-product of this processing, an estimate of the Doppler frequency shift f_D, hence of the radial velocity of the target, is obtained. Interestingly, recalling the structure of a steering vector (see Section 1.1.2), the processing above can be interpreted as peak finding on the frequency spectrum obtained via the discrete Fourier transform (i.e., using the FFT). This justifies the practical approach of performing Doppler processing followed by a conventional CFAR processor. However, in the milestone paper by Brennan and Reed [35], (space-time) adaptive detection was introduced, which paved the way to a number of adaptive detection schemes (in element-space or beamspace, pre-Doppler or post-Doppler; see [18]) as well as advanced detection techniques, discussed next.
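The FFT interpretation of the filterbank (1.41) can be sketched as follows; taking Q = N bins and a target exactly on a bin are simplifying assumptions for the illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
N, Q = 32, 32              # pulses and Doppler bins (Q = N: plain FFT, assumed)
nu_true = 10 / N           # assumed target Doppler, on bin 10 for clarity
alpha = 1.5
z = alpha * np.exp(2j * np.pi * nu_true * np.arange(N)) \
    + (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

# Each FFT bin k computes v(k/N)^H z with v(nu) = exp(j 2 pi nu n), so the
# filterbank (1.41) is |FFT(z)|^2 / N (the constant ||v||^2 = N is common).
bank = np.abs(np.fft.fft(z)) ** 2 / N
t = bank.max()             # detection statistic: peak over the Doppler bins
q_hat = bank.argmax()      # by-product: Doppler bin (radial velocity) estimate
print(t, q_hat)
```

The peak bin recovers the target's normalized Doppler (q_hat / N here), which is the peak-finding interpretation of Doppler processing mentioned above.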

Figure 1.8  Pulse-Doppler processing via matched filterbank.

19. A similar formalism applies to DOA using the normalized spatial frequency ν_S. Extension to more than one parameter is also straightforward.

1.4 Adaptive Detection in Colored Noise

Capitalizing on the comparison of the many different approaches to adaptive target detection in white noise performed in the previous sections, it is now easier

to finally address the general problem (1.8) where the covariance matrix of the noise, R = E[nn†], is generally nondiagonal. First of all, consider that the detectors presented so far in this chapter, although derived assuming white noise, can be straightforwardly generalized to colored noise if R is known. In that case, in fact, a whitening preprocessing can be performed through the square root of the matrix inverse, hence all equations for white noise still apply under the substitution of z, s, v with z̃ = R^{-1/2}z, s̃ = R^{-1/2}s, ṽ = R^{-1/2}v, since the transformed noise R^{-1/2}n gives back the white noise condition E[R^{-1/2}n(R^{-1/2}n)†] = R^{-1/2}E[nn†]R^{-1/2} = I. In practice, however, R is seldom known and hence must be estimated. To this end, a set of secondary data is necessary, as for the estimation of σ² in white noise discussed for the adaptive ED (1.20) and the adaptive MF (1.30). We recall that other schemes discussed above do not require secondary data, in particular the MMED (1.24) and MED (1.25), as well as the GLRT-based coherent detector (1.37) and the rank-one signal detector (1.40). Unfortunately, this desirable feature cannot be paralleled, in general, in the more general case of unknown R. As a confirmation, notice that not even the noise power can be unbiasedly estimated from the noise subspace if R is unknown. In fact, the denominator of the GLRT-based detector (1.37), which as discussed provides adaptivity to the detection statistic, for colored noise has expected value

$$E[\|P_v^\perp z\|^2] = \mathrm{Tr}\{R\} - \frac{v^\dagger R v}{\|v\|^2}.$$

By separating the power level from the structure of the covariance matrix, for example R = σ²C with diag{C} = 1 (for instance, an exponentially-correlated,


first-order autoregressive structure is a Toeplitz matrix with elements ρ^{|k−h|}, k, h = 1, …, N, where ρ is the one-lag correlation coefficient), it turns out that

$$E[\|P_v^\perp z\|^2] = \sigma^2\left(\mathrm{Tr}\{C\} - \frac{v^\dagger C v}{\|v\|^2}\right) = \sigma^2\left(N - \frac{v^\dagger C v}{N}\right)$$

so even unbiased estimation (based on the noise subspace) of the sole σ² would require knowledge of the structure C of the covariance matrix. The usual N − 1 normalization factor is obviously retrieved in the white case R = σ²I, whereas in general the estimator cannot be made unconditionally unbiased. A further difficulty is that the number of parameters to be estimated may easily exceed the number N of data samples in z: covariance matrix estimation involves the determination of N² parameters. This number can be lowered by considering special symmetries and estimating only the independent parameters. Examples are the Hermitian structure with respect to the main diagonal, persymmetry (also with respect to the cross diagonal), or the low-rank structure in which only a few eigenvalues of R are significantly different from zero. Still, the number of parameters to be estimated often exceeds N. Moreover, in practical settings such special symmetries may be violated due to nonidealities, hence estimating the full covariance matrix may represent, in this respect, a more robust approach. Secondary data are thus typically required for high-performance adaptive detection in general scenarios, but still two different strategies are possible; that is, either one-step (1S) or two-step (2S) approaches, which lead to detectors with diverse characteristics. The well-known Reed-Mallett-Brennan (RMB) rule [36] specifies that, compared to the optimum case of known covariance matrix, K = 2N independent identically distributed data produce an SNR loss of 3 dB.

1.4.1 One-Step, Two-Step, and Decoupled Processing

In 2S approaches the detector is devised, by means of the GLRT or alternative procedures, assuming that R is known. Then, in a second step, a covariance matrix estimate R̂ obtained from secondary data is used in place of R. This is conceptually the same as for the white noise case (see Section 1.2.3), however with two differences:

1. N-dimensional vectors are needed as secondary data, instead of scalar values;
2. Making the detection statistics (not only adaptive but also) CFAR entails additional complications.

Regarding point (1), such data can typically be obtained from neighboring range cells in fast-time (range bins), using the same temporal, spatial, or space-time data arrangement as for the CUT samples. Regarding point (2), the CFAR property can be embedded into an adaptive statistic provided that the statistical distribution


of the detector can be made independent not only of the noise power, but also of the structure of the true covariance matrix. This feature, often referred to as the generalized CFAR property, must be checked on a case-by-case basis, and is clearly a significant point of merit for an adaptive detector. A simplifying approach is instead to decouple filtering and detection, considering them as two separate processing stages. Specifically, multidimensional (multichannel) data from the CUT are first filtered to suppress clutter/jamming as much as possible, via heuristic (nonadaptive) clutter cancellers such as the moving target indicator (MTI), a pulse-Doppler matched filterbank (see Section 1.3.4), or adaptive processing (beamforming or STAP; see Section 1.1.2), so compressing the N-dimensional data vector to a scalar complex value;20 then, a decision is taken based on a proper (real-valued) statistic of the filtered output and an adaptive threshold provided by a CFAR processor. This scheme is classical and in fact reasonable, since P_D depends on the SNR, hence improving the latter via preliminary filtering for disturbance suppression is always beneficial to the detection performance. Moreover, the use of a CFAR processor guarantees the desired P_FA by adjusting the threshold to changes in the noise power level. However, the best filtering capabilities, which can significantly enhance the visibility of targets buried in noise, require optimal weighting based on the true covariance matrix R of the overall disturbance, hence can be implemented in practice only suboptimally, using an estimate R̂. More advanced schemes can be devised by trying to estimate at once all the unknowns while at the same time adapting the statistic for CFARness (i.e., a 1S approach). In doing so, filtering, detection, and CFAR processing are all embedded into a fully adaptive CFAR detection statistic (followed by nonadaptive thresholding).
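A small numerical sketch of why optimal weighting needs R: compare the output SINR of the nonadaptive filter w = v with the optimal weights w = R⁻¹v. The exponentially correlated disturbance (ρ = 0.95) and the target Doppler are illustrative assumptions:

```python
import numpy as np

N = 16
rho = 0.95   # assumed one-lag clutter correlation
v = np.exp(2j * np.pi * 0.25 * np.arange(N))     # assumed target steering vector
# Toeplitz covariance with elements rho^|k-h| (unit-power disturbance)
R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N))) + 0j

def out_sinr(w, v, R):
    # Output SINR (up to the target power): |w^H v|^2 / (w^H R w)
    return (np.abs(np.vdot(w, v)) ** 2 / (w.conj() @ R @ w)).real

sinr_mf = out_sinr(v, v, R)                       # nonadaptive (white-noise MF)
sinr_opt = out_sinr(np.linalg.solve(R, v), v, R)  # optimal weighting w = R^{-1} v
print(10 * np.log10(sinr_opt / sinr_mf), "dB gain from using R")
```

The optimal weights achieve SINR = v†R⁻¹v, which upper-bounds any other linear filter; with only an estimate R̂ available, part of this gain is lost, which is exactly the trade-off discussed above.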
From the discussion above, it is clear that covariance matrix estimation is a common aspect of both adaptive filtering followed by a CFAR processor and 2S adaptive detection; the latter may or may not also embed the CFAR capability into its statistic, as discussed below. In 1S fully adaptive approaches, instead, covariance matrix estimation still plays a role, but it cannot be disentangled, in general, from the structure of the detection statistic, thus providing better performance due to the joint processing.

20. As discussed in Section 1.1.2, a further decoupling is often introduced in traditional radar schemes in which beamforming and Doppler filtering are performed separately. This is suboptimal since clutter returns may generally manifest themselves as fully 2-D (nonfactorable) structures [13]. Indeed, STAP combines the signals in space and time, thereby accounting for angle-Doppler coupling [13, 18].

1.4.2 General Hypothesis Testing Problem via GLRT: A Comparison

The generic detection problem (1.8) can be restated more explicitly by assuming complex Gaussian colored noise with unknown covariance matrix R and a


coherent target with unknown deterministic α, as follows:

$$\begin{cases} H_0: z \sim \mathcal{CN}(0, R) \\ H_1: z \sim \mathcal{CN}(\alpha v, R). \end{cases} \tag{1.42}$$

A straightforward solution to this problem can be found by following a 2S-GLRT approach, so obtaining the colored-noise counterpart of the MF discussed in Section 1.3. To this aim, the matrix R is first assumed known; then the whitening transformation z̃ = R^{-1/2}z directly leads to the same problem of Section 1.3.2. Maximization with respect to α yields the ML estimate of the target amplitude

$$\hat\alpha = \frac{v^\dagger R^{-1} z}{v^\dagger R^{-1} v} \tag{1.43}$$

which generalizes (1.32) to the case of colored noise, and finally leads to the generalization of detector (1.34); that is:

$$t_{\mathrm{MF}}(z) = \frac{|\tilde v^\dagger \tilde z|^2}{\|\tilde v\|^2} = \frac{|v^\dagger R^{-1} z|^2}{v^\dagger R^{-1} v} \overset{H_1}{\underset{H_0}{\gtrless}} \eta. \tag{1.44}$$

The processing described by v†R⁻¹z has the meaning of a scalar product between the whitened data z̃ = R^{-1/2}z and the whitened steering vector ṽ = R^{-1/2}v, and in fact it is a generalization of the MF for white noise (1.34). Detector (1.44) involves the cascade of a (whitening) transformation aimed at suppressing the clutter and a projection $P_{\tilde v} = \frac{\tilde v \tilde v^\dagger}{\|\tilde v\|^2}$ aimed at performing coherent integration of the useful signal; the final statistic can be regarded as a square-law detector on the output of the cascade filter (i.e., $\|P_{\tilde v}\tilde z\|^2 = \tilde z^\dagger P_{\tilde v}\tilde z = \frac{|v^\dagger R^{-1} z|^2}{v^\dagger R^{-1} v}$), which can be interpreted geometrically in a way similar to Figure 1.7. The colored noise version of the ED can also be obtained; that is:

$$t_{\mathrm{ED}}(z) = \|R^{-1/2} z\|^2 = z^\dagger R^{-1} z \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.45}$$

which, as usual, does not exploit any coherence in the detection task. Both MF and ED statistics for white noise are obviously retrieved for R = σ²I. To implement the detectors above, an estimate of R is needed. This is obtained from a set of K secondary data, denoted as z₁, …, z_K, with K ≥ N. For convenience, the matrix notation Z = [z₁ … z_K] is also used, which yields a simple expression for the sample covariance matrix (SCM):

$$\hat R_{\mathrm{SCM}} = \frac{1}{K}\sum_{i=1}^{K} z_i z_i^\dagger = \frac{1}{K} Z Z^\dagger = \frac{1}{K} S \tag{1.46}$$


where S = ZZ† is often referred to as the scatter matrix. In particular, by substituting R̂_SCM into (1.44), the following equivalent detector is obtained:

$$t_{\mathrm{AMF}}(z, S) = \frac{|v^\dagger S^{-1} z|^2}{v^\dagger S^{-1} v} \overset{H_1}{\underset{H_0}{\gtrless}} \eta. \tag{1.47}$$
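A minimal sketch of the SCM/AMF computation (1.46)-(1.47), together with the RMB average SNR loss factor (K + 2 − N)/(K + 1) from [36]; the covariance model and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 8, 32
v = np.exp(2j * np.pi * 0.15 * np.arange(N))   # assumed steering vector
# True covariance (exponential correlation), used only to generate data;
# the detector sees only the samples.
R = 0.95 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N))) + 0j
Lc = np.linalg.cholesky(R)

def cn(m):
    # m colored-noise snapshots with covariance R
    return Lc @ (rng.standard_normal((N, m)) + 1j * rng.standard_normal((N, m))) / np.sqrt(2)

Z = cn(K)                  # secondary (training) data matrix
S = Z @ Z.conj().T         # scatter matrix; SCM = S / K as in (1.46)

def t_amf(z):
    # AMF statistic (1.47); the 1/K of the SCM cancels in the ratio
    Si = np.linalg.inv(S)
    return (np.abs(np.vdot(v, Si @ z)) ** 2 / np.vdot(v, Si @ v)).real

z0 = cn(1)[:, 0]           # noise-only CUT (H0)
z1 = 3.0 * v + z0          # CUT with a target along v (H1)
print(t_amf(z0), t_amf(z1))

# RMB rule [36]: average SNR loss (K + 2 - N)/(K + 1); for K = 2N it
# approaches 1/2, i.e., about -3 dB for large N.
print(10 * np.log10((2 * N + 2 - N) / (2 * N + 1)), "dB (K = 2N)")
```

The printed loss for K = 2N tends to −3 dB as N grows, matching the rule quoted in the text.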

It can be shown that the distribution of t_AMF does not depend on any unknown parameter under H₀ [37]; that is, it has the generalized CFAR property, where, as already mentioned, the adjective "generalized" is sometimes used to denote that CFARness holds true with respect to the whole covariance matrix R (not only to its power σ² = Tr{R}/N). Conversely, the adaptive version of the ED (1.45), that is:

$$t_{\mathrm{AED}}(z, S) = \|S^{-1/2} z\|^2 = z^\dagger S^{-1} z \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{1.48}$$

does not have either the CFAR property (with respect to σ²) or the generalized CFAR property (with respect to R). Remarkably, while the former can be gained by using an estimator σ̂² as a normalization factor (as shown in Section 1.2.3), the latter remains unsatisfied; at the same time, it can be expected that such a detector can provide a better P_D compared to CFAR detectors, being less constrained. We will analyze later the different performance trade-offs between adaptive detectors. In his 1986 pioneering paper [38], Kelly derived a GLRT for problem (1.42) assuming an unknown (Hermitian) positive definite covariance matrix R, including K ≥ N independent and identically distributed secondary data z₁, …, z_K (independent of z, free of target echoes, and sharing with the CUT the statistical characteristics of the noise) directly in the hypothesis testing problem formulation; that is:

$$\begin{cases} H_0: z \sim \mathcal{CN}(0, R), \quad z_i \sim \mathcal{CN}(0, R),\ i = 1, \ldots, K \\ H_1: z \sim \mathcal{CN}(\alpha v, R), \quad z_i \sim \mathcal{CN}(0, R),\ i = 1, \ldots, K. \end{cases} \tag{1.49}$$

With a slight abuse of notation, the resulting GLR can thus be written as

$$\frac{\max_{\alpha, R} f(z, Z \mid H_1)}{\max_{R} f(z, Z \mid H_0)} = \frac{\max_{\alpha, R}\ \mathcal{CN}(\alpha v, R) \times \prod_{i=1}^{K} \mathcal{CN}(0, R)}{\max_{R}\ \mathcal{CN}(0, R) \times \prod_{i=1}^{K} \mathcal{CN}(0, R)}. \tag{1.50}$$

The presence of the secondary data distribution makes well-defined the additional maximization with respect to R, necessary in both the numerator and

Coluccia: “chapter1_v2” — 2022/10/7 — 13:05 — page 30 — #30


denominator, which can be solved in closed form, producing the ML estimates

R̂_1 = (1/(K + 1)) [S + (z − αv)(z − αv)†]    (1.51)

and

R̂_0 = (1/K) S    (1.52)

under H1 and H0, respectively. The maximization with respect to α to obtain the 1S-GLRT detector follows straightforwardly, as it turns out that the ML estimate of α is simply the adaptive version of (1.43) performed in the 2S-GLRT; that is,21

α̂ = v† S⁻¹ z / (v† S⁻¹ v).    (1.53)

Substituting into the LR, the resulting detector can be equivalently written as

t_Kelly(z, S) = |v† S⁻¹ z|² / [v† S⁻¹ v (1 + z† S⁻¹ z)]  ≷_{H0}^{H1}  η.    (1.54)
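Continuing the same kind of numerical sketch (NumPy assumed; all scenario values are illustrative), Kelly's statistic (1.54) can be computed alongside the AMF and AED. By the Cauchy-Schwarz inequality in the S⁻¹ inner product one always has t_AMF ≤ t_AED, so t_Kelly < 1; the threshold below uses the closed-form P_FA expression recalled in footnote 22:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, pfa = 8, 24, 1e-4           # illustrative values

v = np.exp(2j * np.pi * 0.1 * np.arange(N)) / np.sqrt(N)
Z = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
z = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

S = Z @ Z.conj().T
Si = np.linalg.inv(S)

t_amf = np.abs(v.conj() @ Si @ z) ** 2 / np.real(v.conj() @ Si @ v)   # (1.47)
t_aed = np.real(z.conj() @ Si @ z)                                    # (1.48)
t_kelly = t_amf / (1.0 + t_aed)                                       # (1.54)

# Invert P_FA = 1/(1 + eta)^(K - N + 1) (footnote 22) to set the threshold
eta = pfa ** (-1.0 / (K - N + 1)) - 1.0
print(t_kelly, eta, bool(t_kelly > eta))
```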

The statistic in (1.54) is strikingly similar to t_AMF in (1.47), except for the additional term between brackets in the denominator, which is equal to 1 + t_AED. Notice that, while the AMF statistic can be expressed as the output of a filter followed by a detector, Kelly's statistic cannot. More generally, this highlights the suboptimality of the decoupled approach, as discussed in Section 1.4.1, and shows the capability of the 1S-GLRT to capture the intrinsic coupling in the problem, compared to the 2S-GLRT. This is due to the better use of the secondary data, included in the hypothesis testing formulation from the beginning, and can be shown to yield higher P_D than the AMF for the same P_FA,22 and SNR, while keeping the generalized CFAR property. In summary, Kelly's detector represents a paradigmatic example of a radar detector embedding the three functions of clutter suppression (filtering), coherent signal integration with optimal detection, and CFAR processing into a unique adaptive statistic. As such, it is considered a benchmark for advanced adaptive radar detectors.

21. Note that such an estimate of the target amplitude is also reasonable in non-Gaussian environments, where it corresponds to the least-squares (LS) estimator (not the ML). Examples of radar detection in non-Gaussian disturbance will be given in Chapters 3 and 4.
22. For Kelly's detector, P_FA can also be expressed in the simple closed form P_FA(η) = 1/(1 + η)^{K−N+1}, which can be easily inverted to set the threshold that guarantees the desired false alarm rate.

1.4.3 Behavior under Mismatched Conditions: Robustness vs Selectivity

What is unexpected, at first glance, is the different reaction of Kelly's detector and the AMF to mismatches in the signal model, given their otherwise similar behavior


in terms of P_D, especially for large K. Mismatches may arise due to several uncertainties on the steering vectors and/or the waveforms, in terms of beam-pointing errors, array miscalibration, angle/Doppler quantization with off-grid targets, waveform and array distortions, and other reasons [39–42]. In such cases, the desired behavior of the detector depends on the application: for instance, a selective detector is desirable for target localization, whereas a certain level of robustness to mismatches is preferable when the radar is working in searching mode. In [43] the performance of Kelly's detector was assessed when the actual steering vector, say p, is different from the nominal one v. The analysis showed that it is a selective detector (i.e., it tends to reject signals not arriving from the nominal direction). The mismatch level is measured by the squared cosine of the angle θ between the two directions; that is:

cos²θ = |p† R⁻¹ v|² / (p† R⁻¹ p · v† R⁻¹ v).    (1.55)
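The mismatch measure (1.55) is easily evaluated numerically (NumPy assumed; the exponentially correlated clutter covariance and the spatial frequencies are illustrative assumptions): it equals 1 when p is proportional to v and typically shrinks as the actual direction drifts away from the nominal one:

```python
import numpy as np

N = 16
n = np.arange(N)

# Illustrative clutter covariance: exponentially correlated, rho = 0.9
R = (0.9 ** np.abs(n[:, None] - n[None, :])).astype(complex)
Ri = np.linalg.inv(R)

def steer(f):
    # Unit-norm steering vector at normalized spatial frequency f
    return np.exp(2j * np.pi * f * n) / np.sqrt(N)

def cos2(p, v):
    # Squared cosine of the whitened-space angle between p and v, (1.55)
    num = np.abs(p.conj() @ Ri @ v) ** 2
    return num / (np.real(p.conj() @ Ri @ p) * np.real(v.conj() @ Ri @ v))

v = steer(0.10)                 # nominal direction
print(cos2(v, v), cos2(steer(0.11), v), cos2(steer(0.15), v))
```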

Figure 1.9 reports the so-called mesa plots, that is, iso-P_D contour curves for Kelly's detector, the AMF, and the adaptive coherence estimator (ACE) [44] (also called adaptive normalized matched filter [45]), whose statistic is given by

t_ACE(z, S) = |v† S⁻¹ z|² / (v† S⁻¹ v · z† S⁻¹ z).    (1.56)

Figure 1.9: Mesa plots of Kelly's, AMF, and ACE detectors, for N = 16, K = 32, and P_FA = 10⁻⁴.

Mesa plots are very informative, since they highlight how the detectors behave under matched conditions (cos²θ = 1, corresponding to the top horizontal axis)


as well as mismatched conditions for various levels of mismatch (down to cos²θ = 0, which means that v and p are orthogonal), as a function of the SNR, defined as

SNR = |α|² p† R⁻¹ p.    (1.57)
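The statistics entering the mesa plots can be cross-checked numerically (NumPy assumed; the scenario matches the N = 16, K = 32 setting of Figure 1.9, with an illustrative steering vector and white noise under H0). Since, by their definitions, t_Kelly = t_AMF/(1 + t_AED) and t_ACE = t_AMF/t_AED, one always has t_Kelly < t_ACE ≤ 1:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 16, 32                      # same values as in Figure 1.9
v = np.exp(2j * np.pi * 0.1 * np.arange(N)) / np.sqrt(N)
Z = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
z = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

S = Z @ Z.conj().T
Si = np.linalg.inv(S)

vSz2 = np.abs(v.conj() @ Si @ z) ** 2
vSv = np.real(v.conj() @ Si @ v)
t_amf = vSz2 / vSv                                   # (1.47)
t_aed = np.real(z.conj() @ Si @ z)                   # (1.48)
t_kelly = vSz2 / (vSv * (1.0 + t_aed))               # (1.54)
t_ace = vSz2 / (vSv * t_aed)                         # (1.56)

print(t_amf, t_aed, t_kelly, t_ace)
```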

From Figure 1.9 it is evident that Kelly's detector guarantees the highest P_D under matched conditions, while the AMF and ACE experience a certain performance loss; moreover, the AMF is a robust detector, the ACE is a quite selective one, while the behavior of Kelly's detector is selective but comparably more moderate. Notice that the ACE statistic can be rewritten as

t_ACE = t_AMF / t_AED    (1.58)

which can be interpreted as detecting the presence of a significant coherent component (along v) out of the overall signal energy. It is worth remarking once again that the AED totally ignores the information on the steering vector v (it is an incoherent processing); hence, when used alone as a detection statistic, it will lead to reduced P_D but strong robustness. By comparison, Kelly's detection statistic is conversely expressed by

t_Kelly = t_AMF / (1 + t_AED)    (1.59)

hence Kelly's and ACE detectors differ only in the additional unit in the denominator, which however makes a significant difference in the performance under both matched and mismatched conditions. The lesson learned is that diversified behaviors may arise from minor changes in the detection statistic. Along this line, many authors have thus addressed the problem of enhancing either the robustness or the selectivity of adaptive detectors. However, this is a nontrivial task, since detectors should still guarantee the CFAR property. Typical design procedures include statistical tests with modified hypotheses, asymptotic arguments, approximations, and ad hoc strategies, as presented next.

1.4.4 Model-Based Design of Adaptive Detectors

Several decades of work have already been dedicated to deriving model-based adaptive detectors through different approaches. One of the most interesting ideas is to modify the hypothesis test to promote some characteristics in its solution (i.e., to obtain a detector with desired properties). In the following, we review tunable receivers, the subspace approach, the orthogonal rejection (adaptive beamformer orthogonal rejection test (ABORT)) approach, the cone acceptance/rejection approach, and second-order approaches. Other approaches different from the GLRT also exist, like Rao's and Wald's tests [30], which return alternative detectors with specific characteristics. For instance, a detector based


on Rao's test is given by [46]

t_Rao(z, S) = |v† (S + zz†)⁻¹ z|² / [v† (S + zz†)⁻¹ v]    (1.60)

in which an interesting peculiarity arises: the use of the primary data z in the (quasi-)whitening transformation of the AMF, in addition to the sample covariance estimate based on the secondary data (summarized in the scatter matrix S). This produces a very selective behavior, unless the number of secondary data K becomes large: in that case, intuitively, the effect of the term zz† compared to S becomes less significant, and the detector essentially boils down to the AMF.

1.4.4.1 Tunable Receivers

The Rao detector (1.60) can also be regarded as a particular case of the AMF with de-emphasis (AMFD), given by [47]

t_AMFD(z, S) = |v† (S + ε_AMFD zz†)⁻¹ z|² / [v† (S + ε_AMFD zz†)⁻¹ v].    (1.61)
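A minimal sketch of the tunable AMFD (NumPy assumed; scenario values illustrative; the de-emphasis parameter is written eps here). Setting eps = 0 recovers the AMF, while eps = 1 gives the Rao detector (1.60):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 8, 24
v = np.exp(2j * np.pi * 0.1 * np.arange(N)) / np.sqrt(N)
Z = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
z = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
S = Z @ Z.conj().T

def t_amfd(eps):
    # AMF with de-emphasis (1.61): the CUT z enters the inverted matrix
    Mi = np.linalg.inv(S + eps * np.outer(z, z.conj()))
    return np.abs(v.conj() @ Mi @ z) ** 2 / np.real(v.conj() @ Mi @ v)

Si = np.linalg.inv(S)
t_amf = np.abs(v.conj() @ Si @ z) ** 2 / np.real(v.conj() @ Si @ v)
t_rao = t_amfd(1.0)               # Rao detector (1.60) as the eps = 1 case
print(t_amfd(0.0), t_amf, t_rao)
```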

The AMFD is an example of a tunable receiver, whose behavior can be adjusted via the de-emphasis parameter ε_AMFD. Clearly, for ε_AMFD = 0 the AMFD reduces to the AMF, while as ε_AMFD increases the presence of the CUT in the inverse matrix produces a sidelobe blanking effect (rejection of unwanted signals from a direction different from v). For ε_AMFD = 1, which as mentioned corresponds to the Rao detector, such a selectivity becomes very strong and exceeds that of the ACE. However, as ε_AMFD increases, the AMFD begins to sacrifice mainlobe detectability for sidelobe rejection capability;23 hence the Rao detector experiences a P_D loss under matched conditions. It is indeed well known that, in most receivers, enhancing robustness or selectivity often comes at the price of a certain P_D loss compared to Kelly's receiver, with different trade-offs, as further discussed later. The idea of tunable receivers is recurrent in the adaptive radar detection literature, and basically consists of ad hoc modifications to make well-known statistics parametric. Another example is Kalson's detector [48], in which a nonnegative parameter, say ε_Kalson, is introduced in Kelly's statistic; that is:

t_Kalson(z, S) = |v† S⁻¹ z|² / [v† S⁻¹ v (1 + ε_Kalson z† S⁻¹ z)].    (1.62)

23. The jargon “mainlobe” and “sidelobe” is clearly inherited from a formulation of the detection problem with multiple antennas, that is, a spatial steering vector. More generally, one refers to the direction of interest of the steering vector vs a mismatched direction in the (rank-one) signal subspace, no matter if it is a temporal, spatial, or space-time detection problem.
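Kalson's statistic (1.62) interpolates between the AMF and Kelly's detector; a minimal numerical sketch (NumPy assumed; scenario values illustrative; the tunable parameter is written eps here):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 8, 24
v = np.exp(2j * np.pi * 0.1 * np.arange(N)) / np.sqrt(N)
Z = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
z = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
S = Z @ Z.conj().T
Si = np.linalg.inv(S)

t_amf = np.abs(v.conj() @ Si @ z) ** 2 / np.real(v.conj() @ Si @ v)
t_aed = np.real(z.conj() @ Si @ z)

def t_kalson(eps):
    # Kalson's tunable statistic (1.62): eps = 0 gives the AMF, eps = 1 gives Kelly
    return t_amf / (1.0 + eps * t_aed)

vals = [t_kalson(e) for e in (0.0, 0.25, 0.5, 1.0)]
print(vals)
```

The statistic decreases monotonically as eps grows, moving from the (robust) AMF towards Kelly's detector.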


In doing so, the degree to which mismatched signals are rejected can be controlled in between the AMF and Kelly's detector. Kalson's detection statistic can be rewritten in terms of the previously introduced detectors by noticing that t_Kalson = t_AMF / (1 + ε_Kalson t_AED). A different tunable receiver has been proposed in [49] by substituting the exponent 2 of the square brackets in the whitened adaptive beamformer orthogonal rejection test (W-ABORT) detector, given later in (1.66), with 2ε_WA. This encompasses as special cases statistics equivalent to Kelly's and W-ABORT detectors and, for ε_WA < 0.5, behaves as a robust detector but with limited P_D loss under matched conditions, reaching the AED only as ε_WA → 0.

1.4.4.2 The Subspace Approach

Robustness can also be introduced by assuming a subspace model for the target; that is, instead of αv the signal return is given by Hb, with H ∈ C^{N×r} a known full-rank matrix and b ∈ C^r an unknown vector that contains the coordinates of the target return in the subspace spanned by the columns of H [9]; see also [8, Ch. 3]. By considering this model in (1.49), the so-called subspace detector is obtained as

t_SD(z, S) = z† S⁻¹ H (H† S⁻¹ H)⁻¹ H† S⁻¹ z / (1 + z† S⁻¹ z)    (1.63)

which can be considered a multirank-signal generalization of Kelly's detector [50, 51]. The subspace model postulates that a linear combination of vectors in a well-designed subspace can capture a significant part of the received energy along the uncertain steering direction. To this aim, H typically contains v as one of its columns, plus mismatched versions of it or other directions that span a wider subspace compared to the single direction of the nominal v. It can be shown that the detector is CFAR (and independent of H under H0). Several invariant detectors have been proposed by assuming specific subspace models for the steering vector and/or the transmitted waveform (e.g., [52, 53]). More generally, the linear-model idea can be used to create subspace versions of practically any detector, and indeed many examples are found in the open literature. Finally, a subspace can be adopted to model clutter and interference disturbances (not only in radar, but also in sonar and wireless communications) as subspace interference plus broadband noise, rather than as colored noise [54]. In such a case, oblique pseudoinverses, oblique projections, and zero-forcing orthogonal projections are required, which depend on knowledge of the signal and interference subspaces. Both deterministic (first-order) as well as stochastic


(second-order) signal models can be adopted.24 However, while the signal subspace is typically known, since it is introduced by the designer, the interference subspace is unknown; fortunately, the operators mentioned can be estimated directly from experimental data, without knowledge of the interference subspace [54]. Moreover, the resulting detectors, estimators, and beamformers take on the form of adaptive subspace estimators, detectors, and Capon beamformers, all of which are reduced in rank [54].

1.4.4.3 Adaptive Beamformer Orthogonal Rejection Test

Enhanced selectivity can be induced by solving a modified hypothesis testing problem that assumes the presence of an unknown fictitious signal u under H0, regarded as a (nuisance) deterministic parameter; that is:

H0: z = u + n
H1: z = αv + n    (1.64)

The intuitive idea is to try to make the hypothesis H0 more plausible under mismatches. To this aim, suitable assumptions on u are required. In particular, in [55, 56] it is assumed that u is orthogonal to v in the quasi-whitened space, an approach called the adaptive beamformer orthogonal rejection test (ABORT). In [57] it is conversely assumed that orthogonality holds in the whitened space. An adaptive detector is then obtained by deriving the GLRT, which includes an additional constrained maximization with respect to the unknown vector u. The resulting statistics are respectively given by

t_ABORT(z, S) = [1 + |v† S⁻¹ z|² / (v† S⁻¹ v)] / (2 + z† S⁻¹ z)    (1.65)

and

t_W-ABORT(z, S) = 1 / { (1 + v† S⁻¹ v) [1 − |v† S⁻¹ z|² / (v† S⁻¹ v (1 + z† S⁻¹ z))]² }    (1.66)

and both possess the generalized CFAR property. The W-ABORT detector is in particular very selective, but experiences a P_D loss under matched conditions. Notice that, like most of the discussed adaptive detectors, the ABORT and W-ABORT statistics can also be expressed in terms of other well-known detectors; that is:

t_ABORT = (1 + t_AMF) / (2 + t_AED)    (1.67)

24. Further details on the second-order approach are provided in Section 1.4.4.5.


and

t_W-ABORT = 1 / [(1 + v† S⁻¹ v)(1 − t_Kelly)²]    (1.68)

to be compared with (1.58) and (1.59).

1.4.4.4 Conic Acceptance/Rejection

A different idea to control the level of robustness/selectivity is to add a cone for acceptance or for acceptance/rejection (hence, under the H1 hypothesis only, or under both hypotheses, respectively); the resulting detectors are referred to as CAD and CARD, respectively, with statistic [41, 58]:

t_CAD(z, S) = t_AED(z, S) − [1/(1 + ε²_CAD)] [√(t_AED(z, S) − t_AMF(z, S)) − ε_CAD √(t_AMF(z, S))]² u(t_AED(z, S) − t_AMF(z, S)(1 + ε²_CAD))    (1.69)

where u(x) = 1 for x ≥ 0 and u(x) = 0 for x < 0 is the unit-step function.

…  N/K  …  otherwise    (1.73)

where

‖P⊥_ṽ z̃‖² = ‖z̃‖² − |ṽ† z̃|² / ‖ṽ‖² = z† S⁻¹ z − |z† S⁻¹ v|² / (v† S⁻¹ v) = t_AED − t_AMF.    (1.74)

Detector (1.73) is particularly interesting since it can be interpreted as a particular two-stage detector, in which the first detector is ‖P⊥_ṽ z̃‖² = t_AED − t_AMF; then, according to the selected branch, the final decision is taken by testing the corresponding statistic. Two-stage detectors are usually formed by cascading a robust detector and a selective detector; as a result, two independent thresholds are found, which lead to infinite possibilities of achieving the same P_FA but diversified behaviors under matched and mismatched conditions, according to the particular choice. Many such schemes have been derived, for instance by cascading the AMF and Kelly's statistics, or the AMF and ACE (the so-called adaptive sidelobe blanker (ASB)); refer to [9, 63] for details and a more comprehensive review, and also to [8, Chapter 4] for additional two-stage schemes (including subspace variants). Compared to all such detectors, the remarkable difference in (1.73) is that the two-stage architecture naturally arises without imposing the cascaded structure; moreover, the first threshold is automatically set to N/K, and two different statistics are used in the second stage. With regard to 1S approaches, the main results are summarized as follows [64]:

• Σ = vv† returns Kelly's detector;
• Σ = uu† (e.g., u a chosen mismatched version of v) requires numerically estimating α ∈ C, but can be reduced to a scalar search in a finite interval;
• Σ = R (unknown) yields a "robustified version" of Kelly's detector; that is:

t_1S-ROB = { (1 + z† S⁻¹ z)^{1 − 1/ζ} / [(ζ − 1) ‖P⊥_ṽ z̃‖²]^{1/ζ},   ‖P⊥_ṽ z̃‖² > 1/(ζ − 1)
           { (1 + z† S⁻¹ z) / (1 + ‖P⊥_ṽ z̃‖²),                      otherwise              (1.75)


where ζ = (K + 1)/N. Indeed, when ‖P⊥_ṽ z̃‖² < 1/(ζ − 1), the statistic t_1S-ROB is equivalent to Kelly's detector, while overall the resulting detector is more robust to mismatches. It is worth noticing that the exponent 1/ζ = N/(K + 1) appearing in the 1S-ROB (1.75) generalizes the exponent N/K appearing in the 2S-ROB (1.73), in a similar way as the additional unit in (1.51) represents the inclusion of the primary data in Kelly's estimate of the covariance matrix under H1, compared to the sample covariance matrix based on secondary data only, used by the AMF (and by Kelly's detector under H0; see (1.52)). The effect of ζ ∈ (1, +∞), which grows with K ≥ N, is to produce an automatic nonlinear deformation that reflects the confidence in the adaptation process based on the number of available data K ≈ ζN. Indeed, (1.75) can be rewritten as

t_1S-ROB = (1 + t_AED) / g_ζ(t_AED − t_AMF)    (1.76)

where

g_ζ(x) = { (1 + ε_ζ) x^{1/ζ},   x ≥ 1/(ζ − 1)
         { 1 + x,               otherwise       (1.77)

and ε_ζ = (ζ − 1)^{1/ζ} / (1 − 1/ζ) − 1 > 0.

Figure 1.10: Plot of ε_ζ = (ζ − 1)^{1/ζ} / (1 − 1/ζ) − 1 as a function of ζ = (K + 1)/N.

The behavior of ε_ζ, shown in Figure 1.10, is interesting: it vanishes at the boundaries (lim_{ζ→1} ε_ζ = lim_{ζ→+∞} ε_ζ = 0), is increasing in ζ ∈ (1, 2), attains its maximum value ε_ζ = 1 at ζ = 2


Figure 1.11: Plot of g_ζ(x) defined in (1.77) as a function of its positive argument x, for several values of ζ (asymptotic values are shown by dashed lines).

(corresponding to K ≈ 2N),26 and then decreases (eventually approaching 0). Overall, g_ζ is monotonically increasing (linearly or sublinearly) and practically behaves as a constant for large ζ. This implements robustification in a graded way, according to K/N: large K pushes the statistic towards the ED (very robust), while the opposite extreme scenario K ≈ N always selects Kelly's statistic (hence yielding no robustness) so as to preserve P_D. As a matter of fact, in [64] it is shown that it is possible for a CFAR detector to guarantee practically zero loss under matched conditions while providing variable robustness to mismatches, depending on the setting of a tunable parameter. Indeed, one may decrease the probability of selecting "Kelly's statistic" in the second branch of (1.75) by replacing ζ with ζ′ = (1 + ε_ROB)ζ, with ε_ROB ≥ 0. The resulting tunable detector, referred to simply as t_ROB, is given by

t_ROB(z, S) = { (1 + ‖z̃‖²)^{1 − 1/ζ′} / [(ζ′ − 1) ‖P⊥_ṽ z̃‖²]^{1/ζ′},   ‖P⊥_ṽ z̃‖² > 1/(ζ′ − 1)
             { (1 + ‖z̃‖²) / (1 + ‖P⊥_ṽ z̃‖²),                         otherwise               (1.78)

26. This brings us back to the RMB rule at the beginning of Section 1.4.
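A sketch of the tunable robust detector (1.78) (NumPy assumed; the scenario values and the choice eps_rob = 0.5 are illustrative; eps_rob stands for the text's tunable parameter, with eps_rob = 0 recovering the 1S-ROB (1.75)). The second branch coincides, up to the monotone transformation 1/(1 − t_Kelly), with Kelly's statistic, which the code checks:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, eps_rob = 8, 24, 0.5
v = np.exp(2j * np.pi * 0.1 * np.arange(N)) / np.sqrt(N)
Z = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
z = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
S = Z @ Z.conj().T
Si = np.linalg.inv(S)

t_amf = np.abs(v.conj() @ Si @ z) ** 2 / np.real(v.conj() @ Si @ v)
t_aed = np.real(z.conj() @ Si @ z)
resid = t_aed - t_amf              # ||P_perp ztilde||^2, see (1.74)

zeta = (K + 1) / N
zp = (1.0 + eps_rob) * zeta        # inflated parameter zeta'

if resid > 1.0 / (zp - 1.0):       # robust branch of (1.78)
    t_rob = (1.0 + t_aed) ** (1.0 - 1.0 / zp) / ((zp - 1.0) * resid) ** (1.0 / zp)
else:                              # branch equivalent to Kelly's statistic
    t_rob = (1.0 + t_aed) / (1.0 + resid)
print(t_rob)
```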


Again, as for the 2S-ROB, an interpretation as a two-stage receiver is possible, in which this time the first threshold is also tunable. Such a detector has the generalized CFAR property and a P_D depending only on the SNR and cos²θ, as for Kelly's, the AMF, and several other detectors. Remarkably, it achieves practically the same P_D under matched conditions while providing strong robustness to mismatches. An interesting aspect of (1.78) is that the processing is done entirely in the (quasi-)whitened space: this is tantamount to modeling the random component to be added to the whitened steering vector as a white term (i.e., αR^{−1/2}v + θ̃ with θ̃ ~ CN(0, δI)), which is interpretable as a worst case in terms of knowledge (completely unknown structure of the mismatch). It is thus worth highlighting that the design choice Σ = R, although heuristic, leads to a reasonable detector with desirable properties. The same approach can also be applied to range-spread targets27 [65].
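The claimed properties of the deformation in (1.77) can be verified numerically (NumPy assumed; the probed values of ζ are illustrative): ε_ζ peaks at 1 for ζ = 2 (i.e., K ≈ 2N), vanishes at both extremes, and g_ζ is continuous at its breakpoint x = 1/(ζ − 1):

```python
import numpy as np

def eps_zeta(zeta):
    # epsilon_zeta = (zeta - 1)^(1/zeta) / (1 - 1/zeta) - 1, with zeta = (K + 1)/N > 1
    return (zeta - 1.0) ** (1.0 / zeta) / (1.0 - 1.0 / zeta) - 1.0

def g(x, zeta):
    # Piecewise deformation g_zeta of (1.77)
    if x >= 1.0 / (zeta - 1.0):
        return (1.0 + eps_zeta(zeta)) * x ** (1.0 / zeta)
    return 1.0 + x

print(eps_zeta(2.0))                         # maximum value, attained at zeta = 2
print(eps_zeta(1.0001), eps_zeta(1e8))       # vanishing at both boundaries
x0 = 1.0 / (2.0 - 1.0)
print(g(x0 - 1e-9, 2.0), g(x0 + 1e-9, 2.0))  # continuity at the breakpoint
```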

1.5 Summary

In this chapter we introduced the fundamental elements for understanding the model-based approach to advanced adaptive radar detection. We first presented the problem under the minimal setup of an unstructured (i.e., completely unknown, generic) signal model for the target return in white noise. We discussed the role of the energy detector and interpreted it in the light of the Neyman–Pearson approach. Then, we progressively extended the model to account for correlation in the signal signature, ultimately approaching the more relevant structured signal model found in coherent radar detectors, which are concerned with the detection of a known steering vector with unknown amplitude in noise. The white noise assumption was finally removed by considering the most general case of an unknown covariance matrix, also clarifying along the way the different decoupled approaches for time (Doppler), space (angle), and CFAR processing, in comparison with space-time (STAP) and joint approaches. Several detection schemes under the GLRT paradigm, both one-step and two-step, were reviewed in the last part of the chapter; this is instrumental to the subsequent discussion, where the different performance trade-offs in terms of detection power and behavior under mismatched conditions will be contrasted with detection schemes based on a data-driven rationale. Last but not least, the importance of the (generalized) CFAR property was highlighted throughout.

A fact that has been observed, in several points, is that intuition may provide good solutions that can possibly be backed with theoretical arguments. This can justify ad hoc solutions or heuristic choices, as it provides interpretability, which is indeed one of the peculiarities of the model-based approach. However, in

27. For more details on range-spread target models and related detectors, see [8, Chapter 9].


recent years data-driven tools have been prominently emerging, showing impressive results in many different applications, including signal processing and decision problems. So a question naturally arises: Is there any better, possibly data-driven, function of the data with which to build more powerful detection algorithms? While the answer is expected to be positive in general, given the intrinsic lack of optimality of the GLRT approach, in the radar context additional requirements are present, as apparent from the above discussion: namely, besides interpretability, also tunability and CFARness. Data-driven techniques must thus be carefully tailored to the needs of radar detection, a nontrivial task indeed, as will be discussed throughout the rest of this book.

References

[1] Levanon, N., Radar Principles, New York: Wiley-Interscience, 1988.
[2] Skolnik, M. I., Introduction to Radar Systems, Third Edition, New York: McGraw-Hill, 2001.
[3] Meyer, D. P., and H. A. Mayer, Radar Target Detection: Handbook of Theory and Practice, Academic Press, 1973.
[4] Richards, M. A., J. A. Scheer, and W. A. Holm, Principles of Modern Radar: Basic Principles, Raleigh, NC: Scitech Publishing, 2010.
[5] Melvin, W. L., and J. A. Scheer, Principles of Modern Radar: Advanced Techniques, Edison, NJ: Scitech Publishing, 2013.
[6] Richards, M. A., Fundamentals of Radar Signal Processing, Second Edition, McGraw-Hill Professional, 2014.
[7] You, H., X. Jianjuan, and G. Xin, Radar Data Processing with Applications, Wiley-IEEE, 2016.
[8] De Maio, A., and M. Greco (eds.), Modern Radar Detection Theory, Edison, NJ: Scitech Publishing, 2015.
[9] Bandiera, F., D. Orlando, and G. Ricci, Advanced Radar Detection Schemes under Mismatched Signal Models, San Rafael, CA: Morgan & Claypool Publishers, 2009.
[10] Gini, F., and M. Rangaswamy, Knowledge Based Radar Detection, Tracking and Classification, Wiley-Interscience, 2008.
[11] Guerci, J. R., Cognitive Radar: The Knowledge-Aided Fully Adaptive Approach, Second Edition, Norwood, MA: Artech House, 2020.
[12] Klemm, R., Principles of Space-Time Adaptive Processing, London: The Institution of Engineering and Technology, 2006.
[13] Guerci, J. R., Space-Time Adaptive Processing for Radar, Second Edition, Norwood, MA: Artech House, 2014.
[14] Rangaswamy, M., "An Overview of Space-Time Adaptive Processing for Radar," Proceedings of the International Conference on Radar, 2003, pp. 45–50.
[15] Wicks, M., M. Rangaswamy, R. Adve, and T. Hale, "Space-Time Adaptive Processing: A Knowledge-Based Perspective for Airborne Radar," IEEE Signal Processing Magazine, Vol. 23, No. 1, 2006, pp. 51–65.
[16] Van Trees, H. L., Detection, Estimation, and Modulation Theory, Vol. 4: Optimum Array Processing, Wiley-Interscience, 2002.
[17] Krim, H., and M. Viberg, "Two Decades of Array Signal Processing Research: The Parametric Approach," IEEE Signal Processing Magazine, Vol. 13, No. 4, 1996, pp. 67–94.
[18] Ward, J., "Space-Time Adaptive Processing for Airborne Radar," Lincoln Laboratory, MIT, Lexington, MA, Tech. Rep. No. 1015, December 1994.
[19] Klemm, R., Space-Time Adaptive Processing: Principles and Applications, London: IEEE Press, 1998.
[20] Fioranelli, F., H. Griffiths, M. Ritchie, and A. Balleri (eds.), Micro-Doppler Radar and Its Applications, Scitech Publishing, 2020.
[21] Chen, V. C., The Micro-Doppler Effect in Radar, Second Edition, Norwood, MA: Artech House, 2019.
[22] Bergin, J., and J. R. Guerci, MIMO Radar: Theory and Application, Norwood, MA: Artech House, 2018.
[23] Griffiths, H. D., and C. J. Baker, An Introduction to Passive Radar, Norwood, MA: Artech House, 2017.
[24] Schleher, D. C., MTI and Pulsed Doppler Radar with MATLAB, Second Edition, Norwood, MA: Artech House, 2010.
[25] Chen, Z., G. K. Gokeda, and Y. Yu, Introduction to Direction-of-Arrival Estimation, Norwood, MA: Artech House, 2010.
[26] Flores, B. C., J. S. Son, and G. Thomas, Range-Doppler Radar Imaging and Motion Compensation, Norwood, MA: Artech House, 2001.
[27] Sullivan, R. J., Microwave Radar: Imaging and Advanced Concepts, Norwood, MA: Artech House, 2000.
[28] Van Trees, H. L., Detection, Estimation, and Modulation Theory, Vol. 1: Detection, Estimation, and Filtering Theory, Second Edition, John Wiley & Sons, 2013.
[29] Scharf, L. L., Statistical Signal Processing: Detection, Estimation, and Time Series Analysis, New York: Addison-Wesley Publishing Company, 1991.
[30] Kay, S. M., Fundamentals of Statistical Signal Processing, Vol. 2: Detection Theory, Prentice Hall, 1998.
[31] Zeng, Y., and Y.-C. Liang, "Eigenvalue-Based Spectrum Sensing Algorithms for Cognitive Radio," IEEE Transactions on Communications, Vol. 57, No. 6, June 2009.
[32] Zeng, Y., and Y.-C. Liang, "Spectrum-Sensing Algorithms for Cognitive Radio Based on Statistical Covariances," IEEE Transactions on Vehicular Technology, Vol. 58, No. 4, May 2009.
[33] Bose, S., and A. O. Steinhardt, "Maximal Invariant Framework for Adaptive Detection with Structured and Unstructured Covariance Matrices," IEEE Transactions on Signal Processing, Vol. 43, September 1995, pp. 2164–2175.
[34] Besson, O., S. Kraut, and L. L. Scharf, "Detection of an Unknown Rank-One Component in White Noise," IEEE Transactions on Signal Processing, Vol. 54, No. 7, July 2006, pp. 2835–2839.
[35] Brennan, L. E., and I. S. Reed, "Theory of Adaptive Radars," IEEE Transactions on Aerospace and Electronic Systems, Vol. 9, No. 2, March 1973, pp. 237–252.
[36] Reed, I., J. Mallett, and L. Brennan, "Rapid Convergence Rate in Adaptive Arrays," IEEE Transactions on Aerospace and Electronic Systems, Vol. AES-10, No. 6, 1974, pp. 853–863.
[37] Robey, F., D. Fuhrmann, E. Kelly, and R. Nitzberg, "A CFAR Adaptive Matched Filter Detector," IEEE Transactions on Aerospace and Electronic Systems, Vol. 28, No. 1, 1992, pp. 208–216.
[38] Kelly, E. J., "An Adaptive Detection Algorithm," IEEE Transactions on Aerospace and Electronic Systems, Vol. 22, No. 2, 1986, pp. 115–127.
[39] De Maio, A., S. Han, and D. Orlando, "Adaptive Radar Detectors Based on the Observed FIM," IEEE Transactions on Signal Processing, Vol. 66, No. 14, July 15, 2018, pp. 3838–3847.
[40] Besson, O., "Adaptive Detection with Bounded Steering Vectors Mismatch Angle," IEEE Transactions on Signal Processing, Vol. 55, No. 4, April 2007, pp. 1560–1564.
[41] Bandiera, F., A. De Maio, and G. Ricci, "Adaptive CFAR Radar Detection with Conic Rejection," IEEE Transactions on Signal Processing, Vol. 55, No. 6, 2007, pp. 2533–2541.
[42] Liu, J., and J. Li, "Robust Detection in MIMO Radar with Steering Vector Mismatches," IEEE Transactions on Signal Processing, Vol. 67, No. 20, October 15, 2019, pp. 5270–5280.
[43] Kelly, E. J., "Performance of an Adaptive Detection Algorithm; Rejection of Unwanted Signals," IEEE Transactions on Aerospace and Electronic Systems, Vol. 25, No. 2, 1989, pp. 122–133.
[44] Conte, E., M. Lops, and G. Ricci, "Asymptotically Optimum Radar Detection in Compound Gaussian Noise," IEEE Transactions on Aerospace and Electronic Systems, 1995.
[45] Kraut, S., and L. L. Scharf, "The CFAR Adaptive Subspace Detector Is a Scale-Invariant GLRT," IEEE Transactions on Signal Processing, Vol. 47, No. 9, September 1999, pp. 2538–2541.
[46] De Maio, A., "Rao Test for Adaptive Detection in Gaussian Interference with Unknown Covariance Matrix," IEEE Transactions on Signal Processing, Vol. 55, No. 7, July 2007, pp. 3577–3584.
[47] Richmond, C., "The Theoretical Performance of a Class of Space-Time Adaptive Detection and Training Strategies for Airborne Radar," in 32nd Asilomar Conference on Signals, Systems and Computers, Vol. 2, 1998, pp. 1327–1331.
[48] Kalson, S., "An Adaptive Array Detector with Mismatched Signal Rejection," IEEE Transactions on Aerospace and Electronic Systems, Vol. 28, No. 1, 1992, pp. 195–207.
[49] Bandiera, F., D. Orlando, and G. Ricci, "One- and Two-Stage Tunable Receivers," IEEE Transactions on Signal Processing, Vol. 57, No. 8, 2009, pp. 3264–3273.
[50] Kraut, S., L. Scharf, and L. McWhorter, "Adaptive Subspace Detectors," IEEE Transactions on Signal Processing, Vol. 49, No. 1, 2001, pp. 1–16.
[51] Kelly, E. J., and K. Forsythe, "Adaptive Detection and Parameter Estimation for Multidimensional Signal Models," Lincoln Laboratory, MIT, Lexington, MA, Tech. Rep. 848, April 19, 1989.
[52] Besson, O., L. Scharf, and F. Vincent, "Matched Direction Detectors and Estimators for Array Processing with Subspace Steering Vector Uncertainties," IEEE Transactions on Signal Processing, Vol. 53, No. 12, 2005, pp. 4453–4463.
[53] Besson, O., L. L. Scharf, and S. Kraut, "Adaptive Detection of a Signal Known Only to Lie on a Line in a Known Subspace, When Primary and Secondary Data Are Partially Homogeneous," IEEE Transactions on Signal Processing, Vol. 54, No. 12, 2006, pp. 4698–4705.
[54] Scharf, L., and M. McCloud, "Blind Adaptation of Zero Forcing Projections and Oblique Pseudoinverses for Subspace Detection and Estimation When Interference Dominates Noise," IEEE Transactions on Signal Processing, Vol. 50, No. 12, 2002, pp. 2938–2946.
[55] Pulsone, N., and C. Rader, "Adaptive Beamformer Orthogonal Rejection Test," IEEE Transactions on Signal Processing, Vol. 49, No. 3, 2001, pp. 521–529.
[56] Fabrizio, G., A. Farina, and M. Turley, "Spatial Adaptive Subspace Detection in OTH Radar," IEEE Transactions on Aerospace and Electronic Systems, Vol. 39, No. 4, 2003, pp. 1407–1428.
[57] Bandiera, F., O. Besson, and G. Ricci, "An ABORT-Like Detector with Improved Mismatched Signals Rejection Capabilities," IEEE Transactions on Signal Processing, Vol. 56, No. 1, 2008, pp. 14–25.
[58] De Maio, A., "Robust Adaptive Radar Detection in the Presence of Steering Vector Mismatches," IEEE Transactions on Aerospace and Electronic Systems, Vol. 41, No. 4, 2005, pp. 1322–1337.
[59] Besson, O., "Detection of a Signal in Linear Subspace with Bounded Mismatch," IEEE Transactions on Aerospace and Electronic Systems, Vol. 42, No. 3, 2006, pp. 1131–1139.
[60] Besson, O., A. Coluccia, E. Chaumette, G. Ricci, and F. Vincent, "Generalized Likelihood Ratio Test for Detection of Gaussian Rank-One Signals in Gaussian Noise with Unknown Statistics," IEEE Transactions on Signal Processing, Vol. 65, No. 4, 2017, pp. 1082–1092.
[61] Coluccia, A., and G. Ricci, "Adaptive Radar Detectors for Point-Like Gaussian Targets in Gaussian Noise," IEEE Transactions on Aerospace and Electronic Systems, Vol. 53, No. 3, 2017, pp. 1284–1294.
[62] Coluccia, A., and G. Ricci, "A Random-Signal Approach to Robust Radar Detection," in 52nd Annual Conference on Information Sciences and Systems (CISS), 2018, pp. 1–6.
[63] De Maio, A., and D. Orlando, "A Survey on Two-Stage Decision Schemes for Point-Like Targets in Gaussian Interference," IEEE Aerospace and Electronic Systems Magazine, Vol. 31, No. 4, 2016, pp. 20–29.
[64] Coluccia, A., G. Ricci, and O. Besson, "Design of Robust Radar Detectors through Random Perturbation of the Target Signature," IEEE Transactions on Signal Processing, Vol. 67, No. 19, 2019, pp. 5118–5129.

[65]

Coluccia, A., A. Fascista, and G. Ricci, “A Novel Approach to Robust Radar Detection of Range-Spread Targets,” Signal Processing, Vol. 166, 2020, p. 107223.

2
Classification Problems and Data-Driven Tools

This chapter is devoted to the introduction of the mathematical and algorithmic tools adopted in machine learning, in particular for classification. It starts from the decision problem in the general setting of multiple alternative hypotheses. Then, the more specific case of signal classification is discussed, which is related to (but different from) the signal detection problem addressed in Chapter 1. The relationship is clarified between the Neyman–Pearson rationale, typical of radar detection, and the use of loss functions in classification problems, following the formulation of statistical learning.

As a whole, the chapter provides the necessary background for discussing data-driven approaches to radar detection in Chapter 3. It is meant to lay out the essential elements, language, and concepts of the data-driven approach typical of machine learning and data science, and to adapt them to the typical background, jargon, and needs of engineers in the radar community. Specifically, a review of linear/nonlinear classifiers and different concepts of feature spaces is provided, and several popular machine learning tools (including support vector machines, k-nearest neighbors, and neural networks) that can be effectively adopted for radar detection are compendiously described to provide a self-consistent reference. The final part of the chapter is devoted to deep learning.

2.1 General Decision Problems and Classification

Selecting one among a set of multiple hypotheses is the general version of the binary decision problem seen in Section 1.1.3 (in particular (1.8)), which in


the radar context is the main tool for target (echo) detection. The aim of this section is to review the general formulation of decision problems and introduce the fundamentals of classification theory. The basic idea of pattern recognition is discussed by linking it to the basic radar detection problem. Then, the role of feature spaces and decision regions is highlighted, also in connection with similar concepts found in digital communications. Finally, the difference between the classification and Neyman–Pearson approaches is detailed.

2.1.1 M-ary Decision Problems

Multihypothesis decision problems arise when a decision must be taken regarding the belonging of data or signals to one out of a certain number of possible equivalence classes. These kinds of recognition problems are also frequently encountered in daily life and work, and they are rather intuitive. However, building a machine or algorithm able to act as a classifier is generally nontrivial, even for tasks that humans may easily solve, such as the recognition of shapes or handwritten digits.1 The reason is that there might be large variability within the same class (intraclass), hence what is needed is a precise way to distinguish such variability from salient differences between classes (interclass). More broadly, the field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of algorithms, and then with exploiting these regularities to take actions such as classifying the data into different categories [3].

At first glance, the target detection task in radar contexts looks indeed like a pattern recognition problem. A typical pulsed radar transmits a train of pulses (with period equal to the PRT) upconverted to a radiofrequency carrier, and the received signal may contain attenuated and delayed echoes of such pulses reflected by the target, which are however buried in the disturbance, as depicted in Figure 2.1.
The radar receiver then tries to find the target returns by correlating the received signal with a template of the transmit pulse (i.e., a waveform-matched filter). The output of the filter will exhibit a peak where a good overlap (match) is found; however, the autocorrelation properties of the transmit pulse play a major role in shaping the sharpness of such a peak.2 It is generally desirable to have a very peaked (impulse-like) autocorrelation, which produces a better resolution of the target. A clever way to achieve this goal is to introduce intrapulse modulation that spreads the pulse bandwidth while keeping the same temporal duration. For instance, biphase codes split the pulse duration into several subpulses (or chips)

1. The problem of handwritten digit recognition is still considered a benchmark to test classification algorithms, for instance using the publicly available National Institute of Standards and Technology (NIST) datasets for training and testing optical character recognition (OCR) systems. Please refer to MNIST [1] and EMNIST [2].
2. More generally, also considering a possible Doppler shift, this role is played by the ambiguity function (see Section 1.1.1).


Figure 2.1 Transmitted pulse train and corresponding received signal, where the pulse echoes from the target are completely masked by the noise (overall disturbance).

and modify their phases according to some binary code (i.e., a sequence of plus and minus multiplicative factors). Figure 2.2 shows a comparison between an uncoded rectangular pulse, where a plain sinusoidal carrier is transmitted over the pulse duration, and a waveform obtained by intrapulse biphase modulation through the Barker code of length 7 (with chip duration 1/7 of the overall pulse duration).

Figure 2.2 Comparison between an uncoded rectangular pulse and a waveform obtained through the Barker-7 code + + + − − + −. For illustration purposes, the carrier frequency is chosen low enough to have exactly one sinusoidal period per subpulse (it is usually much higher in a real radar).


The corresponding outputs of the matched filter are shown in Figure 2.3. It is evident that the uncoded pulse has a much more dispersed autocorrelation, which limits the localization of the target and the resolvability of two targets close in range. Conversely, the intrapulse-modulated waveform exhibits a much clearer and more distinct peak. Note that such pulse compression enlarges the bandwidth without requiring a reduction of the pulse duration (as conversely happens for uncoded pulses), thereby keeping the energetic advantages of long pulses (higher SNR) while significantly improving the range resolution.3

The example discussed above can be interpreted as a pattern matching problem, which is a form of recognition: intuitively, the richer the pattern is in structure (modulated pulse), the higher its recognizability. The sampled waveform is then used to construct the decision statistic, as discussed in Chapter 1, to finally take a decision on the possible presence of a target by testing the statistic against a threshold. This ultimately makes the detection task a binary decision problem, where it is always unknown whether the sought pattern (waveform) is present or not—which is different from other pattern recognition problems where the pattern is surely present but needs to be localized in the data stream.

More generally, electrical and telecommunications engineers are familiar with M-ary decision problems from any digital receiver's symbol detector. Binary digits (bits) representing the data to be transmitted through a digital communication system are in fact encoded in pairs, triplets, or generally n-tuples

Figure 2.3 Output of the waveform-matched filter for a simple rectangular pulse (uncoded), compared to an intrapulse-modulated waveform (Barker-7 code as in Figure 2.2).

3. Pulse compression can also be achieved by using frequency (instead of phase) modulation; namely, adopting a chirp pulse (linear frequency modulated waveform) or other approaches.
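As a quick sanity check of the pulse-compression idea, the chip-level autocorrelations of an uncoded pulse and of the Barker-7 code can be compared in a few lines (a simplified sketch at chip level, ignoring the carrier):

```python
import numpy as np

# Matched-filter output (up to a delay) equals the autocorrelation of the
# transmitted chip sequence: compare an uncoded pulse with the Barker-7
# biphase code.
def autocorr(c):
    c = np.asarray(c, dtype=float)
    return np.correlate(c, c, mode="full")

uncoded = np.ones(7)                           # 7 chips, no intrapulse modulation
barker7 = np.array([1, 1, 1, -1, -1, 1, -1])   # Barker code of length 7

for name, code in (("uncoded", uncoded), ("Barker-7", barker7)):
    r = autocorr(code)
    peak = r[len(code) - 1]                    # zero-lag value (= pulse energy)
    sidelobe = np.abs(np.delete(r, len(code) - 1)).max()
    print(f"{name}: peak = {peak:.0f}, max sidelobe = {sidelobe:.0f}")
```

The uncoded pulse yields a triangular autocorrelation with peak 7 and maximum sidelobe 6, while the Barker-7 code keeps the same peak with sidelobes of magnitude at most 1, which is exactly the sharper, more distinct peak visible in Figure 2.3.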


at the transmitter; the resulting M = 2^n possible configurations are associated to certain characteristics of the transmitted waveform (amplitude, frequency, phase, or combinations thereof), according to the chosen modulation format. A well-known representation is in terms of complex baseband I/Q samples, which can be displayed in the complex plane as shown in Figure 2.4 for the binary case. Due to noise over the communication channel, the received symbols will scatter around the nominal positions, to an extent that depends on the noise distribution and power, and the detector's goal is to correctly associate them to the actually transmitted symbols. In this respect, each symbol of the M-ary constellation corresponds to one of the classes c1, ..., cM, among which the detector has to choose. To this end, a suitable statistic is used, which however, in contrast to a radar detector, needs a more articulated decision strategy than a comparison against a threshold. This is clearly a consequence of the nonbinary decision at play, which more generally requires determining a decision function f as a map from the m-dimensional observation space of the data o to the decision space of the classes, as shown in Figure 2.5. Note that two equivalent formalisms may be used for the output variable at the decision space: it can be considered a scalar (categorical) variable y ∈ {c1, ..., cM}, or it can be regarded as an M-dimensional vector y ∈ {e1, ..., eM}, where ei = [0 ··· 0 1 0 ··· 0]^T, with the single 1 in the ith position, is a possible encoding of class ci. The latter formalism is often referred to as "1-of-M encoding" and is widely used in certain machine learning tools, while the former is more classical and straightforward.

A prominent difference between M-ary decision problems as those encountered in communications, and general classification problems arising in pattern recognition, machine learning, and data science at large, is how such a mapping f is obtained. Since the milestone work by Shannon, communications theory has developed a number of statistical models for which a sufficient statistic can be identified, that is, a function of the data to be used in the detector under a suitable optimization criterion. For instance, in an antipodal modulation with

Figure 2.4 Binary (antipodal) modulation in the complex plane: nominal constellation points and received signals (scattered).
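The two output-variable formalisms mentioned above (categorical label versus vector encoding) are easy to convert between; a minimal sketch with made-up labels:

```python
import numpy as np

# The two equivalent output formalisms: a categorical label y in {c1,...,cM}
# versus the M-dimensional vector y in {e1,...,eM}. Toy labels, M = 4.
M = 4
y_cat = np.array([0, 2, 1, 3, 2])     # categorical: class indices
y_vec = np.eye(M)[y_cat]              # rows e_i: a single 1 in the ith position

print(y_vec)
print(y_vec.argmax(axis=1))           # recovers the categorical labels
```

Selecting rows of the identity matrix is a standard idiom for building the e_i vectors; the argmax inverts the encoding, confirming the equivalence of the two formalisms.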


Figure 2.5 Mapping from the m-dimensional observation space of the data to the M-dimensional decision space of the classes.

equiprobable binary symbols over a memoryless additive white Gaussian channel, it is sufficient to test the sign of the samples taken at regularly spaced times (equal to the symbol duration) from the output of the matched filter, to decide for either class c1 (e.g., logic 0) or class c2 (e.g., logic 1). In fact, in this case the sign function is the optimal mapping f that minimizes the error probability

P(e) = P(c1 is decided | c2 was transmitted) P(c2) + P(c2 is decided | c1 was transmitted) P(c1)   (2.1)

where P(c1) = P(c2) = 1/2. It is worth highlighting how the two terms in the above equation can be linked to the probability of false alarm (PFA) and the probability of miss (1 − PD) found in the radar domain and discussed in Section 1.1.3. For unequal probabilities, the sign test is replaced by a more general comparison with a threshold different from zero that accounts for the unbalanced prior information about the symbols.4 Besides such simple examples, several well-known M-ary symbol detectors have been derived in the communications literature under the minimum error probability framework for different modulations, coding schemes, and channel models. Figure 2.6 shows three examples of classical modulation formats with the associated decision regions in the case of additive white Gaussian noise. Such regions delimit the areas of the complex plane where received (noisy) samples are associated to each symbol of the constellation, leading to the corresponding well-known decision rules5 that minimize the error

4. In Section 2.1.3 we will discuss in more detail the differences between the radar detection problem and other types of binary classification problems as found in digital communications and, more especially, in machine learning and data-driven approaches at large.
5. For more details, please refer to books on digital communications, such as [4].
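The sign detector for this antipodal scheme, and the error probability it attains, can be verified numerically; a small Monte Carlo sketch with synthetic symbols (the values of A and σ are arbitrary):

```python
import numpy as np
from math import erfc, sqrt

# Monte Carlo check of the sign detector for equiprobable antipodal symbols
# (-A for c1, +A for c2) in additive white Gaussian noise. The empirical
# error rate approaches P(e) = Q(A/sigma).
rng = np.random.default_rng(0)
A, sigma, n = 1.0, 0.5, 200_000
symbols = rng.choice([-A, A], size=n)             # equiprobable classes
received = symbols + sigma * rng.standard_normal(n)
decided = np.where(received >= 0, A, -A)          # sign test: the optimal mapping f
p_err = np.mean(decided != symbols)
p_err_theory = 0.5 * erfc((A / sigma) / sqrt(2))  # Gaussian tail Q(A/sigma)
print(f"empirical P(e) = {p_err:.4f}, theory = {p_err_theory:.4f}")
```

With A/σ = 2 the theoretical error probability is Q(2) ≈ 0.023, and the empirical rate matches it to within Monte Carlo fluctuations.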


Figure 2.6 Three well-known digital modulation constellations and the corresponding decision regions. Points represent the symbols associated to each of the 4, 8, or 16 classes, respectively.

probability in the general M-ary case

P(e) = Σ_{i=1}^{M} Σ_{j=1, j≠i}^{M} P(cj is decided | ci was transmitted) P(ci)
     = Σ_{i=1}^{M} P(ci) (1 − P(ci is decided | ci was transmitted)).   (2.2)

Equation (2.2) counts all possible ways in which the decision can be incorrect, P(ci is decided | ci was transmitted) being the probability of correct decision. A direct derivation of the optimal mapping f and of the related decision rule is however not generally possible for other types of data for which precise mathematical structures and effective statistical models are not available, such as images, speech, financial, or biological data. This explains the fundamentally different approach that originated in the machine learning field to solve classification problems, as discussed next.

2.1.2 Classifiers and Decision Regions

Roughly speaking, machine learning differs from other statistical approaches in the emphasis put on data rather than models. The term data-driven, in fact, as opposed to model-based, captures precisely this difference. Accordingly, classifiers in the machine learning field are conceived as a function f to be learned from the data through a suitable training process. While the latter will be discussed in Section 2.2.1, here we specify the characteristics of a generic data-driven classifier.

Irrespective of the way the learning process is performed, the resulting estimate f̂ of f is expected to have good properties for the classification problem. A natural criterion is of course the classification error (i.e., the rate of mistakes produced by applying f̂ to some input data for which the ground truth classes are


known). Such a rate can be estimated as the fraction of incorrect classifications

L(f) = (1/N_T) Σ_{i=1}^{N_T} I(y_i ≠ ŷ_i)   (2.3)

where ŷ_i denotes the decision of the classifier f̂, also called the predicted class, N_T is the number of available data, and I(·) is the indicator function (equal to 1 when its argument is true, and 0 otherwise). Equation (2.3) is a loss function referred to as either training error, when computed during the learning process of f̂, or test error, when computed on data that were not considered for the learning task. As a consequence, a good classifier is one for which the test error is small, a capability known as generalization [5]. Equation (2.3) is a popular example of a loss function, which is the general name for the optimality criterion used in a data-driven approach. This "0-1" loss function is a particularization for classification tasks of the more general squared error loss

L(f) = (1/N_T) Σ_{i=1}^{N_T} (y_i − ŷ_i)²   (2.4)

for regression problems having output variable y ∈ R, in which the quality of fit is a relevant performance parameter. Learning the function f̂ that produces the ŷ_i's with minimum squared loss is also known as empirical risk minimization which, owing to the law of large numbers, is considered a proxy of expected risk minimization—with the sample mean replacing the statistical expectation, which for the squared loss yields the minimum mean squared error (MSE) solution. Other loss functions can be considered as well, for instance cross-entropy, derived from information theory.6

A similarity can be noted between (2.3) and formula (2.2) encountered in digital communications once empirical sample (frequentist) estimates are used in place of probabilities and equal likelihood for all classes is assumed. The minimum error probability criterion can therefore be considered as the model-based counterpart of the empirical "0-1" loss minimization; the latter is the only possibility in a data-driven approach, where conditional probabilities are unknown due to the lack of a statistical model. Unfortunately, minimization of the empirical training error does not guarantee that the test error will be minimized,7

6. The output of the model for a certain observation, given a vector of input features x, can be interpreted as a probability modeled using the logistic function. Classification is based on the maximum probability criterion, and cross-entropy can be used as a measure of dissimilarity between the probability distributions of the two classes as they emerge during the training process.
7. It can be shown (see, e.g., [5, 6]) that the average test error rate is minimized by a Bayes classifier, which indeed exploits the conditional probabilistic relationship between raw data (input variable)


which calls for suitable learning strategies able to provide good generalization capabilities (low test error) without incurring the overfitting possibly introduced by the attempt to minimize the training error, as will be discussed further in Section 2.2.1.

A very convenient way to visualize the performance of a classifier is to compute the loss (2.3) separately for all M classes, and represent the results in matrix form, where each element (i, j) represents the classification error of data from class i as (incorrectly) belonging to class j. This is called the confusion matrix and, ideally, it is the identity matrix for a perfect classifier; in practice, the confusion can be toward certain (typically, more similar) classes or be rather spread across several classes, as shown in Figure 2.7. The confusion matrix reflects the classifier's performance on unseen data, and hence is a way to assess its generalization capability.

The key concept is that of decision region, namely a subset of the decision space associated to a certain class: data falling in each region are classified accordingly, which means that points crossing the border between regions (e.g., due to noise) are incorrectly classified as belonging to one of the neighboring classes. Note that in this sense neighboring classes in the decision space are more similar, meaning that it is easier for the classifier to mix them up; however, this might not correspond to a qualitative similarity between the corresponding data.

It is worth referring again to the comparison with the case of digital modulations. It is well known that the decision space, considering symbol-by-symbol detection, is the complex plane where I/Q samples can be represented as two-dimensional points. Such a decision space is partitioned into M regions,

Figure 2.7 Confusion matrices for three differently accurate classifiers (assuming classes are labeled in decreasing order of similarity). (a) Perfect classification (identity matrix), (b) good classifier with weak, near-class confusion, and (c) bad classifier with strong, far-class confusion.
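Both the loss (2.3) and the per-class view given by the confusion matrix are straightforward to compute; a toy sketch with hypothetical labels:

```python
import numpy as np

# "0-1" loss (2.3) and confusion matrix for a made-up 3-class example
# (labels are hypothetical, for illustration only).
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 1, 0])

loss01 = np.mean(y_true != y_pred)           # fraction of incorrect classifications

M = 3
conf = np.zeros((M, M))
for t, p in zip(y_true, y_pred):
    conf[t, p] += 1                          # element (i, j): class i decided as j
conf /= conf.sum(axis=1, keepdims=True)      # normalize rows to rates

print(f"0-1 loss: {loss01:.2f}")
print(conf)                                  # identity matrix <=> perfect classifier
```

Each row sums to 1, and the off-diagonal entries of a row show toward which classes the confusion is spread, exactly the patterns sketched in Figure 2.7.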

and output decision for each class, exactly the approach adopted in the design of optimal symbol detectors in digital communications. As already noted, however, such probabilities are generally unavailable.


according to the modulation order, and in fact optimal decision rules are those that minimize the error probability by considering decision region boundaries that best separate the different points of the constellation, as already shown in Figure 2.6. However, transmission of coded sequences with memory, namely through the use of convolutional codes, leads to multidimensional decision spaces of large dimension (exponentially growing with the length of the sequence) in order to capture the correlation among the symbols. This is very typical in machine learning problems, where the size and shape of the decision space is a fundamental aspect to capture the intrinsic correlation structure in the data and exploit it for accurate discrimination among classes, and hence good classification results. To this aim, raw observations o are typically first mapped into a suitable feature space, and f is consequently sought as a mapping between such a feature space and the decision space, as illustrated in Figure 2.8. In the following we use the notation x to generally refer to the input data of the classifier, including the special case in which feature extraction is the identity function; that is, feature space and observation space coincide (x = o). The available dataset is denoted by T = {x_i, i = 1, ..., N_T}, possibly adding a superscript to specify the class the data belong to; for example, x_i^(j) represents a datum from class cj.

Features are essentially values that better represent the data, in terms of dimensionality and redundancy. They are derived values built as functions of the raw data, and should be highly relevant for the classification problem at hand. To this aim, they typically perform dimensionality reduction (data compression) and selection of nonredundant information. Reducing the data space dimensionality from m to n < m (even n ≪ m) helps in saving computational resources and memory, and may ease human interpretation. Moreover, avoiding redundancy generally reduces the risk of overfitting the training data because only the essential information is retained. Finally, working in a low-dimensional space is beneficial against the curse of dimensionality, as discussed later.

Figure 2.8 Feature extraction from the m-dimensional observation space of the raw data to the n-dimensional feature space, then mapping to the M-dimensional decision space of the classes.
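As a concrete instance of such a feature extraction stage, a PCA-style linear mapping from an m = 5 observation space to an n = 2 feature space can be sketched as follows (synthetic data; PCA itself is discussed in Section 2.3.2):

```python
import numpy as np

# Minimal linear feature extraction (PCA via SVD): map raw observations o
# of dimension m = 5 to feature vectors x of dimension n = 2.
rng = np.random.default_rng(1)
O = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # raw data, correlated
Oc = O - O.mean(axis=0)                         # center the observations
U, s, Vt = np.linalg.svd(Oc, full_matrices=False)
n = 2
X = Oc @ Vt[:n].T                               # n-dimensional feature vectors
retained = (s[:n] ** 2).sum() / (s ** 2).sum()  # fraction of variance kept
print(X.shape, f"variance retained: {retained:.2f}")
```

The right singular vectors give the directions of largest variance, so the two retained features carry most of the (nonredundant) information while cutting the dimensionality from m = 5 to n = 2.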


In traditional machine learning, the definition of the feature vector plays a very important role in the success of the classifier's design, and such a know-how-intensive task is in fact celebrated as the art of feature engineering. Further condensation of the information is possible through a feature selection that produces a subset of the original feature vector. Indeed, a plethora of approaches and tools exists that can be exploited, thus delivering a huge number of degrees of freedom to the designer. Feature extraction in particular can be performed via linear or nonlinear methods: important examples, discussed later in Section 2.3.2, are principal component analysis (PCA), linear discriminant analysis (LDA), kernel PCA, and autoencoders. The latter is actually representative of the more recent deep learning paradigm, in which algorithms based on neural networks are exploited for blind feature extraction, thereby replacing feature engineering with a data-driven feature extraction that implicitly produces the feature vector (more details in Section 2.4).

Irrespective of the way features are obtained, including the case of direct use of raw data (i.e., x = o; see Figure 2.8), the classification stage involves a linear or nonlinear function f able to split data belonging to different classes into appropriate decision regions.8 In this respect, an important aspect is whether data in the decision space f(x) are linearly separable or not. Figure 2.9 illustrates this concept for the case n = 2 and M = 2; that is, two-dimensional feature vectors (corresponding to the two Cartesian axes) for a binary classification problem. Clearly, data in the leftmost example are linearly separable, hence a linear decision function can be adopted to best separate them, resulting in a linear boundary between the two decision regions. Conversely, data in the rightmost example are not linearly separable, hence optimal classification requires a nonlinear function (and accordingly a nonlinear decision boundary). Linear separability highly depends on the choice of the features and on the dimension of the feature

Figure 2.9 Linearly separable data (left), in comparison with data separation through a nonlinear decision boundary (right).

8. As already mentioned, in contrast to model-based classification problems, the data-driven approach does not make statistical assumptions about the data, hence f cannot be obtained through a mathematical derivation. Instead, the best approximation f̂ of f is learned through a training process (e.g., using labeled training data, i.e., supervised learning), as will be discussed in Section 2.2.1.
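The linearly separable case of Figure 2.9 can be reproduced with two Gaussian classes and a linear decision function; in this sketch the boundary is fit by plain least squares, used here only as a simple stand-in for a trained linear classifier:

```python
import numpy as np

# Two 2-D Gaussian classes separated by a linear decision function learned
# via least squares (a crude stand-in for a trained linear f-hat).
rng = np.random.default_rng(2)
n = 500
x1 = rng.standard_normal((n, 2)) + [-2.0, 0.0]   # class c1
x2 = rng.standard_normal((n, 2)) + [+2.0, 0.0]   # class c2
X = np.vstack([x1, x2])
y = np.r_[-np.ones(n), np.ones(n)]               # labels -1 / +1
Xb = np.c_[X, np.ones(2 * n)]                    # append a bias term
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)       # linear boundary w1 x + w2 y + b = 0
acc = np.mean(np.sign(Xb @ w) == y)
print(f"training accuracy: {acc:.3f}")
```

With the class means separated by four standard deviations, a straight line already classifies almost all points correctly; for the nonseparable case on the right of Figure 2.9 no such line exists, which is where the nonlinear methods of Section 2.3 come in.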


space, and in this respect different strategies are adopted by well-known classifiers, as discussed later.

We conclude this section with a real-world example aimed at clarifying the concepts introduced so far, in a context closer to that of radar. In fact, a problem of great interest for both the communications and radar domains is signal classification. While symbol detection can benefit from the knowledge of the modulation constellation, and based on a channel model can derive the optimal decision rule for symbol decoding (as seen in Section 2.1.1), other applications such as spectrum sensing for cognitive radio have no such information. Modulation classification is thus the task of deciding, among a set of possible formats, which one has been used for the received signal. Additional aspects can also be included for a finer classification, by increasing the number of classes. This also applies to other domains where classifying the characteristics of radiofrequency signals is of interest, including radar emitters as well as target returns, as discussed in more detail in Section 3.2.

Experimentation in signal classification has recently become much easier thanks to the availability of simple and low-cost software-defined radio (SDR) devices.9 This is very instructive to gain direct experience with the data-driven approach. From a mathematical modeling perspective, in fact, we are in the case of signal detection in white noise, with no or limited knowledge of the structure of the useful signal. Therefore, suitable model-based detectors for this case can be identified in those adopting as statistic the energy E = ‖z‖² (see Section 1.2.1), the maximum-to-minimum eigenvalue ratio λmax/λmin (1.24), and also the alternative eigenvalue-based statistic T1/T2 provided in [8]. Each one of such statistics leads to a stand-alone detector by comparison against a suitable threshold, whose performance can differ according to the signal at hand.
From a data-driven point of view, however, they represent possible features of the measured data. Indeed, Figure 2.10 shows an example of real data for the feature vector

x = [E, λmax/λmin, T1/T2]^T   (2.5)

represented as points of a three-dimensional (feature) space. In particular, four different datasets are depicted: for H0 (acquisitions performed in the absence of detectable transmissions) at the frequencies 1,500 MHz (amateur/land mobile)

9. In particular, the data reported here are obtained through an RTL-SDR dongle, which is a low-cost (about $20) and easy-to-use USB device able to receive radio signals in the range from 25 MHz to 1.75 GHz [7]. Originally these devices were designed to serve as Digital Video Broadcast Terrestrial (DVB-T) receivers, but it was later discovered that they can be used as generic (receive-only) SDRs. The front end of the RTL-SDR receives RF signals, downconverts them to baseband, digitizes them, and finally outputs the samples of the baseband signal across its USB interface.

Figure 2.10 Representation of different signals in 3-D feature space.
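A sketch of how a feature vector like (2.5) could be computed from a batch of snapshots is given below; the data are synthetic (not SDR acquisitions), and the third entry is a simple eigenvalue ratio used here only as a stand-in for the T1/T2 statistic of [8]:

```python
import numpy as np

# Feature extraction in the spirit of (2.5) from K snapshots of an
# m-dimensional received vector: energy plus two eigenvalue-based statistics
# of the sample covariance matrix.
rng = np.random.default_rng(3)

def features(Z):
    K = Z.shape[1]
    E = np.linalg.norm(Z) ** 2 / K               # average energy per snapshot
    S = (Z @ Z.conj().T) / K                     # sample covariance matrix
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues, descending
    return np.array([E, lam[0] / lam[-1], lam[0] / lam.sum()])

m, K = 8, 256
noise = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)
sig = np.exp(1j * 2 * np.pi * 0.1 * np.arange(m))[:, None]   # rank-one component
x_h0 = features(noise)          # H0: noise only
x_h1 = features(noise + sig)    # H1: signal buried in the same noise
print("H0:", np.round(x_h0, 2), " H1:", np.round(x_h1, 2))
```

As in Figure 2.10, the H1 points move away from the origin along the energy axis and show a much larger eigenvalue spread, which is what makes the two clusters separable by a simple classifier.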

and 857 MHz (4G uplink frequency), for H1 at the frequencies 816 MHz (4G downlink frequency) and 569 MHz (TV broadcast UHF). The different points allocate among the space based on the presence or absence of signal and on the received power (linked to the transmitted power as well as distance between transmitter and receiver), which is reflected in the height over the vertical dimension. Intuitively, the stronger the signal, the farther the points will be from the origin over the direction E , while the lower the signal, the nearer it will be to the origin. However, different signals also exhibit diverse dispersion on the horizontal plane represented by the other two features. On the qualitative side, it is apparent that data under H0 tend anyway to cluster together without appreciable overlap with data under H1 , meaning that one might expect they can be linearly separable through a suitable (linear) classifier. The resulting data-driven detector, however, could classify the presence or absence of a signal but without neither control on PFA nor CFARness. In fact, detectors based on plain energy are sensitive to the noise power σ 2 (see Section 1.2.1). Such aspects are more general and represent the two most important challenges in adopting machine learning tools for the design of radar detectors, as better discussed below. 2.1.3 Binary Classification vs Radar Detection So far, we have discussed the general aspects of decision problems and classification, which can essentially apply to almost all kinds of data. Indeed, machine learning is a very general methodology. However, one of the main questions in this book is whether, which, and how such methods can work for radar


Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

detection. Adaptive radar detection, in particular, is a peculiar problem, sharing similarities with other decision or classification problems but also showing important differences. A first remarkable aspect is the fundamental difference in the adopted optimization criterion. Specifically, radar detection certainly falls in the category of binary classification problems, but it adopts a different rationale, rooted in the Neyman–Pearson approach (see Section 1.2.2). Therefore, the confusion matrix for this binary classification problem, reported in Figure 2.11, is a constrained one, with the value of the upper off-diagonal element (PFA) set upfront and the value of the lower diagonal element (PD) representing the objective of the maximization (the remaining elements follow accordingly). By comparison, general classification problems, including binary ones, adopt the misclassification error as the loss function to be minimized (see (2.3)), which does not distinguish between confusing H0 for H1 or vice versa. Furthermore, the design of adaptive radar detectors also includes the additional constraint of the CFAR property (refer to Chapter 1), which is not considered in the empirical loss minimization criterion of general-purpose classifiers. The requirement of CFARness, together with the asymmetric role of the misclassification rates discussed above, are the two fundamental aspects that differentiate adaptive radar detection from plain-vanilla binary classification. As a consequence, it is necessary to select the most appropriate techniques for adaptive radar detection from the huge body of classifiers that have a long history in pattern recognition, machine learning, computer vision, and data science and artificial intelligence at large.
Such contents will be covered in Chapters 3 and 4, while the rest of this chapter is devoted to introducing the necessary background on learning techniques (Section 2.2) and data-driven classification algorithms (Section 2.3), including neural networks and deep learning (Section 2.4).
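To make the Neyman–Pearson rationale concrete, here is a minimal Monte Carlo sketch (not from the book) using a plain energy statistic as a stand-in detector: the threshold is calibrated on H0 data so that PFA is set upfront, and PD is then whatever the test achieves under H1. The sample size and the signal level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16                 # samples per trial (illustrative)
trials = 100_000

# Detection statistic: plain energy (illustrative stand-in detector).
def energy(x):
    return np.sum(x**2, axis=-1)

# Step 1: calibrate the threshold on H0 (noise-only) data so PFA is fixed upfront.
pfa_target = 0.01
h0 = energy(rng.standard_normal((trials, N)))
threshold = np.quantile(h0, 1.0 - pfa_target)

# Step 2: PD follows from applying the calibrated test to H1 (signal + noise) data.
signal_amplitude = 1.0
h1 = energy(signal_amplitude + rng.standard_normal((trials, N)))

pfa = np.mean(h0 > threshold)
pd = np.mean(h1 > threshold)
```

This also illustrates the limitation noted above: the calibration holds only for the assumed noise power, so the resulting rule is not CFAR if σ2 is unknown.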

Figure 2.11 For a binary classification problem, the confusion matrix boils down to the usual nomenclature of detection theory. Optimality criteria are generally different though.


Before proceeding further, however, two general observations are worth making about what to expect from data-driven tools, to avoid getting lost in the details and thus missing the big picture. A first interesting question is whether considering more features instead of a single detector is advantageous or not. The answer is not obvious. On the one hand, more features mean that different aspects of the data are captured, while a stand-alone detector uses a single detection statistic that compresses all the information into a scalar value. However, features are typically correlated, and indeed we have mentioned in Section 2.1.2 that dimensionality reduction and feature selection are typically performed in order to retain the most relevant information for the classification task (and reduce its complexity). On the other hand, an adequate level of correlation is beneficial for the emergence of clearly distinguishable clusters, and dimensionality also plays an important role toward this goal. A basic but significant example is given by the binary classification problem depicted in Figure 2.12, where the data obey the XOR pattern. Two-dimensional data points are noisy versions of the true/false logical levels: for instance, assuming −1 means a logical zero (false) and 1 means a logical one (true), input data around (−1, −1) and (1, 1) (agreement between the two coordinates) should be associated with false, while points around (−1, 1) and (1, −1) (disagreement between the two coordinates) should be associated with true, according to the truth table of the exclusive OR (XOR). The resulting clusters are clearly not linearly separable. However, linear separability can be easily gained by considering a 3-D feature space where the two coordinates are extended with a third component given by their product (right figure). This captures the mutual correlation between the two variables. More generally, suitable functions of the features may help describe the data better.
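A small numerical sketch of this feature lift (cluster centers and noise level are illustrative assumptions): in the raw 2-D space the noisy XOR clusters are not linearly separable, but after appending the product o1·o2 a linear rule on the third coordinate alone separates them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Noisy XOR data: centers at (+-1, +-1); "true" (label 1) when the signs disagree.
centers = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
idx = rng.integers(0, 4, size=n)
o = centers[idx] + 0.2 * rng.standard_normal((n, 2))
labels = (centers[idx, 0] * centers[idx, 1] < 0).astype(int)

# Lift to 3-D: x = [o1, o2, o1*o2]. In this space a linear rule suffices:
# thresholding the third coordinate at zero recovers the XOR labels.
x = np.column_stack([o, o[:, 0] * o[:, 1]])
predicted = (x[:, 2] < 0).astype(int)
accuracy = np.mean(predicted == labels)
```

With the small noise level assumed here the linear rule in the lifted space classifies essentially all points correctly, while no single hyperplane in the raw 2-D space can do so.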
Along this line, a remarkable theorem by Cover establishes that data that is not separable in low dimension may become separable in high dimension by means of a nonlinear transformation [9, 10].

Figure 2.12 (a) Noisy XOR data in the original (raw) 2-D space o = [o1 o2]T, and (b) the same data in a 3-D feature space x = [o1 o2 o1o2]T.

The consequences are manifold. We will see that some classifiers exploit this possibility through the so-called kernel trick to achieve excellent classification performance on complex patterns. At the same time, increasing the dimension too much can have a detrimental effect on separability, due to some peculiarities of high-dimensional geometry, and one may incur the curse of dimensionality, discussed later. A second question that may naturally arise is why so many different machine learning tools exist even for the same problem (namely, classification). Actually, one may argue that the same happens in model-based adaptive radar detection, with a plethora of different detectors (as seen in Chapter 1). In that case, the reason is theoretical: since the UMP test for the general detection problem does not exist, it makes sense to chase the detector that can provide the best relative performance, and certain trade-offs with respect to robustness or selectivity, by adopting different modeling and design approaches. Similar considerations apply to the data-driven case, however. In fact, there are theoretical guarantees that any mapping f can be well approximated by a function fˆ under mild conditions using rather basic mathematical tools, such as polynomials.10 The general reference analytical framework is that of function series and Hilbert spaces, which provide many other results using different functions and/or alternative assumptions: in the standard background of an engineer one may list Taylor series, Fourier series, and wavelets.11 It is established that truncation of such series or transforms can be efficient in different ways at representing certain classes of signals in order to deliver a satisfactory approximation fˆ.
Likewise, it should not come as a surprise that different types of learning machines may all provide universal approximation guarantees, while at the same time being more or less efficient at capturing certain structures in the input data. Moreover, they may introduce different levels of bias in the learning process, thereby leading to diversified trade-offs, as discussed below. For this reason, the rest of the chapter will specifically focus on relevant tools that are suitable for the radar detection problem.

10. In particular, a well-known theorem by Weierstrass states that any continuous function f over a closed interval on the real axis can be expressed in that interval as an absolutely and uniformly convergent series of polynomials.
11. The whole theory of normed complete spaces of functions, including sparse signal representation and compressed sensing, is also relevant.

2.1.4 Signal Representation and Universal Approximation

The previous sections touched on a number of points and introduced some important concepts. Since the aim of the book is to provide a straightforward and


focused introduction to the use of data-driven approaches for radar detection—as an alternative to, or in hybrid solutions with, classical model-based methods—the presentation has been neat and many details have been omitted or postponed to later sections. The concepts of signal representation and universal approximation deserve, however, a slightly more in-depth discussion. Decomposing a signal into a linear combination of judiciously chosen basis vectors is indeed an everlasting, foundational idea in signal processing, whose motivation is to obtain compact and interpretable representations useful for downstream processing. Examples are the decomposition of time signals or series into sinusoids (Fourier), or of speech and images into wavelets. Such formalisms, which have a long history, can be recast within the more general framework of atomic norm minimization [11]. Basically, the task is to represent a signal x in a vector space using atoms from a collection of vectors A = {ai}, called an atomic set (which contains either a finite or an infinite number of atoms) or dictionary; that is:

x = Σi ci ai   (2.6)

where ci are the coefficients of the decomposition. In many cases, the cardinality of A is much larger than the dimension of the signal, leading to an overcomplete representation, with an infinite number of possibilities to decompose x. Several criteria can be adopted to select the most appropriate one; for instance, according to Occam's razor principle, the most parsimonious decomposition is preferable, that is, the one with the minimum number of atoms. This leads to a sparse representation [12]. Examples of A that lead to sparse representations for certain classes of signals are wavelets (for natural images), or unit-norm rank-one matrices (for low-rank matrices) [13]. Data-driven approaches are also related to some form of (possibly sparse) signal representation, as will become clear throughout the book and as especially highlighted in Chapter 5. As to universal approximation, this is a desirable property of a learning machine with a specific structure (e.g., a neural network), which states that a configuration exists—namely, values for the coefficients of the vector space expansion, parameters, or weights—that is able to approximate a wide class of functions arbitrarily well. Many results of this type have been provided so far for different machine learning tools. This means that mappings of interest can be effectively represented via data-driven algorithms, although it is not specified how—this is indeed a different problem; namely, the training of a learning machine. While the latter is addressed in Section 2.2, and universal approximation is further discussed in Section 5.2.3, it is worth highlighting here that the XOR problem encountered in this section is a prominent counterexample showing the lack of universal approximation of some types of classifiers. The XOR problem has been deeply investigated since the onset of machine learning, in particular in the definition of learning machines based on linear combiners followed by a


hard limiter (the McCulloch-Pitts model of a neuron), which then evolved into the many facets of artificial neural networks. More in detail, the so-called multilayer perceptron was introduced to extend the class of maps f that can be approximated through such neurons, overcoming the limitation of Rosenblatt's original perceptron, which is not able to approximate the XOR function due to the lack of linear separability of the two clusters. The basic reason is that a multilayer perceptron can learn an approximation fˆ with a nonlinear boundary between decision regions, while a single neuron only admits a linear decision boundary (a separating hyperplane). We will come back to these aspects in Section 2.4.

2.2 Learning Approaches and Classification Algorithms

We have seen in the previous section the basic elements of a learning machine and the underlying signal representation, in particular for classification purposes. It is now a good moment to address the other side of the coin, which is how a machine can learn. There is evidence, in fact, that while certain human tasks can be easily reproduced by an algorithm, and often much more quickly, it is very difficult to give a computer the perception capabilities of even a one-year-old baby.12 The idea of machine learning, in the end, is to replace a rule-based description containing full instructions that characterize a certain class with a (large) set of data from which such a description can be inferred. As seen in Section 2.1.2, for instance, the problem of signal classification is different from symbol detection, since the latter can benefit from knowledge of the modulation constellation and channel model (although its parameters must be estimated at the receiver) to derive the optimal decision rule, while classification of modulation formats (or radar characteristics) lacks such information. Suitable algorithms are, however, needed to learn the decision function from the provided data.

2.2.1 Statistical Learning

The most direct way to perform learning is through a dataset of input/output labeled pairs (examples). This is called supervised learning or learning with a teacher [10]: labels ℓi represent the ground truth class of each example xi (feature vector or raw data), and are exploited by a training algorithm to learn the parameters of the machine.13 A loss function is adopted during the training, which makes use of a numerical optimizer: for each input example xi an output yi is predicted

12. This fact, popularized as Moravec's paradox, is reflected in the difficulty to unambiguously formalize what distinguishes a real-life signal of a certain type from other signals of different types.
13.
Other approaches do not require labels, in particular unsupervised learning (e.g., clustering) or reinforcement learning, discussed later in Section 2.3.5. Intermediate approaches are possible as well, namely semisupervised learning and hybrid solutions (the latter will be discussed in Chapter 4 for the specific problem of adaptive radar detection).


and compared with the desired output ℓi, obtaining an error L(yi, ℓi) for all the pairs in the labeled training set T = {(xi, ℓi), i = 1, . . . , NT}. We are assuming here, for the sake of simplicity, a balanced training set where all classes are equally represented; although this may not be the case in several applications (a point further discussed later in the book), the development of this chapter can be straightforwardly extended to unbalanced classes as well. The training algorithm updates the parameters according to a learning rule, and the process is repeated iteratively, in the attempt to converge to a minimum of the (sum) loss function. Typically, a stop criterion is adopted that should strike a balance between conflicting objectives. In fact, supervised learning differs from mere optimization in that the learned mapping fˆ should not only minimize the classification error on examples seen in the training set, but also correctly classify unseen examples. This generalization capability usually trades off against the minimization of the training loss, due to the phenomenon of overfitting. The latter means that fˆ adheres so closely to the training data of a certain class that it loses the ability to recognize slightly different examples from the same class; differently put, fˆ is too specialized (biased) to be an accurate representation of the underlying unknown f.14 On the other hand, although learning includes the estimation of internal coefficients or weights during the training phase, it does not coincide with parameter estimation, since the aim is not just to find the best fit for the data. To clarify these aspects, consider the problem of curve fitting from experimental data, which apparently resembles the supervised training of a learning machine (for regression or classification). We focus on the one-dimensional case for simplicity, but the discussion applies as is to multidimensional spaces.
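As a minimal numerical companion to this curve-fitting discussion (the ground truth line and noise level below are invented for illustration), the following sketch fits a one-dimensional linear template to noisy data by least squares and shows the sample MSE approaching the irreducible error floor:

```python
import numpy as np

rng = np.random.default_rng(3)

# Ground truth relationship (unknown in practice) and noisy observations.
f = lambda x: 1.5 + 0.8 * x        # illustrative true intercept/slope
sigma_eps = 0.5                    # illustrative noise standard deviation
x = rng.uniform(0.0, 10.0, size=2000)
y = f(x) + sigma_eps * rng.standard_normal(x.size)

# Least-squares estimates of intercept (beta0) and slope (beta1).
A = np.column_stack([np.ones_like(x), x])
beta0, beta1 = np.linalg.lstsq(A, y, rcond=None)[0]

f_hat = lambda x: beta0 + beta1 * x
mse = np.mean((y - f_hat(x)) ** 2)   # approaches sigma_eps**2 = 0.25
```

Since the template matches the true f here, the reducible error vanishes with enough data and the fitting MSE converges to the noise variance, the irreducible term discussed next.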
Denoting the input data by x, there is an underlying unknown relationship f, and we can only observe noisy data obeying the relationship

y = f(x) + ε   (2.7)

where ε is a zero-mean random error term, independent of x. Estimating f, which here appears as a curve-fitting problem, is at the core of statistical learning as well: both need to specify a more or less complex parametric template for the sought approximation fˆ. Linear regression, for instance, postulates a linear relationship fˆ(x) = β0 + β1x and estimates the two parameters (slope and intercept) that best fit the available data. The fitting error is given by the MSE [5]

E[(y − fˆ(x))2] = E[(f(x) + ε − fˆ(x))2] = E[(f(x) − fˆ(x))2] + E[ε2]   (2.8)

14. We will provide a more critical discussion on specialization vs generalization, which is related not only to overfitting but also to flexibility and interpretability, in Chapter 5.


since E[ε] = 0 (and exploiting linearity of the mean). The second term is an irreducible error due to the variance of ε; the first term is instead the reducible error that can be minimized through proper parameter estimation, in order to end up with an approximation fˆ as close as possible to f. However, in practice this process does not guarantee that the fitted curve will be able to predict values of f for unseen points x. Indeed, parameters are estimated from a finite number of pairs (xi, yi), i = 1, . . . , NT, and the best-fit parameters are those minimizing the sample version of the MSE (2.8); that is:

L(fˆ) = (1/NT) Σi=1..NT (yi − fˆ(xi))2.   (2.9)

This does not guarantee that the MSE is minimized on unseen data. In other words, the prediction capabilities of fˆ can be quite different according to the chosen template (model) for the fitting task. Figure 2.13 illustrates this phenomenon, where linear, quadratic, and high-order spline best fits are compared for a set of data. The ground truth f, unknown in practice, is also shown for reference. By increasing the degrees of freedom (number of parameters), the training error can be reduced more and more, since

Figure 2.13 Comparison among curve fits with different numbers of degrees of freedom: linear regression leads to underfitting of the ground truth curve, while a high-order spline leads to overfitting.


the fitted curve will more closely follow the data points. However, this does not necessarily yield the best approximation of f, as can be seen even by visual comparison (in this case the quadratic fit is the closest to f, despite the fact that the data points would suggest a somewhat higher order). As an extreme example, in case f is actually close to linear, then a linear fit is the best approximation, despite the fact that high-order polynomials may lead to an MSE equal to zero thanks to perfect interpolation of all data points. This overfitting phenomenon is general, and equally found in machine learning: supervised learning basically fits the parameters of the learning machine so as to minimize the training loss (compare (2.9) with (2.4)), and likewise this does not guarantee good generalization capabilities on unseen input data. Therefore, the training phase is always followed by a test phase, where the learned parameters are kept fixed and performance metrics (test error) are computed on ground truth labels not used in the training phase. The result is the typical U-shaped trend of the test error, qualitatively shown in Figure 2.14 as a function of the flexibility of the model (degrees of freedom), in comparison with the typical decreasing trend of the training error.15 The existence of so many machine learning tools is thus also motivated by the fact that they may provide different test errors even for the same training error. Consequently, significant effort should be put into the choice and proper setting of the machine, including the identification of the most suitable number of degrees of freedom.
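The qualitative behavior of Figures 2.13 and 2.14 can be reproduced with a quick polynomial-fitting experiment; the quadratic ground truth, noise level, and candidate degrees below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Quadratic ground truth observed through additive noise.
f = lambda x: 1.0 - 2.0 * x + 0.5 * x**2

def make_data(n):
    x = rng.uniform(-3, 3, n)
    return x, f(x) + rng.standard_normal(n)

x_tr, y_tr = make_data(30)      # small training set
x_te, y_te = make_data(1000)    # large held-out test set

train_err, test_err = {}, {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    train_err[degree] = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err[degree] = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
```

The training error keeps dropping as the degree grows, while the test error is smallest near the true model order: degree 1 underfits, and degree 10 typically fits noise and degrades on the held-out data.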

Figure 2.14 Typical U-shaped curve of the test error, decomposed in terms of the bias-variance trade-off, in comparison with the decreasing trend of the training error.

15. The decomposition in terms of squared bias and variance will be discussed later in Section 2.2.2.


On the practical side, implementation of the training phase requires a suitable optimizer and considerations about complexity, scaling, and numerical issues. Generally, some form of gradient of the loss function is exploited to iteratively update the parameters in the direction where the loss function decreases.16 Assessment of the training error is used by the training algorithm during the learning phase to decide when to stop. This depends on the peculiarities of the specific machine learning method, as will be discussed in Section 2.3. Regarding the test error, since by definition unseen examples are not available, a common practice is to reserve a subset of the training data for this purpose. The available dataset is therefore split into training data (the larger part) for the training phase, and test data (a smaller part) for the subsequent assessment once the training is complete. Sometimes the splitting is threefold, with validation data left out for a preliminary assessment of the generalization performance at the design stage. In fact, most machine learning tools have a number of additional tunable parameters that define the specific inner structure, including the order of the model (namely, the number of degrees of freedom). These are often called hyperparameters and cannot be part of the training; as a consequence, some trial and error is usually necessary to tune them for best performance. The error computed on the validation data is indeed used as a proxy for the test error, to preliminarily assess the latter while the hyperparameters are adjusted, before freezing their final values; at that point, test data (overlapping with neither the training data nor the validation data) are used to obtain the performance metrics.
A more refined approach is κ-fold cross-validation, in which the available dataset is randomly partitioned into κ equal-sized subsets: κ − 1 are used for training and one for testing, repeating the process κ times (each of the κ subsets is thus used exactly once for testing). The κ results can be averaged to produce a single performance metric, which is statistically more reliable since it prevents a single training set, possibly containing unbalanced classes or particularly favorable/unfavorable cases, from biasing the learning process too much. When κ = NT (the number of available data points), κ-fold cross-validation is equivalent to leave-one-out cross-validation, but several other variants exist [6]. In the discussion above, similarities between the training of a machine learning algorithm and statistical curve fitting have been highlighted. However, it is important to also note some differences. In particular, fitting a curve to data is usually motivated by obtaining a parsimonious description of the underlying relationship, to be possibly used for extrapolating or predicting other points. Therefore, the template for the curve to be fitted is of limited complexity, with a small and fixed number of parameters, and retains a certain degree of interpretability. Statistical learning, instead, is much broader and encompasses

16. More details will be given in Section 2.4, in particular about the well-known backpropagation and stochastic gradient descent algorithms, and the role of regularization will also be discussed.


classical statistical problems such as regression (i.e., curve fitting) in some of its tasks. “Learning” actually refers to any algorithm or system that tries to improve over time through experience, which in the supervised case is the training phase. Parameter estimation, which is the key ingredient in regression and other statistical problems, often plays a role, but other learning approaches are possible (see Section 2.3.5) in which the algorithm can improve over time without explicit parameter estimation. Nonetheless, many advanced machine learning tools typically deal with a large and possibly variable number of parameters. A template for the input-output mapping function is still needed, which is obviously parametric, but this is typically implicit in the chosen architecture of the machine. The latter is in fact rather opaque, with a less interpretable internal structure, shaped through the training process to capture the peculiarities of the data at hand. As we move from traditional machine learning to neural networks and related tools, such opacity increases and interpretability accordingly decreases: at the extreme end we have deep learning techniques, where the classifier (including the feature extractor) is considered as a black box with a huge number (millions, billions, or even more) of parameters to be learned, like a gigantic machine full of knobs that are automatically twisted by the invisible hand of the training algorithm. Nevertheless, the boundary between classical statistics, traditional machine learning, and contemporary deep learning is not hard, and the approach followed in this book is indeed to try to keep things whole.

2.2.2 Bias-Variance Trade-Off

Equation (2.8) can be rewritten in a form that highlights the different contributions to the MSE of the test error for an unseen input x•; that is [5, 6]:

E[(y• − fˆ(x•))2] = VAR[fˆ(x•)] + (Bias[fˆ(x•)])2 + VAR[ε]   (2.10)

which defines the average test MSE that could be obtained by repeatedly estimating f using a large number of training sets, and testing each at x• (the overall expected test MSE would be the average over all possible values of x• in the test set). Equation (2.10) reveals that, besides the irreducible error (last term), the test error can be reduced by lowering both the bias and the variance of our estimate fˆ. This depends on the performance of the chosen statistical learning method. The variance of fˆ is linked to the level of variability in the estimates obtained by using different training datasets. Flexible methods exhibit higher variance due to their greater adherence to the training data; hence changes in a few points can produce quite different estimates of f. On the contrary, methods with a small number of degrees of freedom are more robust to variations in the training data, and hence exhibit low variance: looking again at Figure 2.13, moving a few observations will produce quite different results for the higher-order curves, while it likely will not affect the parameters of the linear fit that much.
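The two reducible terms in (2.10) can be estimated empirically by refitting on many independent training sets and looking at the spread and centering of the predictions at a fixed point x•; the sinusoidal ground truth and the two competing degrees below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

f = lambda x: np.sin(x)            # illustrative ground truth
x_star, sigma = 1.0, 0.3           # evaluation point and noise level
n_train, n_sets = 20, 500          # size and number of training sets

preds = {1: [], 5: []}             # degree -> predictions at x_star
for _ in range(n_sets):
    x = rng.uniform(0, np.pi, n_train)
    y = f(x) + sigma * rng.standard_normal(n_train)
    for degree in preds:
        c = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(c, x_star))

variance = {d: np.var(p) for d, p in preds.items()}
bias2 = {d: (np.mean(p) - f(x_star)) ** 2 for d, p in preds.items()}
```

As expected, the rigid linear model shows low variance but high squared bias (it cannot follow the sinusoid), while the flexible degree-5 model shows the opposite behavior.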


Bias measures instead how much the average of the estimates, E[fˆ], deviates from the true f. This basically depends on the complexity of f: a sufficiently sophisticated model is needed to yield good approximations. As an example, since in Figure 2.13 the true f is nonlinear, the linear fit will never be an accurate approximation, no matter the amount of available training data. In general, more flexible methods lead to lower bias but higher variance, and the sum of the two terms determines whether the test MSE increases or decreases, as seen in Figure 2.14. As the number of degrees of freedom grows, in fact, the bias significantly decreases because the model gains the capability to better capture the relationship underlying the data, thereby estimating f more accurately. This reduces the bias to a more significant extent than the concurrent variance increase, ultimately producing a reduction of the expected test MSE. Such a trend continues until the model is sufficiently complex for the problem at hand: after that, any further increase in degrees of freedom will have a minor impact on the bias but will, conversely, significantly increase the variance, ultimately worsening the test MSE. The overall effect is referred to as the bias-variance trade-off, and yields the classical U-shaped curve illustrated in Figure 2.14. The left-most part of the plot is the underfitting region, whereas the right-most part is the overfitting region. While it is easy to obtain a very low variance method by just using a one-parameter or even constant model, this comes with high bias due to the model inaccuracy, resulting in poor generalization capability; likewise, a curve passing through all data points (an interpolant) has low bias for enough training data, but adapts too closely to them, hence again it will not generalize well (it has high variance due to the strong dependency on the training set).
The ultimate goal of a machine learning designer is to find the right method and set its flexibility-related hyperparameters—namely, those linked to the degrees of freedom, such as the order of the polynomial in curve fitting or the number of layers in a neural network—so as to catch the sweet spot for which both the variance and the squared bias are low, leading to a low MSE. Unfortunately, statistical bias, variance, and MSE cannot generally be computed in practice. The use of cross-validation (see Section 2.2.1) to estimate the test MSE is a common approach for keeping the bias-variance trade-off under control. Additional ways also exist to estimate the optimal amount of model complexity for a given prediction method and training set, such as model order selection criteria, the bootstrap, and other techniques (for details see [6]).
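A minimal sketch of κ-fold cross-validation used in this spirit, i.e., for model order selection (here κ = 5, and the noisy quadratic data and candidate polynomial degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Noisy quadratic data; the polynomial degree is the hyperparameter to select.
f = lambda x: 1.0 + 2.0 * x - 0.5 * x**2
x = rng.uniform(-2, 2, 60)
y = f(x) + 0.4 * rng.standard_normal(x.size)

def kfold_mse(x, y, degree, k=5):
    """Average held-out MSE over k folds for a polynomial fit of given degree."""
    idx = rng.permutation(x.size)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        c = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(c, x[test]) - y[test]) ** 2))
    return np.mean(errs)

cv_err = {d: kfold_mse(x, y, d) for d in (1, 2, 8)}
best_degree = min(cv_err, key=cv_err.get)
```

The cross-validated error serves as the proxy for the test MSE: the underfitting degree-1 model is clearly penalized, and the selection typically lands near the true model order.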

2.3 Data-Driven Classifiers

The first part of this chapter introduced general concepts and discussed the most important aspects of classifiers and, more generally, statistical learning methods. The second part, starting from this Section 2.3, will review in more detail a


selection of machine learning algorithms, in particular those that can also be suitably applied to the radar domain, as discussed in Chapter 3.

2.3.1 k-Nearest Neighbors

We mentioned earlier that the average test error rate is minimized by the Bayes classifier, which indeed exploits the conditional probabilistic relationship between input data x and output decision y for each class, exactly the approach adopted in the design of optimal symbol detectors in digital communications. As already noted, however, such probabilities are generally unavailable, making the Bayes classifier an unattainable gold standard to be used as a reference for the performance of actually implementable classifiers. Several approaches have been developed specifically with the aim of approximating the unavailable conditional distribution of y given x, and one of the most important is the k-nearest neighbors (KNN) classifier. KNN is a local method: for a test observation x•, the KNN classifier identifies the k closest points (with respect to x•) in the training set, referred to as the neighborhood Nk(x•), then estimates the conditional probability for class cj as the fraction of points in Nk(x•) labeled as cj; that is:

P(y = cj | x = x•) ≈ (1/k) Σi: xi∈Nk(x•) I(ℓi = cj)   (2.11)

where ℓi is the label of xi ∈ Nk(x•). Following the Bayes rule, KNN classifies x• according to the class with the highest estimated probability. For a binary classification problem, this is tantamount to a majority vote rule on Nk(x•), deciding for the class having more points falling in the set. For instance, k = 1 classifies each test point according to the label of the point at minimum distance among those in the training set T = T0 ∪ T1, with Tj = {xi(j), i = 1, . . . , NT(j)}, j = 0, 1; k = 5 means that, among the five closest points to x•, the decision is for the class with three or more points (which means estimating the corresponding conditional probability as 3/5 or more). Figure 2.15 reports an example where the decision boundary is shown for several values of k. The inverse impact of k on the flexibility of the classifier is visible: in this example k = 1 leads to overfitting (low bias but very high variance), since test points very close to each other may easily find points from both classes in their neighborhood, producing an irregular decision boundary; higher values of k bring more robustness, meaning that the decision boundary is relatively insensitive to changes in the training dataset, and produce a more regular decision boundary that ultimately tends, for very large k, to a straight line (low variance but high bias).
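A from-scratch sketch of the KNN rule in (2.11) for a binary problem; the two Gaussian classes used as a training set are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two Gaussian classes in 2-D (illustrative training set).
n = 100
x0 = rng.standard_normal((n, 2)) + np.array([-1.5, 0.0])   # class 0
x1 = rng.standard_normal((n, 2)) + np.array([+1.5, 0.0])   # class 1
X = np.vstack([x0, x1])
labels = np.array([0] * n + [1] * n)

def knn_predict(x_star, X, labels, k=5):
    """Majority vote among the k training points closest to x_star."""
    d = np.linalg.norm(X - x_star, axis=1)
    neighbors = labels[np.argsort(d)[:k]]
    p1 = np.mean(neighbors)        # estimated P(y = 1 | x = x_star), as in (2.11)
    return int(p1 > 0.5)

pred_left = knn_predict(np.array([-1.5, 0.0]), X, labels)
pred_right = knn_predict(np.array([+1.5, 0.0]), X, labels)
```

Note the run-time cost visible in the code: every prediction recomputes all distances to the whole training set, one of the drawbacks discussed below.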

Coluccia: “chapter2_v2” — 2022/10/7 — 13:06 — page 73 — #25

74

Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

Figure 2.15 Decision boundaries for a KNN classifier for several choices of k.

Despite its simplicity, the KNN is able to provide quite complex nonlinear decision boundaries, even noncontiguous ones, as in the case k = 1 of Figure 2.15. Furthermore, for a proper selection of k (and enough training data), the KNN can be remarkably close to the optimal Bayes classifier [5, 6]. Notice also that no model is learned: KNN is a nonparametric technique, and it does not make any assumption about the statistical distribution of the training data. However, KNN has two disadvantages. The first one is that it needs to compute all pairwise distances between the test point and the whole training dataset; this is clearly computationally demanding at run-time. A second issue is that approximating the theoretically optimal conditional expectation by KNN averaging does not work in high-dimensional feature spaces, a phenomenon commonly referred to as the curse of dimensionality. The main reason is that data points in high-dimensional spaces tend to be sparse (their density significantly decreases)—from which the separability of patterns or clusters mentioned in Section 2.1.3 follows [9, 10]—but unfortunately this also implies that all data points are approximately equidistant from each other. Consequently, the capability of the KNN (which is distance-based) to discriminate among the different classes vanishes. More generally, the curse of dimensionality implies that, for an increasingly large feature space, the training dataset must grow exponentially. For more details refer to [6].
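The distance-concentration effect just described is easy to observe numerically. The following sketch (sample sizes and dimensions chosen arbitrarily) shows the ratio between the smallest and largest distance from a reference point growing toward 1 as the dimension n increases:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for n in (2, 10, 1000):
    X = rng.uniform(size=(500, n))              # 500 points in the n-dimensional unit hypercube
    d = np.linalg.norm(X[1:] - X[0], axis=1)    # distances from the first point to all others
    ratios[n] = d.min() / d.max()
    print(n, round(ratios[n], 3))               # min/max distance ratio approaches 1 with n
```

When the ratio is close to 1, "nearest" neighbors are barely nearer than the farthest points, and distance-based discrimination degrades.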


2.3.2 Linear Methods for Dimensionality Reduction and Classification An important family of classifiers has the special property of having a linear decision boundary. One of the first approaches was proposed by Fisher [14] and is known as discriminant analysis. The idea is to consider only the mean and covariance of the data; namely, assuming a binary problem with the conventional mapping of the classes c1 = 0 (H0) and c2 = 1 (H1), µ0 = E[x] and Σ0 = E[(x − µ0)(x − µ0)T] for class 0, and analogously µ1 and Σ1 for class 1. A linear transformation wT x with coefficients w is considered, which produces a scalar output that is a linear combination of x, with mean wT µj and variance wT Σj w, j = 0, 1. Fisher defined the separation between these two distributions as the ratio of the variance between the classes to the variance within the classes, an approach similar to the ANOVA statistical technique. The result can be interpreted as a kind of SNR, and is maximized when

w ∝ (Σ0 + Σ1)^{−1} (µ1 − µ0)   (2.12)

leading to maximal separation between the two distributions. This technique, which maps feature vectors into a single scalar, is used for dimensionality reduction. Based on it, a classifier is simply obtained by comparing the data projected through w with a threshold. In practice, the true statistics are unknown, therefore sample estimates are used, based on the training datasets T0 and T1 for the two classes. A more popular version of Fisher's discriminant analysis is known as LDA. Unlike the original Fisher's approach, however, this assumes that data comes from a multivariate Gaussian distribution. As a result, it is a model-based parametric technique, which is expected to work better when data is approximately normally distributed. This assumption, in fact, allows one to derive the conditional distribution and, hence, the optimal Bayes classifier based on the (log-)likelihood ratio. The resulting classifier

(x − µ0)T Σ0^{−1} (x − µ0) + log det Σ0 − (x − µ1)T Σ1^{−1} (x − µ1) − log det Σ1 ≷_{H0}^{H1} η   (2.13)

is referred to as quadratic discriminant analysis (QDA). It is often simplified by further assuming that Σ0 = Σ1 = Σ (and that the covariance has full rank), which finally leads to the LDA, taking the form of a scalar product [3]

wT x ≷_{H0}^{H1} η   (2.14)

where

w = Σ^{−1} (µ1 − µ0)   (2.15)


and

η = (1/2) wT (µ1 + µ0).   (2.16)

By comparing (2.15) with (2.12), the relationship between LDA and Fisher's discriminant analysis is evident, although they are derived from different assumptions and methodologies. This essentially means that the resulting classification algorithm bears general-purpose applicability: geometrically, what counts is the projection of the n-dimensional point x onto the direction w (i.e., which side of the hyperplane perpendicular to w and identified by η the point x lies on). It is worth noting that dimensionality reduction can be performed in a more sophisticated and flexible way by Pearson's PCA [15]. Compared to LDA, PCA is an unsupervised approach (it ignores the labels ℓi, see Section 2.3.5) and does not compress data into a single scalar; rather, it performs a change of basis and projects each data point onto only the first few principal components.17 As a result, a lower-dimensional data vector is obtained which, however, preserves the inherent data variability (variance) as much as possible. In particular, the first principal component is the direction that maximizes the variance of the projected data, while the second, third, and so on, are directions orthogonal to all the preceding ones with maximal residual variance. It turns out that the principal components are eigenvectors of the covariance matrix of the data, and therefore the orthogonal transformation implementing the PCA is often obtained by eigendecomposition or singular-value decomposition of a sample matrix estimate. Regularization techniques, such as shrinkage toward a pooled matrix or diagonal loading, are necessary when the dimension of the data space n is larger than the number of available data NT, a typical problem in covariance matrix estimation. Further details can be found in [3, 16]. If linear separability of two classes is of concern, other approaches different from LDA can be adopted as well.
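As a concrete sketch of (2.14)–(2.16) with sample estimates in place of the unknown true statistics (the Gaussian toy data and all numeric values below are invented for illustration):

```python
import numpy as np

# Illustrative two-class Gaussian data with a shared covariance (the LDA assumption)
rng = np.random.default_rng(0)
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal(mu0, Sigma, size=200)   # training set T0
X1 = rng.multivariate_normal(mu1, Sigma, size=200)   # training set T1

# Sample estimates replace the unknown true statistics
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = 0.5 * (np.cov(X0.T) + np.cov(X1.T))              # pooled covariance estimate

w = np.linalg.solve(S, m1 - m0)                      # eq. (2.15): w = Sigma^{-1}(mu1 - mu0)
eta = 0.5 * w @ (m1 + m0)                            # eq. (2.16): threshold

def lda_decide(x):
    return int(w @ x > eta)                          # eq. (2.14): decide H1 if w^T x > eta

print(lda_decide(np.array([2.9, 1.1])))              # point near mu1 → 1
print(lda_decide(np.array([0.1, -0.2])))             # point near mu0 → 0
```

Geometrically, `w` defines the normal of the separating hyperplane and `eta` places it halfway between the projected class means.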
Rosenblatt's perceptron [3, 10] is a particularly important one, since it triggered the development of neural networks (see Section 2.4). The feature vector is processed by means of a linear model in the form wT x (as in (2.14)), whose output is however further passed through a nonlinear activation function ϕ(·). The latter is chosen as the signum (step) function:

sign(x) = +1 if x ≥ 0, −1 if x < 0.   (2.17)

Classes are conventionally encoded with target values +1 for c1 and −1 for c2, for the sake of convenience, since this matches the choice of the activation function [3]. Such a choice also provides a suitable cost function for the misclassification error to be minimized during the training process. This leads to a supervised learning algorithm known as the perceptron criterion, which returns the vector w containing the synaptic weights and the bias, according to the McCulloch-Pitts model of a neuron (more details will be given in Section 2.4). The perceptron vector w defines a hyperplane, and classification is performed based on which side the test point falls in. However, although the nonlinearity in the activation function may be misleading, the perceptron

y = f(x) = sign(wT x)   (2.18)

is a linear classifier for linearly separable data.

17. A further possibility to perform dimensionality reduction in a fully unsupervised and data-driven way is to exploit autoencoders, which is discussed in Section 2.4.3.

2.3.3 Support Vector Machine and Kernel Methods The perceptron returns a particular hyperplane for data that are linearly separable, among many possible solutions, as depicted in the left-most graph of Figure 2.16. The returned solution depends on the initialization of the learning algorithm; moreover, if data is not linearly separable, there might even be convergence issues [6]. The optimal separating hyperplane is instead the unique solution maximizing the (perpendicular) distance to the closest point from either class, known as the margin, and represented by the shaded area in Figure 2.16(a). It can be found by solving a quadratic optimization problem subject to constraints of the type

ℓi (wT xi + b) ≥ 0   (2.19)

where we recall that ℓi ∈ {+1, −1} are the labels associated with the training set {xi, i = 1, . . . , NT}, and we have made the bias parameter b explicit out of the weight vector. Maximizing the margin subject to such constraints (a problem that can be rewritten in a more suitable form for numerical resolution) means that the found parameters w, b define a hyperplane where the training data are all

Figure 2.16 (a) Different separating hyperplanes (dashed and dotted lines) and maximal-margin hyperplane (solid line) identified by the support vectors on the margin boundaries, and (b) optimal soft-margin hyperplane for nonlinearly separable data.


correctly classified. This is expected to lead to better classification performance also on test data [6]. However, when the data is not linearly separable, there is no feasible solution to this problem and an alternative formulation is needed. The general approach, known as the support vector machine (SVM), produces nonlinear decision boundaries by determining a linear boundary in a large, transformed version of the feature space, as discussed below. The three points in Figure 2.16(a) that are equidistant from the maximal-margin hyperplane, and lie along the dotted lines indicating the width of the margin, are called support vectors, since they support the hyperplane: a small change in the position of such points would significantly affect the direction of the hyperplane (i.e., it would change the optimal vector w). Thus, the maximal-margin hyperplane depends directly on the support vectors, but not on the other observations, whose movement would not affect the separating hyperplane (unless the change crosses the margin). Such characteristics suggest that overfitting may occur in this approach. To overcome the sensitivity of the maximal-margin hyperplane to the support vectors, as well as to extend its applicability to nonlinearly separable data, the concept of soft margin is introduced. This allows some training points to be misclassified, as shown in the right-most graph of Figure 2.16. The optimal soft-margin hyperplane is found by introducing slack variables in the optimization problem, according to some budget of margin violation.18 The latter is a tuning parameter (generally found via cross-validation) that basically acts on the bias-variance trade-off [5]: when such a parameter is large, the margin will be wide and hence there will be a higher number of support vectors; vice versa, when it is small, the margin will be narrow and hence there will be a lower number of support vectors.
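Dedicated solvers are normally used for the quadratic program above, but the soft-margin idea can be sketched via subgradient descent on the hinge loss. This is a toy implementation under invented data and hyperparameters, not the book's formulation:

```python
import numpy as np

def svm_train(X, labels, lam=0.01, epochs=200, lr=0.01):
    """Minimal soft-margin linear SVM sketch: subgradient descent on the
    regularized hinge loss. Points with margin >= 1 contribute only through
    the regularizer, mirroring the fact that only (near-)margin points matter."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, labels):               # labels t in {+1, -1}
            if t * (w @ x + b) < 1:               # margin violated: hinge subgradient
                w += lr * (t * x - lam * w)
                b += lr * t
            else:
                w -= lr * lam * w                 # only regularization shrinkage
    return w, b

# Linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.5], [-1.5, -2.5]])
t = np.array([+1.0, +1.0, -1.0, -1.0])
w, b = svm_train(X, t)
print(np.sign(X @ w + b))   # all training points on the correct side
```

The budget of margin violation discussed above corresponds here to the regularization weight `lam`: a larger value tolerates more violations in exchange for a wider margin.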
The fact that SVM predictions only depend on the support vectors, and thus training is robust to points that are far away from the hyperplane, is in sharp contrast with other linear approaches, namely LDA, in which the classification rule depends on the mean of all data within each class, as well as on the within-class covariance (which is a function of all the observations). The importance of SVM as a classification tool is further motivated by the fact that it can be easily modified to produce nonlinear decision boundaries, thus greatly extending the class of problems it can attack. The basic idea is to map the original feature space into a larger space through a nonlinear transformation. As already mentioned in Section 2.1.3, in fact, data that is not separable in low dimension may become separable in high dimension [9, 10]. A simple way to artificially enlarge the feature space is to consider higher-order polynomials

18. We omit here many details on SVM, since the goal is to provide a gentle and succinct introduction for the sake of the use of data-driven tools in radar detection, which is addressed in Chapter 3. Excellent references to go deeper are [3, 6, 5, 10]. Note, however, that ready-to-use implementations of SVM are available in all statistics, machine learning, and signal processing packages and software.


and mixed terms as additional dimensions, as already done in Figure 2.12. The generalization of this approach is called the kernel trick, and it also has the benefit of keeping a sustainable computational complexity while greatly enlarging the feature space, even to an infinite number of dimensions. It can be shown, in fact, that the dual problem for the SVM does not depend on the values of the training points directly, but only through inner (scalar) products ⟨·, ·⟩. Specifically, the learned function can be written as [5, 6]

f(x) = sign(wT x + b) = sign( Σ_{i∈S} ai ℓi ⟨x, xi⟩ + b )   (2.20)

where ⟨x, xi⟩ = xT xi and S is the set of the support vectors. By introducing a feature-space transformation h(·), the previous equation simply rewrites as

f(x) = sign(wT h(x) + b) = sign( Σ_{i∈S} ai ℓi ⟨h(x), h(xi)⟩ + b )   (2.21)

which means that one may even not define h explicitly, but just assign a kernel

K(x, xi) = ⟨h(x), h(xi)⟩   (2.22)

that computes inner products in the transformed space. This avoids any direct computation on the transformed feature vectors which, depending on the choice of h(·), can be very high-dimensional. The requirement for K(·, ·) to be a kernel is that it is a symmetric positive (semi-)definite function (Mercer's theorem). For instance, the d-th degree polynomial

K(x, xi) = (1 + xT xi)^d   (2.23)

generalizes the empirical idea of higher-order terms discussed earlier, and includes the linear kernel of plain SVM (with h the identity function), which is linear in the features and essentially quantifies the similarity of a pair of observations using the Pearson (standard) correlation xT xi. The implicit mapping h can be found by developing the expression; for instance, for d = 2 and x = [x y]T, xi = [xi yi]T,

K(x, xi) = (1 + xT xi)² = (1 + x xi + y yi)² = 1 + x² xi² + y² yi² + 2 x xi + 2 y yi + 2 x xi y yi   (2.24)

which implies the nonlinear transformation

x → h(x) = [1  x²  y²  √2 x  √2 y  √2 xy]T   (2.25)

as it can be easily verified that K(x, xi) = h(x)T h(xi) returns (2.24).
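The identity K(x, xi) = h(x)T h(xi) in (2.24)–(2.25) can be checked numerically; the sample vectors below are arbitrary:

```python
import numpy as np

def h(v):
    # explicit feature map of eq. (2.25) for a 2-D input v = [x, y]
    x, y = v
    s = np.sqrt(2.0)
    return np.array([1.0, x * x, y * y, s * x, s * y, s * x * y])

def K(u, v):
    # degree-2 polynomial kernel of eq. (2.23)
    return (1.0 + u @ v) ** 2

u, v = np.array([0.5, -1.2]), np.array([2.0, 0.7])
print(np.isclose(K(u, v), h(u) @ h(v)))  # True: kernel = inner product in feature space
```

The kernel evaluates a 2-D inner product, yet implicitly operates in the 6-dimensional space defined by h; this is exactly the computational saving the kernel trick provides.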


Another popular choice is the Gaussian kernel or radial basis function (RBF)

K(x, xi) = exp(−β0 ‖x − xi‖²)   (2.26)

which has a sharp decay (regulated by the parameter β0), so that training observations far from x play essentially no role in the prediction. This type of kernel has a very local behavior, yet, perhaps surprisingly, enlarges the feature space to an infinite number of dimensions. To get an idea of why this happens, just consider the Taylor expansion of the exponential function, which has an infinite number of nonzero terms, with the argument replaced by the explicit expression of ‖x − xi‖² obtained by the multinomial theorem.19 Finally, the choice of a sigmoid

K(x, xi) = tanh(β1 ⟨x, xi⟩ + β2)   (2.27)

mimics the behavior of a two-layer neural network, discussed later in Section 2.4. Kernels can also be constructed from existing kernels through proper algebraic rules. The kernel trick is very powerful and enables the generalization of a number of techniques, yielding a palette of kernel methods. It is sufficient to rewrite a chosen algorithm so that it depends only on kernel dot products instead of the data points, thus obtaining a number of nonlinear algorithms. Examples are the kernel perceptron as well as the kernel discriminant analysis, also known as kernel Fisher discriminant analysis, which is a kernelized version of LDA. They are simply obtained by considering, instead of the original feature vector x, the transformed one h(x). The same idea can also be applied to other machine learning tools that are not classifiers, namely dimensionality reduction techniques like PCA (thereby obtaining kernel PCA). A real-world example of features for signal classification has been discussed earlier in Section 2.1.2 and Figure 2.10. We consider it again here, by selecting two signals and a 2-D feature space (maximum-to-minimum eigenvalue ratio and energy) that make the data linearly separable, as shown in Figure 2.17. Since linear separability is already present in the feature space x = [λmax/λmin  E]T, there is no need to use a nonlinear classifier, and plain SVM (with linear kernel) provides the best separating hyperplane. However, the use of nonlinear kernels projects the features into a larger space where the data have different properties, hence the use of polynomial or Gaussian kernels is also worth investigating. The resulting decision boundaries are clearly nonlinear, which might better accommodate test data that are slightly different from the training ones, hence providing better generalization capabilities. Notice also that the RBF may produce closed decision boundaries, even noncontiguous ones, which may be important in some applications. For nonlinearly

19. A deeper motivation is given by the theory of reproducing-kernel Hilbert spaces (RKHS), which is beyond the scope of this book (see [6, 10]), although some aspects are discussed in Section 5.2.4.

Figure 2.17 SVM-based classification of radiofrequency signals with different decision boundaries according to the choice of the kernel.

separable data in the original feature space, the use of kernels is unavoidable anyway. Results for the 3-D feature space considered in Figure 2.10 show that the use of higher-order terms (polynomial kernels) improves the classification performance in spectrum sensing [17]. 2.3.4 Decision Trees and Random Forests Among traditional machine learning tools, it is instructive to consider the random forest algorithm because of its different approach compared to KNN and SVM on the one side, and neural networks on the other, although it shares some characteristics with all of them. Developed by Leo Breiman in 2001 [18], it is an ensemble method based on the concept of a decision tree. A decision tree is a flowchart-like structure in which a decision is taken (output) at each node based on the information coming from its parent node (input), forming a classification tree or chart. In the context of machine learning, the resulting binary tree processes the feature vector one feature at a time, choosing one of the two paths at each node bifurcation according to the value of that feature. The decision at each node is taken by comparing the value of the considered feature with a threshold. The order of feature processing, from the root of the tree down to the leaves, as well as the threshold values at the intermediate nodes, are determined during the training process.


Decision trees can represent any multistage decision process. As a notable example, the two-stage radar detectors encountered in Section 1.4.4 can be regarded as simple decision trees of depth two.20 Decision trees are, however, more interesting for problems with a larger number of features, where the splitting possibilities become exponentially large. Algorithms for constructing decision trees usually work top-down, choosing at each step the feature that best splits the data according to some metric, such as the misclassification error, Gini impurity, information gain, variance reduction, or other measures of goodness. Optimal learning is known to be NP-complete, which is why practical decision-tree learning algorithms are based on heuristics, such as greedy algorithms where locally optimal decisions are made at each node. As a trivial example, for the case of Figure 2.17 the Gini criterion yields a decision boundary that is a horizontal line separating the two clusters, which is reasonable since the data is linearly separable. In fact, the learned decision tree exhibits E = 0.0904 as the splitting condition. Note that a decision tree can only implement splits parallel to the coordinate axes in the feature space; any nonlinear splitting is thus approximated by a composition of smaller axis-parallel regions. However, decision trees are typically weak learners, which means that their performance may be only slightly better than random guessing. They tend, in fact, to overfit data due to their intrinsic growing process, which is very nonrobust (small changes in the training data can result in large differences in the tree and the resulting predictions). Although some mechanisms, such as pruning, can be adopted to reduce this large variance, they increase the bias, hence possibly worsening the overall performance. A more interesting strategy is to perform ensemble learning.
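A single node split by the Gini criterion, of the kind just described, can be sketched as follows; the energy values are invented for illustration (not the E = 0.0904 case from the text):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a binary label array."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def best_split(x, labels):
    """Exhaustively search the threshold on one feature minimizing the
    weighted Gini impurity: a one-node decision tree ('stump')."""
    best_t, best_score = None, np.inf
    for t in np.unique(x):
        left, right = labels[x <= t], labels[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Illustrative 1-D energy feature with separable classes
E = np.array([0.02, 0.04, 0.05, 0.07, 0.12, 0.14, 0.15, 0.18])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(best_split(E, y))  # → 0.07 (perfect split, zero impurity on both sides)
```

Deeper trees apply the same search recursively on each side of the split, one feature at a time, which is why the resulting decision regions are unions of axis-parallel boxes.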
The latter is based on the observation that a proper combination of weak learners can yield a strong learner, and this is essentially what random forests do with decision trees. Ensemble learning lies at the basis of several metaheuristic algorithms aimed at improving the performance of a given classifier, such as bagging (bootstrap aggregating) and boosting (implemented in the popular AdaBoost algorithm). Bagging tries to approximate a consensus of NL independent weak learners, each with error rate e < 0.5 (even only slightly less, hence weakly better than random guessing). The number of votes for the correct class is thus distributed as a binomial random variable with parameters (NL, 1 − e), hence the probability that such votes take the majority (i.e., exceed NL/2) approaches one for large NL. This concept has been popularized as the wisdom of crowds [23], since the collective knowledge of a diverse and independent body of people is typically better than that of any single individual, and can be harnessed by voting [6]. However, a crucial assumption is the independence between the classifiers, and bagged

20. We recall that two-stage detectors are formed by cascading a robust detector and a selective detector, such as the AMF and Kelly's statistics, or the AMF and the ACE in the so-called adaptive sidelobe blanker [19, 20], as well as the discussed detectors based on the random-signal approach [21, 22].


trees are not, because they are obtained by random sampling with replacement of the training set. Boosting also considers NL weak classifiers fj(x), and builds a committee that takes the final decision based on a weighted majority vote; that is:

y = f(x) = sign( Σ_{j=1}^{NL} aj fj(x) ).   (2.28)

However, boosting proceeds quite differently from bagging in the way the trees are grown. In particular, data modifications are performed at each boosting step, by applying weights w1, . . . , wNT to each of the training observations (xi, ℓi), i = 1, . . . , NT. Initially all weights are set to wi = 1/NT, but in successive iterations they are modified and the classification is performed again on the weighted observations. At each step, observations misclassified by the classifier induced at the previous step have their weights increased; conversely, the weights are decreased for those correctly classified. Thus, as iterations proceed, observations that are difficult to classify correctly receive ever-increasing influence, and each successive classifier is forced to concentrate on them [6]. Random forests are a specific type of bagging for decision trees, aimed at reducing the correlation between the sampled trees. They use a modified tree-learning algorithm that selects, at each iteration, a random subset of the features (feature bagging in addition to tree bagging). This solves an issue of standard bagging, in which features that are very strong predictors of the target output are selected in many of the trees, causing correlation among the trees. The performance of a bag of trees is often very similar to boosting, but it is simpler to train and tune, since trees can be grown in parallel, while boosting is an inherently sequential process. Returning to the simple example above, training a bag of trees would yield alternative splitting values, for instance 0.0899, 0.0868, 0.0984, corresponding to slightly shifted decision boundaries and, consequently, different generalization performance. The average or weighted combination of them provides improved classification accuracy. This, however, comes at the price of sacrificing interpretability.
In fact, while decision trees are simple enough to be easily interpretable (an exception rather than the rule among machine learning tools), since they highlight the role of each single feature and its relevance in determining the decision, combining many trees undermines this possibility. Interesting compression techniques have been proposed to obtain a minimal born-again decision tree that is equivalent to a given random forest, thereby retrieving interpretability [24]. We will revisit interpretability throughout the rest of the book and, in particular, discuss it in Section 5.3.2. As a final note, we point out that several variants of random forests exist, and connections have been drawn with both KNN and kernels, showing that random forests can be seen as adaptive kernel estimates.
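The wisdom-of-crowds argument behind bagging is easy to quantify: with NL independent voters, each wrong with probability e < 0.5, the probability that the majority vote is correct follows from the binomial distribution (committee sizes below are arbitrary):

```python
from math import comb

def p_majority_correct(NL, e):
    """Probability that a majority of NL independent voters, each correct
    with probability 1 - e, gets the right answer (NL odd)."""
    p = 1 - e
    return sum(comb(NL, k) * p**k * (1 - p)**(NL - k)
               for k in range(NL // 2 + 1, NL + 1))

# Even slightly-better-than-random learners (e = 0.45) yield a strong committee
for NL in (11, 101, 1001):
    print(NL, round(p_majority_correct(NL, 0.45), 4))
```

The probability climbs toward 1 as NL grows, which is exactly the consensus effect bagging tries to exploit; correlation between the learners, as discussed above, breaks the independence assumption and erodes this gain.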


2.3.5 Other Machine Learning Tools Obtaining labeled data may be a difficult task, because annotation takes a significant amount of time, and even accessing the ground truth may be problematic in several contexts. Radar systems, in general, are not as widespread as cameras or microphones, so collecting a large amount of data and labeling them brings quite different complexity and cost in the two cases. This partially explains why data-hungry machine learning tools, in particular deep learning (Section 2.4.3), have been especially applied to speech and video data, which can be easily acquired and whose labeling is a nonspecialist task that can be distributed among many people.21 We will see that the issue of obtaining labeled data can be mitigated by techniques for augmenting, or even replacing, real training datasets via synthetic data generation. Not all learning methods, however, need labeled data. Unsupervised learning refers to a class of methods that do not use labels, but instead try to classify data based on some properties that differentiate the different classes. In clustering methods, the aim is to discover intrinsic patterns that aggregate data into homogeneous clusters, based on some distance in a metric space and the usual definition of neighborhood. Examples are K-means, DBSCAN, and many more. Anomaly detection is a different version of the clustering problem, where the goal is to identify comparably rarer cases (outliers) with respect to the usual data distribution, in this sense defined as anomalies. Examples are isolation forests and one-class SVM (we will revisit both in Chapter 3). Dimensionality reduction can be considered a special type of unsupervised learning. We have already discussed its role in obtaining a more compressed and informative feature vector before further processing, a prominent example being PCA. Other popular techniques are based on random projections, sketching, and the Johnson-Lindenstrauss lemma (e.g., t-distributed stochastic neighbor embedding (t-SNE) [25]), which guarantees that data can be projected onto a smaller space while approximately preserving distances. Sometimes the number of dimensions can be lowered to two or three, allowing for a convenient and trustworthy graphical visualization of the feature space,22 despite the original data at hand being high-dimensional. A different family of unsupervised learning techniques, which can also be used for dimensionality reduction, is that of autoencoders, addressed later in Section 2.4.3. Semisupervised learning also exists, which tries to improve performance by combining (a moderate amount of) supervised learning with unsupervised learning. The idea is that, while labeled data is much more informative, it is harder

21. Good examples of how a large amount of data can be labeled through collaborative crowdsourcing efforts are Amazon Mechanical Turk (https://www.mturk.com) and Google's reCAPTCHA system (https://support.google.com/recaptcha).
22. Visualization for data-driven algorithms is discussed in Section 5.3.3.


to obtain than unlabeled data; nonetheless, the distribution of the latter around labeled points can enhance the understanding of the intrinsic data structure, with positive effects on the generalization accuracy. Finally, a completely different paradigm is reinforcement learning. This is more related to the way a biological intelligence naturally learns, at least to some extent. Reinforcement learning does not use any training dataset; rather, it focuses on the actions taken by an agent within a surrounding environment, which generates a reward fed back to update the agent's internal state ("representation of the world" or "worldview"). The problem is formulated as cumulative reward maximization under a Markov decision process framework, and is solved by finding the appropriate sequence of actions. The latter should balance the exploration of the solution space in terra incognita with the exploitation of the current knowledge. In this respect, similarities are found with dynamic programming, which, however, assumes an analytical model for the Markov process, while reinforcement learning is purely data-driven. Reinforcement learning can also be combined with supervised learning to take advantage of both approaches.
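Among the unsupervised methods mentioned in this section, K-means is simple enough to sketch in full; the data and the deterministic initialization below are invented for illustration:

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Plain K-means sketch: alternate nearest-centroid assignment and mean update."""
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each centroid as the mean of its cluster
        centers = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
    return labels, centers

# Two illustrative clusters around (-3, -3) and (+3, +3)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 0.5, (50, 2)), rng.normal(+3, 0.5, (50, 2))])
labels, centers = kmeans(X, centers=X[[0, -1]].copy())  # init with one point per cluster
print(np.sort(centers[:, 0]))                            # centroids near -3 and +3
```

No labels are used anywhere: the grouping emerges purely from distances, which is the defining trait of the clustering methods above.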

2.4 Neural Networks and Deep Learning Neural networks (NNs) are a class of machine learning tools whose importance has grown enormously over the past decades. The traditional term "artificial neural network" highlights the fact that they were introduced as a way to mimic the behavior of biological neural tissue inside the brain. Although many different types of NNs exist, they share the definition of a model of the neuron and the interconnection of several layers of neurons, with the input signal passing through the resulting network to produce the final output. This is the basic architecture of feedforward networks, while recurrent NNs also include feedback connections with preceding layers. Moreover, NNs can be shallow or deep in the number of layers, and more or less wide in the number of neurons per layer. Although the origin of NNs dates back to the late 1950s, the new-millennium hype was boosted, as already mentioned, by the availability of large amounts of labeled multimedia data (sound, speech, text, images, and videos), thanks to the advance of electronic devices available in the consumer market. At the same time, computational power has increased enormously, in particular with the repurposing of graphics processing units (GPUs) for machine learning. Just as important, very large investments have been flowing toward intelligent technologies, since the latter are expected to produce significant returns in many business markets, and NN-based learning represents a cutting-edge solution for several problems [26]. This has also resulted in the widespread availability of libraries, packages, and frameworks that ease the implementation of (especially deep) learning tools. We will see, however, that all this should be taken with a grain of salt when

Coluccia: “chapter2_v2” — 2022/10/7 — 13:06 — page 85 — #37


Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

moving to specific contexts, namely target detection in a radar context, where other important requirements and constraints exist.

2.4.1 Multilayer Perceptron
As explained in Section 2.3.2, the development of NNs was triggered by Rosenblatt's perceptron [10, 3]. This is essentially a linear combiner, through a set of synaptic weights w and a bias b, followed by a nonlinear activation function ϕ(·); that is:

y = f(x) = ϕ(w^T x + b).        (2.29)

In the original McCulloch-Pitts model of a neuron, the signum step function is adopted, but later other activation functions have been introduced, in particular sigmoids based on the hyperbolic tangent ϕ(x) = tanh(x) or the logistic function ϕ(x) = 1/(1 + e^{−x}), and the rectifier or rectified linear unit (ReLU) ϕ(x) = max(0, x). Moreover, while a single perceptron can classify only linearly separable data (hence, for instance, fails on XOR patterns, as discussed in Section 2.1.4), the multilayer perceptron can learn an approximation f̂ with a nonlinear boundary between decision regions, overcoming the limitation of a single neuron, which only admits a linear decision boundary (separating hyperplane). Figure 2.18 shows an example of a multilayer network, composed of the input layer, the output layer, and q hidden layers. For binary classification problems, the output layer typically contains M = 2 neurons that represent the two classes, with encoding e1 = [1 0]^T and e2 = [0 1]^T (see Section 2.1.1). The input layer starts with the feature vector x, which may coincide with raw data o. The standard feedforward architecture is also fully connected, which means that the output of each neuron of a given layer goes in input to all neurons of the subsequent layer. The resulting input-output mapping is thus given by the composition of linear combiners and nonlinear activation functions. For instance, a NN with a single hidden layer (q = 1) will produce as k-th output the value

y_k = ϕ( Σ_{i=1}^{p} w_{ki}^{(2)} ϕ( Σ_{j=1}^{n} w_{ij}^{(1)} x_j^{(0)} + b_i^{(1)} ) + b_k^{(2)} ),   k = 1, ..., M        (2.30)

where the superscript between round brackets indicates the corresponding layer, from the input layer (0, so the x_j^{(0)} are the components of x) to the output layer (2 = q + 1), passing through the hidden layer (1) of width p neurons. In Figure 2.18 this corresponds to having the first (dark gray) hidden layer connected directly to the output layer, with no additional (light gray) hidden layers. Many


Classification Problems and Data-Driven Tools

Figure 2.18  General architecture of a neural network (M = 2 is for binary classification).


variants of this architecture are possible, with a variable number of neurons at each hidden layer and additional processing units to perform certain operations, based on the type of NN and learning task to be performed.

One of the key success factors of NNs is that they can be iteratively trained by a simple algorithm similar to the least mean square (LMS). Learning consists of adjusting weights and biases to minimize the classification error computed on the training dataset, a procedure known as backpropagation since it is based on the propagation of signal errors from the output layer back to the input layer. The functional transformations involved in the input-output mapping are in fact differentiable, thus the gradient of the loss function with respect to the weights of the network can be computed by the chain rule, and an update recursion derived for the weights and bias of all neurons [27]. The latter contains a parameter called the learning rate, which is basically the step size of the iterative minimization procedure. Proper setting of the learning rate is a trade-off between the rate of convergence (hence the time needed to train the NN) and oscillations due to overshooting, or even convergence to unsatisfactory local minima. To avoid the otherwise necessary manual tuning by trial and error, adaptive gradient descent algorithms are commonly used, such as Adagrad, RMSprop, and Adam. Moreover, stochastic gradient descent is often adopted, where computation of the actual (full) gradient is replaced by an estimate calculated from a randomly selected subset called a mini-batch, an idea that can be traced back to the Robbins-Monro algorithm [3]. This is especially important when the input layer is high-dimensional, to reduce the computational cost by trading off more iterations (due to the reduced convergence rate of the stochastic gradient compared to the actual gradient) against significantly shorter per-iteration time.
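To make (2.30) and the backpropagation recursion concrete, the sketch below trains a single-hidden-layer perceptron on the XOR patterns of Section 2.1.4 via plain batch gradient descent. The layer width, learning rate, iteration count, and initialization are illustrative choices, not values from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR patterns: not linearly separable, so a single perceptron must fail
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one hidden layer of width p = 4 (cf. (2.30)) and a single output neuron
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

mse_init = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - t) ** 2)

eta = 2.0  # learning rate: the step size of the iterative minimization
for _ in range(8000):
    # forward pass: composition of linear combiners and activations, cf. (2.30)
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backpropagation: chain rule applied from the output layer back to the input
    d_out = (y - t) * y * (1 - y)          # error term at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)   # error propagated back to the hidden layer
    W2 -= eta * (h.T @ d_out); b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * (X.T @ d_hid); b1 -= eta * d_hid.sum(axis=0)

print(np.round(y.ravel(), 2))
```

When training succeeds, the four outputs approach [0, 1, 1, 0]; convergence on XOR depends on the (here arbitrary) initialization and learning rate, which is precisely the kind of sensitivity that motivates adaptive schemes such as Adagrad, RMSprop, and Adam.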
An important theoretical result obtained by Cybenko [28], later extended to more general settings, ensures that multilayer perceptrons with a single hidden layer are already universal function approximators, meaning that they can be used to approximate any mapping f (see Section 2.1.4). In this respect, any addition of layers would appear to be redundant. However, like most proofs of existence, Cybenko's theorem only proves that universal approximation is possible; it does not provide any practical way to obtain it, nor does it guarantee that the resulting NN would not require an extremely wide hidden layer (huge number of neurons). Later studies have shown that using many layers of limited width can be beneficial: such an approach, popularized as deep learning, is discussed below.

2.4.2 Feature Engineering vs Feature Learning
Although the input of NNs can be any vector of features, an appealing idea is that the network itself could extract the most appropriate features during the training process. It is common, in fact, that NNs are fed by data that have just been preprocessed for better conditioning and consistency but with no feature extraction, or even by raw data. This approach is known as feature learning or representation


learning, and is meant to replace manual feature engineering. There is thus a paradigm shift when moving from traditional machine learning to NNs and, in particular, deep NNs where the number of hidden layers is particularly large. The hidden neurons can in fact act as feature extractors, since during the training they gradually discover the salient features that characterize the training data [10]. In doing so, the first layers build a mapping toward an implicit feature space, then the actual classification task is performed by later layers closer to the output. Feature learning introduces a point of philosophical interest: not only is the machine used to learn an unknown mapping f, but it also learns the best feature-space representation toward this goal. This is done, however, in a purely data-driven fashion, while in traditional machine learning the definition of an appropriate feature space is a main task for the designer, who should put into it all their domain knowledge. Unfortunately, in several cases real-world data resist the definition of specific features that can capture all the complexity and relevant information needed for accurate performance. The alternative of feature learning is thus very appealing. Still, in some contexts where physics plays a major role, and radar is one of those, giving up feature engineering may come with some drawbacks. More generally, a purely data-driven approach introduces a strong dependency on the training dataset, which should therefore be very large and as comprehensive as possible regarding the different operational conditions. Although feature learning can also be implemented in shallow NNs, it is especially in the context of deep learning that it emerged, due to the large number of degrees of freedom that can accommodate an effective representation learning.
However, this exacerbates the requirement on the size of the training dataset, and obtaining a huge amount of high-quality (labeled) data might not be feasible in certain applications. We will touch on all such issues several times later in the book. But now, we devote the last part of this chapter to reviewing the most important deep learning concepts and tools.

2.4.3 Deep Learning
Deep learning generally refers to a class of machine learning techniques with many layers of information processing stages that are exploited for feature learning and pattern analysis/classification [26, 29]. This includes many different methods and encompasses NNs as well, in this context called deep neural networks (DNNs). Particular architectures have been proposed in addition to the fully connected multilayer perceptron scheme, as discussed below.

2.4.3.1 Convolutional Neural Networks

One of the most important types of DNNs, owing to its surprising performance in many speech and image processing problems, is that of convolutional neural networks


(CNNs). They are inspired by the physiological characteristics of the visual cortex, where receptive fields act locally to identify particular patterns in a certain region of the visual field [30]. This leads to the idea of using convolution layers specialized for one task (e.g., edge detection in image processing, similar to what happens in the visual cortex) and applied only to a certain region of the previous layer (receptive field). Initially, convolution kernels were manually engineered, but according to the purely data-driven paradigm that emerged over the last decades, it turned out that they can be efficiently learned by backpropagation. Moreover, to reduce the number of parameters, weights are shared among convolutional units. This also enables parallel computation during the training, which can today be sped up significantly by dedicated GPU hardware, contributing to the current hype around this type of NN. In more detail, a CNN is a multilayer perceptron designed specifically to recognize two-dimensional shapes with a high degree of invariance to translation, scaling, and other forms of distortion [10]. It alternates between convolution and processing (subsampling and local averaging) layers, and its merits are due to the additional constraints posed on the architecture, compared to the generic fully connected multilayer perceptron. The main characteristics are:

• Feature extraction: Each neuron takes its inputs from a local receptive field in the previous layer, thereby extracting local features.

• Feature mapping: Each layer is composed of multiple feature maps, sharing the same set of synaptic weights. Besides reducing the number of parameters to be learned, this promotes shift invariance through the use of convolution with a kernel of small size followed by a sigmoid function.
• Subsampling: Each convolutional layer is followed by a computational layer for local averaging and subsampling, which reduces the resolution of feature maps, thereby promoting invariance to shifts and other forms of distortion.

Convolution is a term clearly reminiscent of the signal processing operation in linear time-invariant systems, and in fact the convolution kernel is actually the same as a (two-dimensional) linear filter. It can thus be interpreted as a sliding window moving over the input, selecting a small portion, and computing a correlation.

2.4.3.2 Training a DNN

DNNs are generally prone to overfitting because of their large capacity; that is, a high number of degrees of freedom that may capture overly detailed aspects of the training data. Regularization methods and dropout are typically used during the training to mitigate this problem. Regularization consists of adding to the loss function a term to promote certain properties of the solution by penalization,


for instance using a norm multiplied by a chosen factor. Dropout instead refers to the trick of randomly omitting some neurons from the hidden layers during training, which means considering a smaller network with some parts deactivated. This yields at each iteration a simpler model instance with less capacity (fewer degrees of freedom), thus overall reducing the overfitting. Moreover, training datasets can be extended in size by means of data augmentation techniques, for instance cropping and rotating images in the case of CNNs.

One of the main issues in training a deep architecture is the vanishing gradient problem. In fact, activation functions like sigmoids shrink their input into a saturated interval, for instance (0, 1) or (−1, 1), which means that a large variation in the argument of the activation function ϕ will cause a small variation in the returned value. As a consequence, the gradient will be small. By adding many layers in the network, the ultimate value of the backpropagated gradient will be given by the product of several such terms (chain rule), so the feedback error will quickly (exponentially) approach zero and the network will stop learning (weights remain practically unchanged over the iterations). Mitigation solutions are the use of unbounded activation functions such as ReLU (which, however, may incur the opposite effect of an exploding gradient), the introduction of batch normalization layers, the use of additional direct connections that bypass activation functions (the skip connections used in the ResNet architecture), and suitable weight initializations.

A further aspect to consider is that DNNs contain a number of hyperparameters that control the learning process, which cannot be learned through backpropagation. These include the number of layers and the number of neurons per layer, as well as the learning rate and the initial values of the weights.
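Before moving to hyperparameter tuning, the multiplicative shrinkage behind the vanishing gradient described above can be seen with a few lines of arithmetic; the assumed weight magnitude below is purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    # derivative of the logistic sigmoid; its maximum value is 0.25, at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

# A backpropagated gradient through a chain of layers is (roughly) a product
# of one activation derivative and one weight per layer (chain rule).
w = 0.8          # assumed typical weight magnitude
for depth in (2, 10, 50):
    grad = 1.0
    for _ in range(depth):
        grad *= w * d_sigmoid(0.0)   # best case: the derivative at its maximum
    print(f"depth {depth:2d}: gradient factor {grad:.3e}")
```

Since |w| * 0.25 = 0.2 < 1, the factor decays exponentially with depth; by contrast the ReLU derivative equals 1 on its active side, which is why it mitigates the effect (at the risk of an exploding gradient when the per-layer factor exceeds 1).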
Manual tuning is time-consuming and lacks established guidelines, being based more on the experience of the designer and on the specific problem. Several procedures have been identified for hyperparameter optimization. The goal is to find a tuple of hyperparameters that is optimal in the sense that it minimizes a certain loss function. Cross-validation is typically used to estimate the associated generalization performance. The most naive approach is a grid search (or parameter sweep), which is an exhaustive method. This is, however, feasible mostly for shallow architectures, while it becomes too computationally demanding for deep networks. Many alternative methods have been proposed, such as random search and Bayesian optimization.

2.4.3.3 Autoencoders and Other Types of Neural Networks

A large number of different NNs have been developed over a relatively short period of time, due to the hype and significant investments in deep learning worldwide. A noteworthy type is called autoencoder and it is particularly attractive because it is an unsupervised method for learning the “best” data representation


in a latent feature space. This means that it can be used as a feature extractor, but, additionally, it can serve as a (deep) dimensionality reduction technique [31]. An autoencoder is a feedforward multilayer perceptron-like network that has an input layer fed with the original raw data or feature vector, one or more hidden layers, and an output layer that matches the input layer for reconstruction. While autoencoders are often trained with a single-layer encoder and a single-layer decoder, the use of deep architectures brings advantages in terms of computational complexity, required size of the training set, and compression rate [26]. The width of the hidden layers can be either smaller (when the goal is feature compression) or larger (when the goal is mapping the features to a higher-dimensional space) than the input dimension [29]. In fact, while the output layer has the same number of neurons as the input layer, the feature space may have smaller or larger (or equal) width. In the former case (undercomplete autoencoders), a compressed representation of the input is learned; in the latter case (overcomplete autoencoders), interesting representations can be obtained by introducing regularization (e.g., sparse, denoising, or contractive autoencoders). To learn a representation (encoding) for the data, backpropagation is used to refine the feature extraction so as to reconstruct the input from the encoding. Many variants exist that try to promote some properties in the learned representations [26]: for instance, in addition to the already mentioned regularized autoencoders, variational autoencoders are of interest since they can effectively be used as powerful generative models. Autoencoders can also be stacked inside DNNs. As one might expect, if linear activations are used (or only a single sigmoid hidden layer), the optimal autoencoder becomes equivalent (although not equal) to the PCA.
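A minimal undercomplete autoencoder can be sketched in a few lines. Here a linear encoder/decoder pair is trained by gradient descent on toy data lying near a two-dimensional subspace (the data, layer sizes, and step size are illustrative assumptions), so that the learned code approximates the PCA subspace, consistent with the equivalence just noted:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data lying near a 2-D subspace of a 6-D space, so 2 latent units suffice
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(200, 6))

# undercomplete autoencoder: 6 -> 2 (encoder) -> 6 (decoder); linear here,
# so the learned code approximately spans the PCA subspace of the data
W_enc = 0.1 * rng.normal(size=(6, 2))
W_dec = 0.1 * rng.normal(size=(2, 6))

def recon_error(W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

err0 = recon_error(W_enc, W_dec)
eta = 0.02
for _ in range(3000):
    code = X @ W_enc            # encoding: latent (compressed) representation
    X_hat = code @ W_dec        # decoding: reconstruction of the input
    E = X_hat - X               # reconstruction error, to be minimized
    # gradient steps on the mean squared reconstruction error
    W_dec -= eta * (code.T @ E) / len(X)
    W_enc -= eta * (X.T @ (E @ W_dec.T)) / len(X)

err1 = recon_error(W_enc, W_dec)
print(f"reconstruction MSE: {err0:.3f} -> {err1:.5f}")
```

Replacing the linear maps with nonlinear activations (and adding more layers) turns this sketch into the general autoencoder discussed in the text.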
However, the great advantage of autoencoders, as for any NN, comes from the nonlinearity of the activation functions, which generally yields a representation with less information loss and better generalization compared to PCA. To conclude, it is worth briefly mentioning other families of NNs that provide alternatives to feedforward (D)NNs. A different approach is to capture some temporal information explicitly by using directed connections between neurons and internal states (memory) with feedback. The resulting networks, called recurrent neural networks (RNNs), have dynamic behaviors: they can indeed be linked to several other disciplines, including system theory and statistical physics. Popular examples are associative memories, such as Hopfield networks and bidirectional associative memories, and long short-term memory (LSTM) networks. The latter are particularly important since they revolutionized application fields where the temporal information in the sequence of data is important, such as speech recognition. Their internal structure, based on gates, is clearly inspired by signal processing concepts. While the model of a neuron with a linear combiner and a nonlinear activation function is by far the most widely adopted, it should be noted that other


possibilities exist. In particular, stochastic models have been under consideration for decades, a well-known one leading to the so-called Boltzmann machines (stochastic Hopfield network with hidden units) and restricted Boltzmann machines (RBMs). The latter solve the issue of impractical learning of general Boltzmann machines by removing some of the connections from the architecture. RBMs can be learned alone and stacked in many layers, achieving deep learning. More discussion on stochastic machines will be provided in Section 5.2.5.

2.5 Summary
This chapter has been a long and dense one, in an attempt to provide a self-consistent but neat introduction to the problem of classification as well as to the main machine learning tools. Several concepts and algorithms have been introduced, while keeping reference to the scope of this book, which is radar detection: a similar, but different, problem compared to binary classification. The relationship between the Neyman–Pearson rationale, typical of radar detection, and the use of loss functions in classification problems, following the formulation of statistical learning, has been highlighted. The ultimate goal was to provide the necessary background for discussing data-driven approaches to radar detection in Chapter 3. The review of linear/nonlinear classifiers and several popular machine learning tools (including SVM, KNN, random forests, neural networks, and deep learning) indeed focused on approaches suitable for radar detection. Supervised learning has been particularly illustrated, but some unsupervised techniques, either shallow or deep, have also been reported. The main aspects of neural networks, and the peculiarities of deep learning, have been reviewed. Some issues have been highlighted along the way. Addressing the radar detection problem through the use of machine learning tools is in fact a challenging and articulated task, given its peculiarities. The rest of the book will discuss the main approaches that have been devised, and will develop a discussion around the questions raised so far.

References

[1] LeCun, Y., and C. Cortes, "MNIST Handwritten Digit Database," 2010, http://yann.lecun.com/exdb/mnist/.

[2] Cohen, G., S. Afshar, J. Tapson, and A. van Schaik, "EMNIST: An Extension of MNIST to Handwritten Letters," 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017.

[3] Bishop, C. M., Pattern Recognition and Machine Learning (Information Science and Statistics), Berlin: Springer-Verlag, 2006.

[4] Proakis, J., Digital Communications, Fifth Edition, McGraw Hill, 2007.

[5] James, G., D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R, Springer, 2013, https://faculty.marshall.usc.edu/gareth-james/ISL/.

[6] Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, New York: Springer, 2001.

[7] Stewart, R. W., K. W. Barlee, and D. S. Atkinson, Software Defined Radio Using MATLAB & Simulink and the RTL-SDR, Strathclyde Academic Media, 2015.

[8] Zeng, Y., and Y.-C. Liang, "Spectrum-Sensing Algorithms for Cognitive Radio Based on Statistical Covariances," IEEE Transactions on Vehicular Technology, Vol. 58, No. 4, May 2009.

[9] Cover, T. M., "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition," IEEE Transactions on Electronic Computers, Vol. EC-14, No. 3, 1965, pp. 326–334.

[10] Haykin, S. S., Neural Networks and Learning Machines, Third Edition, Upper Saddle River, NJ: Pearson Education, 2009.

[11] Chi, Y., and M. Ferreira Da Costa, "Harnessing Sparsity over the Continuum: Atomic Norm Minimization for Superresolution," IEEE Signal Processing Magazine, Vol. 37, No. 2, 2020, pp. 39–57.

[12] Donoho, D., and X. Huo, "Uncertainty Principles and Ideal Atomic Decomposition," IEEE Transactions on Information Theory, Vol. 47, No. 7, 2001, pp. 2845–2862.

[13] Chen, Y., and Y. Chi, "Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation: Recent Theory and Fast Algorithms via Convex and Nonconvex Optimization," IEEE Signal Processing Magazine, Vol. 35, No. 4, 2018, pp. 14–31.

[14] Fisher, R. A., "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, Vol. 7, No. 7, 1936, pp. 179–188.

[15] Pearson, K., "On Lines and Planes of Closest Fit to Systems of Points in Space," Philosophical Magazine, Vol. 2, 1901, pp. 559–572.

[16] Jolliffe, I., Principal Component Analysis, Springer-Verlag, 1986.

[17] Coluccia, A., A. Fascista, and G. Ricci, "Spectrum Sensing by Higher-Order SVM-Based Detection," 2019 27th European Signal Processing Conference (EUSIPCO), 2019, pp. 1–5.

[18] Breiman, L., "Random Forests," Machine Learning, Vol. 45, No. 1, 2001, pp. 5–32.

[19] Bandiera, F., D. Orlando, and G. Ricci, Advanced Radar Detection Schemes under Mismatched Signal Models, San Rafael, CA: Morgan & Claypool Publishers, 2009.

[20] De Maio, A., and D. Orlando, "A Survey on Two-Stage Decision Schemes for Point-Like Targets in Gaussian Interference," IEEE Aerospace and Electronic Systems Magazine, Vol. 31, No. 4, 2016, pp. 20–29.

[21] Coluccia, A., and G. Ricci, "A Random-Signal Approach to Robust Radar Detection," 52nd Annual Conference on Information Sciences and Systems (CISS), 2018, pp. 1–6.

[22] Coluccia, A., G. Ricci, and O. Besson, "Design of Robust Radar Detectors Through Random Perturbation of the Target Signature," IEEE Transactions on Signal Processing, Vol. 67, No. 19, 2019, pp. 5118–5129.

[23] Surowiecki, J., The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, Doubleday, 2004.

[24] Vidal, T., and M. Schiffer, "Born-Again Tree Ensembles," Proceedings of the 37th International Conference on Machine Learning (ICML 2020), PMLR, 2020, pp. 9743–9753, http://proceedings.mlr.press/v119/vidal20a.html.

[25] van der Maaten, L., and G. Hinton, "Visualizing High-Dimensional Data Using t-SNE," Journal of Machine Learning Research, Vol. 9, 2008, pp. 2579–2605.

[26] Goodfellow, I., Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.

[27] Haykin, S. S., Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, 1998.

[28] Cybenko, G., "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems, Vol. 2, No. 4, 1989, pp. 303–314.

[29] Deng, L., and D. Yu, "Deep Learning: Methods and Applications," Foundations and Trends in Signal Processing, Vol. 7, No. 3–4, 2014, pp. 197–387, http://dx.doi.org/10.1561/2000000039.

[30] LeCun, Y., and Y. Bengio, "Convolutional Networks for Images, Speech, and Time Series," in The Handbook of Brain Theory and Neural Networks, M. Arbib (ed.), Cambridge, MA: MIT Press, 2003.

[31] Hinton, G. E., and R. R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, Vol. 313, No. 5786, 2006, pp. 504–507.


3 Radar Applications of Machine Learning
This chapter illustrates in detail the applications of the tools discussed in Chapter 2 to the radar field. In particular, it discusses how machine learning, and more generally data-driven tools, have been applied to the problem of target detection in the literature, but other applications of machine learning techniques, including signal classification, will also be reviewed. With the onset of the third millennium, in fact, machine learning has experienced renewed interest, driven by the advances in classification and recognition tasks on speech and image data. At the same time, social networks and online platforms for shopping, streaming, and other services have started to provide unprecedented types of data. Key factors in this revolution are the availability of huge datasets and ready-to-use frameworks: multimedia and social data are in fact abundant and easy to collect, and tech companies are pushing for applications of their machine learning tools in more and more contexts. More controversial is instead whether and how data-driven tools can be applied to fields where performance guarantees and interpretability are important, and huge datasets are not available. In this respect, the combination of data-driven and model-based approaches seems wiser and more attractive, but it has only very recently started to attract interest in the radar community: this will be covered in Chapter 4. In this chapter we will conversely focus on purely data-driven approaches for radar applications (Section 3.1), signal classification (Section 3.2), and especially target detection (Section 3.3).

3.1 Data-Driven Radar Applications
As mentioned, a key driver for the current machine learning (and artificial intelligence at large) hype can be identified in the significantly improved


performance brought by neural networks, and in particular deep learning (see Section 2.4), on classification and recognition of multimedia signals. It comes as no surprise, thus, that the first radar applications that benefited from such advances are those more directly related to image processing, namely SAR, radar imaging, and target classification based, for example, on the micro-Doppler signature [1–3]. Since the scope of this book is target detection, in the following we will give only a quick overview of the main possibilities in such other applications. Classification tasks for complex data, which may lack accurate models, are indeed very suitable to be attacked via data-driven techniques. Figure 3.1 shows a general scheme of the main processing steps in data-driven radar applications that aim at a classification or decision goal. Data in input can be provided from either measurements or synthetic (computer-based) simulations, with associated labels; in the former case, the dataset may be extended through data augmentation, as discussed in more detail in Section 3.2.2.4. Such data is used in the training phase to learn the parameters of the machine learning algorithm, which represents the offline phase. Then, in the operational online phase, (unlabeled) raw measurements are fed in input to the algorithm to obtain a prediction of their label (i.e., a decision), which according to the considered application is related to the detection or classification of targets (or their properties).

Figure 3.1  General scheme of the main processing steps in data-driven radar applications.

1. Note, however, that the use of the micro-Doppler in such applications does not necessarily require a purely data-driven approach and that several model-based approaches are also available.
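The offline/online flow of Figure 3.1 can be sketched with a deliberately simple stand-in for the learned algorithm (a nearest-centroid classifier); the synthetic "feature vectors," dimensions, and class statistics below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(2)

# --- Offline phase: training on labeled data (measured or synthetic) ---
# two classes of 8-dimensional feature vectors, purely illustrative
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 8))   # label 0: noise only
X1 = rng.normal(loc=1.5, scale=1.0, size=(100, 8))   # label 1: target present
centroids = np.stack([X0.mean(axis=0), X1.mean(axis=0)])

# --- Online phase: unlabeled measurement in, predicted label (decision) out ---
def predict(x):
    """Assign the label of the nearest class centroid."""
    d = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(d))

z = rng.normal(loc=1.5, scale=1.0, size=8)  # a new, unlabeled measurement
print(predict(z))
```

Any of the classifiers of Chapter 2 (SVM, random forest, NN) could replace the centroid rule here; the point is only the split between the training (offline) and prediction (online) phases.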


Within the machine learning umbrella, deep learning based on new-generation NNs has generally shown so far the best performance for these problems, sometimes in combination with more traditional machine learning tools. Radio signals are in fact a somewhat borderline type of data: on the one hand, they obey physical laws that can be modeled in mathematical terms to be used for engineering purposes (including detection/classification); on the other hand, some scenarios are too complex to be captured by an analytical model; hence, data-driven learning may be more attractive. Several machine learning tools have already been applied successfully to radiofrequency data exploitation. Besides communications and radar signal classification (discussed in more detail later in Section 3.2), significant work has already been conducted on deep learning for single and multiple target classification in SAR imagery [4] and for automatic target recognition [2, 5]. The reader is also referred to [6] for more information on these topics. A number of applications of deep learning in radar contexts have also been developed in recent years, in particular using short-range (and medium-range) radars in the industrial, healthcare, and automotive fields. Examples in the latter case are car/pedestrian detection and overtaking assistance [7–10], while in assisted-living and telemedicine applications gait monitoring, gesture recognition, and fall detection enabled by short-range radars are of interest [11–14]. In all such contexts, the use of frequency-modulated continuous-wave (FMCW) radar spectrograms, obtained via time-frequency analysis (e.g., the short-time Fourier transform), makes deep learning tools such as convolutional neural networks (CNNs; see Section 2.4.3.1) directly applicable [15].
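As an illustration of the time-frequency preprocessing mentioned above, the following sketch computes a short-time Fourier transform spectrogram of a simulated return with a sinusoidally varying (micro-)Doppler shift; all signal and window parameters are illustrative assumptions:

```python
import numpy as np

fs = 1000.0                      # sampling rate (Hz), illustrative
t = np.arange(0, 1.0, 1 / fs)

# simulated return with sinusoidal (micro-)Doppler modulation:
# instantaneous frequency 100 + 50*cos(2*pi*2*t) Hz
phase = 2 * np.pi * (100 * t + (50 / (2 * np.pi * 2)) * np.sin(2 * np.pi * 2 * t))
x = np.cos(phase) + 0.1 * np.random.default_rng(0).normal(size=t.size)

# short-time Fourier transform: windowed FFTs over sliding segments
win, hop = 128, 32
frames = [x[i:i + win] * np.hanning(win) for i in range(0, x.size - win, hop)]
S = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # power spectrogram

# each row is a time slice; the dominant bin tracks the Doppler modulation
peak_hz = np.argmax(S, axis=1) * fs / win
print(S.shape, peak_hz.min(), peak_hz.max())
```

The resulting matrix S is an image-like representation (time along one axis, frequency along the other), which is what makes CNN-style tools applicable to radar spectrograms.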
The great advantage is, however, that more fine-grained properties of the targets can be sensed through the radar sensor, for instance by exploiting micro-Doppler signatures [16, 17].2 On the detection side, this opens new possibilities for the detection and classification of targets that escape conventional radar surveillance systems, due to their agility and small radar cross section (e.g., small drones) [19, 21, 22]. Many more applications are possible by relying on the strict relationship between certain radar problems and deep learning techniques developed in the computer vision field. Examples are human-machine interface applications, such as gesture recognition and sensing, human activity classification, air-writing, material classification, vital sensing, people sensing, people counting, people localization, and in-cabin automotive occupancy and smart trunk opening (see [23] for more details). Other approaches (e.g., those discussed in Section 2.4.3.3) have, however, found application as well, for various radar problems are not always strictly linked to image-like processing. For instance, a CNN can also be leveraged to address the antenna selection problem as a multiclass classification task within a cognitive radar framework [24]. Colocated MIMO radars can benefit from machine learning and deep learning in several respects [25], including very recent trends in the use of reinforcement learning for target detection in complex environments, as will be discussed in Section 3.4.2. LSTM can be combined with CNN for clutter parameter estimation [26], radar waveform and antenna array design, and radar jamming/clutter recognition [3]. Unsupervised methods such as autoencoders also have great potential for target detection [27] and recognition [28].

2. It is worth noting the importance of the micro-Doppler, which is a peculiarity of radar sensors, as an enabler for several of the applications mentioned above. Not only are micro-Doppler signatures very informative with regard to the sensed targets; an additional advantage compared to video solutions is that radars are less sensitive to weather and light conditions and can enable unobtrusive applications that avoid the privacy concerns of video-based monitoring (again for surveillance, remote sensing, telehealth, etc.). The reader is referred to [18] for a comprehensive discussion on micro-Doppler radars and their applications. In addition, besides the abovementioned topic of deep neural networks for radar micro-Doppler signature classification, sparsity-driven methods, (L1-norm) principal component, and discriminant analysis techniques (see Section 2.3.2) are also discussed there for application contexts ranging from indoor human activity recognition to automotive, surveillance, assisted living and contactless healthcare sensing, and small-drone vs bird target signature classification [19, 20]. Multistatic as well as passive radar approaches are also illustrated.

3.2 Classification of Communication and Radar Signals

Before delving into the details of the most important data-driven tools specifically applied to the problem of radar detection, as an intermediate step we now discuss the problem of signal classification. Signal classification is a term used to identify a broad class of similar but actually different problems. It is commonly adopted in both the communications and radar fields and concerns the identification of the type of radio signal(s) possibly present in the environment. The perspective is that of a receiver, which aims at gaining awareness about the status of the radiofrequency spectrum in its surroundings. While the topic is not new, it has been undergoing a revolution with a progressive shift toward machine learning-oriented approaches, and in particular deep learning-based solutions have recently been investigated.

3.2.1 Automatic Modulation Recognition and Physical-Layer Applications

With regard to communications, automatic recognition of communication signals is of great interest in several civilian and military applications, as part of the tasks of an intelligent or cognitive receiver, for the sake of spectrum awareness (e.g., spectrum sensing), adaptive transmissions (e.g., adaptive modulation and coding), and interference avoidance (e.g., carrier sense multiple access with collision avoidance3).

3. CSMA/CA is implemented, for instance, in commonly used communication standards such as IEEE 802.11 (Wi-Fi) and IEEE 802.15.4 (including Zigbee).

Radar Applications of Machine Learning

This decision problem in its generality is more difficult than plain modulation decoding (see Section 2.1.1), since no information is usually available regarding the transmitted data, signal power, carrier frequency, timing information, and channel effects (multipath, frequency-selective, and time-varying fading), which are particularly challenging in real-world scenarios [29]. In some contexts partial information is available regarding the possible classes of transmitted signals. Automatic modulation classification, in particular, aims at blindly identifying the modulation type of an incoming signal at the receiver in wireless communication systems. Traditional approaches are either model-based, namely based on likelihood functions and hypothesis testing (including the GLRT; see Section 1.3.2), or rely on selected features of the received signal fed to a classifier [30] (engineered features; see Section 2.4.2). The advent of data-driven classifiers based on conventional machine learning tools has introduced a first performance enhancement, with support vector machines (see Section 2.3.3), neural networks (see Section 2.4.1), decision trees, and ensemble methods (boosting and bagging; see Section 2.3.4) as prominent examples of effective approaches [31]. More recently, however, deep learning has especially attracted attention due to its superiority in feature extraction (learned features; see Section 2.4.2) and classification accuracy [32]. CNNs, in particular, have been applied with success to complex-valued temporal radio signals, quickly becoming a strong candidate approach for blind learning, especially at low signal-to-noise ratio, and outperforming expert-feature-based methods [29]. This success can be explained by considering that the main building block of most radio receivers is a matched filter, which is convolved with the incoming time signal and returns peaks as the filter slides over the correct symbol time in the received signal.
This convolution is known to maximize the SNR by averaging out the noise, and additional expert-designed processing is performed through estimators derived analytically (for a specific modulation and channel model) to compensate for transmitter-receiver and channel impairments. Thus, the intuition behind CNNs is that they learn to form matched filters for numerous temporal features, each of which has some filter gain allowing operation at lower SNR, and which taken together can represent a robust basis for classification [29]. Besides CNNs, however, other neural network architectures have also been investigated, including recurrent neural networks and LSTM [33], meaning that the impressive performance of such approaches cannot be fully explained by the affinity of convolutional layers with a flexible matched filter.4 Furthermore, an important aspect to be considered in modulation classification is the choice of the most suitable preprocessing of the received signal, so that it is represented in a proper format (features, images, sequences, or a combination thereof) before being fed into a deep neural network [32].

4. We will return to the general issue of theoretical understanding of deep learning in Chapter 5.
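The matched-filter intuition above can be made concrete with a toy NumPy sketch; the 8-chip template, delay, and noise level are all invented for the example. Correlating the received signal with the known template produces a peak at the correct alignment, where the coherent sum maximizes the output SNR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known template the receiver is matched to (an invented 8-chip code).
pulse = np.array([1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0])

# Received signal: the template buried in white noise at an unknown delay.
delay = 37
x = 0.3 * rng.standard_normal(128)
x[delay:delay + len(pulse)] += pulse

# Matched filtering = correlating with the template (convolution with its
# time-reversed conjugate); the output peaks where the filter aligns with
# the pulse, since the coherent sum averages out the noise.
y = np.correlate(x, pulse, mode="valid")
peak = int(np.argmax(np.abs(y)))
print(peak)  # at (or very near) the true delay of 37
```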


On the other hand, neural networks and particularly deep learning are very general tools, and several applications to the physical layer have already been identified (see [34–36]). Channel modeling is a prominent example of a traditional topic where models play a major role by providing simplified analytic expressions that are well known and widely adopted to model certain propagation/disturbance conditions, such as additive white Gaussian noise (AWGN), Rayleigh fading channels, or other compact parametric models. However, deep neural networks have been shown to produce an accurate representation of channel effects [37], due to their excellent approximation performance (see Sections 2.1.4 and 2.4.1). Furthermore, by interpreting a communications system as an autoencoder, communications system design can be seen as an end-to-end reconstruction task that seeks to jointly optimize transmitter and receiver components in a single process [38] (a similar approach can be attempted for the joint design of radar waveform and detector, as will be discussed in Section 3.4.2).

3.2.2 Datasets and Experimentation

3.2.2.1 Data Collection

The basic prerequisite of any supervised data-driven approach is the availability of a sufficiently large dataset, which also needs to be accurately annotated in the most widely used approaches of supervised machine learning. Especially in deep learning, a huge amount of data is required, not only for training but also for validation and testing. This is generally an issue, and it is much more challenging in the case of radio data, which are far harder to collect. While speech and images/videos can be easily recorded through inexpensive and widespread hardware such as microphones and cameras, the same does not hold true for communications and radar systems. Hardware for the transmission of radio signals is generally expensive and needs expertise to be operated, in addition to a spectrum license if experimentation takes place outside the industrial, scientific, and medical (ISM) bands. Using opportunistic signal sources from legacy systems is an option, although it reduces the experimentation possibilities since the source characteristics cannot be controlled. An additional problem with RF signal collection is that raw output data (time samples) are not always available from the hardware; hence, one needs to build one's own receiver or obtain an evaluation board capable of providing raw data. In any case, performing RF measurements is significantly more difficult than taking a picture or recording a video, produces much larger data volumes due to high-rate sampling, and the data are complex-valued (I/Q samples) owing to the importance of both amplitude and phase in the subsequent processing. Such issues are even more severe in the case of radars, as many commercially available receivers have limited capabilities; rather sophisticated and expensive systems are required instead, whose availability for data collection and experimentation is very limited.


3.2.2.2 Software-Defined Radio

In recent years, research and experimentation in the topics presented above, as well as in those discussed in the following, have been boosted by the increasing availability of SDR equipment, which enables over-the-air measurements with reduced effort compared to specialized measurement instrumentation. Using SDR communication boards, for instance, the effects of carrier frequency offset, symbol rate, and multipath fading can be assessed not only by simulation but also with real data, thereby making data-driven approaches developed for other machine learning applications applicable to radio signal classification [31]. Examples of widely adopted equipment include the Universal Software Radio Peripheral (USRP) and HackRF, but also the simple and cheap (receive-only) RTL-SDR dongles already introduced in Section 2.1.2.

3.2.2.3 Unbalancedness and Biasedness

There are several issues in building a good dataset. Again, as anticipated at the very beginning of this Section 3.2.2, expert knowledge is important for radar: for instance, calibration significantly impacts the quality of radar data and requires expert knowledge and competence. Additionally, two important general issues in building a training dataset are its possible unbalancedness and biasedness. As the performance of any machine learning algorithm is strictly dependent on the quality of the training dataset, such issues need to be taken seriously into account. A dataset is balanced if the examples belonging to different classes are present in approximately the same amount; in other words, no class is underrepresented (see Section 2.2.1). If the dataset is unbalanced, the algorithm may unevenly learn the characteristics of the classes, producing biased results. The issue is that in some applications one (or more) of the classes may be intrinsically rare, making dataset balancedness a challenging goal that calls for alternative approaches, as we will discuss in Section 3.4.1. If the designer has control over the data collection for the different classes, unbalancedness can be reasonably mitigated. Note, however, that unbalancing to some extent the amount of data from different classes in a training dataset is also a way to promote certain properties of the resulting detector, in an attempt to obtain a desired behavior. Dataset biasedness (which should not be confused with the bias of a classifier related to underfitting; see Section 2.2.2) is conversely a much more insidious issue. Several factors during the collection process, in fact, impact the characteristics of the data. The ultimate effect will clearly differ depending on whether a single acquisition is performed or the dataset consists of multiple acquisitions, possibly with different measurement hardware, sources, and operating environments.
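When rebalancing the collection itself is not possible, one common mitigation (a swapped-in technique, not discussed per se in the text) is to reweight the training loss by inverse class frequency. A minimal sketch, with an invented label vector in which the rare class makes up 5% of the data:

```python
import numpy as np

# Invented labels: the "target present" class (1) is intrinsically rare.
labels = np.array([0] * 950 + [1] * 50)

# Inverse-frequency class weights: with these weights, each class
# contributes equally to the training loss despite the imbalance.
classes, counts = np.unique(labels, return_counts=True)
weights = len(labels) / (len(classes) * counts)
print(dict(zip(classes.tolist(), np.round(weights, 2).tolist())))
# {0: 0.53, 1: 10.0}
```

The same weights are what, for example, scikit-learn computes with `class_weight="balanced"`.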
On the one hand, it is desirable to have enough diversity in the dataset, in order to represent the large set of conditions the algorithm will operate in, without binding the solution to the unique characteristics found during the dataset building process. On the other hand, when mixing different sets of data, some of the factors of each set may correlate with characteristics of the acquired data, in the worst case appearing systematically associated with a certain characteristic, which thus becomes a confounding factor for the learning process.5 Exploratory analysis of the dataset, split into partitions based on the considered source, environmental conditions, or other factors, might reveal the bias; in more subtle cases, it is necessary to compare the results of different training runs using the split datasets. To this end, a controlled cross-validation approach can be adopted by partitioning the dataset according to one or more variables (hence not randomly, as in the conventional (uncontrolled) cross-validation discussed in Section 2.2.1; see also [40]), while paying attention that classes do not become unbalanced due to the splitting. If the results differ significantly, this may indicate that a bias is influencing the learning process. A major difficulty in building a representative dataset for training (and validation/testing) is that it is very hard, and often almost impossible, to sample all the possible situations that can be found in operational environments. While this is always true in engineering, and model-based approaches are not immune from it, major criticalities may emerge when the algorithm relies exclusively on data for building its representation of reality, or worldview. While such a point of philosophical discussion would bring us far from the main road here (a more detailed discussion is included in Chapter 5), its practical consequences should not be disregarded, as this is one of the main reasons for considering the generalization capability on unseen data as the main performance metric of data-driven methods.
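The controlled cross-validation idea can be sketched as follows; the session tags and sample indices are invented for the example. Entire acquisition sessions are held out in turn, so that training and validation folds never share the same acquisition conditions:

```python
import numpy as np

# Invented metadata: each sample is tagged with the acquisition session
# (hardware, source, environment) it came from.
sessions = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
samples = np.arange(len(sessions))  # stand-in for the actual feature vectors

# Controlled cross-validation: hold out one whole session per fold.
# Markedly different scores across folds can reveal a session-related bias.
folds = []
for held_out in np.unique(sessions):
    train = samples[sessions != held_out]
    valid = samples[sessions == held_out]
    folds.append((train, valid))
    print(f"fold {held_out}: train={train.tolist()} valid={valid.tolist()}")
```

The same grouped splitting is provided, for instance, by scikit-learn's `GroupKFold`; one should still check that class proportions remain acceptable within each fold.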
And, besides the risk of underrepresenting situations of interest, there is at the same time the opposite risk of making the training dataset too heterogeneous, which tends to weaken the discrimination capabilities due to the possible confounding overlap between intraclass (within-class) and interclass (between-class) variability. We will return to this point in Section 3.3.5. Nonetheless, models can be partially reintegrated in hybrid approaches (discussed in Chapter 4) to try to provide some general, data-independent knowledge of reality in combination with the expressiveness of machine learning algorithms.

3.2.2.4 Dataset Annotation and Augmentation

An additional problem is dataset annotation. Data can generally be labeled at different aggregation and processing levels depending on the application, but radio data, from either communication or radar systems, needs to be annotated directly on the raw samples, which requires full control of the experimental setup and awareness of the environmental conditions. Unfortunately, annotation is a very time-consuming task, and its complexity and cost further increase in the case of radio data. In fact, not only are the latter more difficult to collect, but humans are not experienced in visually understanding their content. As a consequence, the availability of physical-layer data is usually much scarcer than that of multimedia data. Additional sensors, in particular cameras, may be used as ground truth for the annotation of raw radio data. In the case of radar, additional complications arise since clutter cannot be controlled and disturbing targets may be present on the scene; finally, for the specific problem of target detection, additional equipment is needed to know the position of the intended moving targets (ground truth) at precise time instants commensurate with the high-speed sampling of a radar system. A common solution to enlarge and also enrich a dataset, especially where deep learning approaches are concerned, is to perform some form of data augmentation (Section 2.4.3.2). This consists of artificially extending the dataset by suitable processing, which produces synthetic data as variations (transformations) of the original dataset. In the case of radio data, simulators can also be used to this end, or even as a full replacement for feeding data-driven algorithms when real data are unavailable. To obtain realistic training datasets in the case of radar, high-fidelity modeling and simulation tools need to be used, which are able to capture much of the real-world physics that gives rise to nonstationary, heterogeneous, and time-varying radio environments, including heterogeneous clutter, dense background targets, and intentional/unintentional interference [41].

5. There is a famous case of a neural network trained to distinguish between wolves and huskies, which performed very well except for some unexplainably gross mistakes on certain images. The reason was that the network had actually learned to distinguish the two classes based on whether there was snow in the image background: as it happened, all images of wolves in the training set had snow while those of huskies did not [39]. This opens the big question of interpretability, which together with other issues will be covered in Chapter 5.
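A minimal sketch of data augmentation for raw I/Q data follows. The specific transformations (random carrier-phase rotation, circular time shift, small additive noise) are typical label-preserving choices for radio signals, but the appropriate set is application-dependent; the snapshot and all parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def augment_iq(x, noise_scale=0.05):
    """Return one synthetic variation of a complex (I/Q) snapshot:
    random carrier-phase rotation, circular time shift, and a little
    extra complex noise. Each call produces a different variation."""
    phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi))
    shift = int(rng.integers(0, len(x)))
    noise = noise_scale * (rng.standard_normal(len(x))
                           + 1j * rng.standard_normal(len(x)))
    return np.roll(phase * x, shift) + noise

x = np.exp(2j * np.pi * 0.1 * np.arange(256))   # toy complex exponential
batch = np.stack([augment_iq(x) for _ in range(8)])
print(batch.shape)  # (8, 256): eight synthetic variations of one snapshot
```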
Finally, as mentioned, Chapter 4 will illustrate possible hybrid approaches, which try to combine elements of the model-based approach (to gain control over some properties of the solution and partially circumvent the need for large datasets of balanced, bias-free real data) with those of the data-driven one.

3.2.2.5 Testing Performance and Properties

To assess the performance of data-driven detectors, the generalization capability is a main indicator, which for a radar detector means a high PD for a given PFA, for certain types of targets and SNR values. On the other hand, it is also important to evaluate the properties of the designed data-driven solutions, for which theoretical guarantees are much weaker than for model-based detectors. For testing, both synthetic (simulated) and real data are needed to properly assess the performance: on the one hand, in a simulation the ground truth is known and all parameters can be controlled, so the performance can be assessed in a surgical way; on the other hand, although accurate simulators exist even for complex communication scenarios, only real data intrinsically include all the nonidealities found in operational environments.
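Simulation-based assessment of the (PFA, PD) pair can be sketched as a simple Monte Carlo experiment; the detector here is the plain energy detector of Section 1.2.1, and the signal level and target PFA are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 16, 200_000

# Energy detector in white Gaussian noise: set the threshold empirically so
# that the simulated PFA matches a target value, then estimate PD for an
# assumed signal level.
energy_h0 = np.sum(rng.standard_normal((trials, N)) ** 2, axis=1)
thr = np.quantile(energy_h0, 1.0 - 1e-2)        # threshold for PFA = 1e-2

amp = 0.8                                        # assumed per-sample amplitude
energy_h1 = np.sum((rng.standard_normal((trials, N)) + amp) ** 2, axis=1)
pfa = float(np.mean(energy_h0 > thr))
pd = float(np.mean(energy_h1 > thr))
print(f"PFA ~= {pfa:.4f}, PD ~= {pd:.3f}")
```

With ground truth known by construction, both metrics can be estimated to arbitrary accuracy by increasing the number of trials; real data then reveal how much of this simulated performance survives the nonidealities.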


Still, unfortunately, the availability of radio datasets is far from that of the speech, image, and video data that triggered the deep learning hype.6 For beginners, simulation through a dedicated MATLAB® toolbox is a starting option,7 but simple experimentation can be done with the most accessible SDR platform, the RTL-SDR (with the associated MATLAB add-on or other software, e.g., GNU Radio, SDRSharp; see [44] and https://www.rtl-sdr.com). For instance, considering the problem of blind classification of communication signals (which is thus more general than modulation recognition), Figure 3.2 can be very easily obtained by means of a sub-$20 RTL-SDR dongle, as noted in Section 2.1.2, and is very instructive for assessing the descriptive capabilities of three features8: the energy (see Section 1.2.1), the maximum-minimum eigenvalue ratio λmax/λmin (1.24), and the alternative eigenvalue-based statistic T1/T2 [45]. If the goal is to separate H0 signals from H1 ones, a linear classifier seems a suitable solution, since the two clusters (each including, in turn, two different signal-type subclusters) appear linearly separable. Indeed, although deep learning approaches might further improve the performance, the use of SVM is satisfactory for this problem, in particular with increasing order of polynomial kernels, also for separating the no-signal cluster from the any-signal one of interest in spectrum sensing [46] (see Section 2.3.3 for SVM and the kernel trick). However, in any case the resulting data-driven detector offers no guarantees on either PFA or the CFAR property; therefore, while the performance can be satisfactory, other characteristics of the solution may not be. As a matter of fact, despite the similarity of signal classification and spectrum sensing with radar detection, data-driven tools cannot be directly applied to the latter; indeed, as discussed in Section 2.1.3, adaptive radar detection is a peculiar problem, sharing similarities but also showing important differences with respect to other decision or classification problems. Again referring to the example of Figure 3.2, a spectrum-sensing statistic based on plain energy is sensitive to the noise power σ2 (see Section 1.2.1) as well as to the transmitted power and the distance between transmitter and receiver; hence, the resulting classifier will not be CFAR. Moreover, a missed detection of a weak signal is generally less problematic for spectrum sensing, since the signal will not be harmful to the communications (it is a weak interference). Conversely, in a radar detection problem weak targets may represent possible obstacles or threats that are harder to detect due, for example, to long distance or weak reflectivity. This is one of the reasons for adopting the Neyman–Pearson (see Section 1.2.2) or the related GLRT (see Section 1.3.2) approaches as optimization criteria, and for putting great emphasis on the CFAR property (see again the discussion in Section 2.1.3). Therefore, although signal classification lies at the intersection between communications and radar, it requires dedicated methodologies when applied to either field, and adaptive detection is a different problem still.

Figure 3.2 Representation of different signals in 3-D feature space.

6. A small corpus of synthetic and real data used in [31] can be downloaded from https://www.deepsig.ai/datasets. Data for radiofrequency fingerprinting experimentation, collected via USRP SDRs, is freely available at www.genesys-lab.org/oracle [42, 43].
7. See in particular https://it.mathworks.com/help/comm/ug/modulation-classification-with-deep-learning.html.
8. See Figure 2.10.
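Two of the features above can be sketched in a few lines of NumPy; the covariance construction from overlapping subvectors is one common choice (details differ across works, so treat this as an illustrative sketch with invented signal parameters rather than the exact statistics of (1.24) and [45]).

```python
import numpy as np

rng = np.random.default_rng(3)

def features(x, m=4):
    """Energy and max-min eigenvalue ratio of an m x m sample covariance
    built from overlapping subvectors of the complex snapshot x."""
    energy = float(np.sum(np.abs(x) ** 2))
    segs = np.stack([x[i:i + m] for i in range(len(x) - m + 1)])
    R = (segs.conj().T @ segs) / len(segs)       # sample covariance matrix
    eig = np.linalg.eigvalsh(R)                  # real, ascending order
    return energy, float(eig[-1] / eig[0])

n = 256
noise = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
tone = np.exp(2j * np.pi * 0.12 * np.arange(n))  # correlated "signal"

e0, r0 = features(noise)          # H0: white noise, ratio close to 1
e1, r1 = features(noise + tone)   # H1: signal present, ratio grows
print(r0 < r1, e0 < e1)
```

The eigenvalue ratio grows under H1 because a correlated signal adds a dominant rank-one component to the covariance, while under white noise all eigenvalues stay close to the noise power; this is what makes the feature informative independently of absolute signal scaling.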
From a mathematical modeling perspective, real-world radar data also usually depart from the assumption of white noise with no or limited knowledge of the structure of the useful signal, which was instead adopted in the spectrum sensing case above. We will see in the following section which techniques have proven suitable for the classification of radar signals and radiation sources, and how the feature space needs to be modified. Adaptive detection based on supervised learning is instead thoroughly covered in Section 3.3.

3.2.3 Classification of Radar Signals and Radiation Sources

A number of different signal classification problems are of interest in radar systems. Especially in the military field, passive systems that receive emissions from various platforms are adopted to gain information about the characteristics of the surrounding environment: for instance, classification of the type of radar in the scene (civil, military, surveillance, fire control), or identification of the specific parameters of the radiation source, known as specific emitter identification (SEI). Such radar emitter source classification and recognition includes denoising and deinterleaving (or separation) of the collected pulse streams, dealing with missing and spurious data, possibly in real time, and other aspects. Also in this case, traditional methods are based on domain knowledge from expert designers, exploiting statistical modeling and signal processing techniques [47].9 However, the growing complexity of the electromagnetic environment as well as of radar waveforms and systems challenges the recognition performance of such approaches, which again motivates the appeal of data-driven methods. Both traditional machine learning and deep learning approaches have been successfully applied to radar emitter source classification: the former include SVM, KNN, decision trees, random forests, NNs, and many more; deep NN architectures, which have the additional advantage of automatic feature extraction (overcoming the limitation of expert knowledge for feature engineering), are typically based on (1-D or 2-D) CNNs, but also recurrent NNs (RNNs) and autoencoders [47].10 However, data sampling limitations and computational cost are still severe challenges in real recognition scenarios; hence, a network based on meta-transfer learning that can operate on fewer samples has been proposed in [52]. The case of SEI, where the goal is to uniquely identify an individual emitter from the same class of radars by the individual properties that arise from its hardware imperfections, is more challenging due to the subtle differences in the properties of the signals to be classified. Joint radar function identification, model identification, and SEI has been proposed for improved performance based on radiofrequency features extracted from time-domain transient signals. These include the duration, maximum derivative, skewness, kurtosis, mean, variance, fractal dimension, Shannon entropy, and polynomial coefficients of the normalized energy trajectory of a transient signal, as well as the area under the trajectory curve [53]. More recent approaches for radar emitter source classification are focusing on deep learning solutions; with new radar types arising and the electromagnetic environment becoming more complicated, this has become a very challenging problem.
In [54], high-dimensional sequences of equal length are generated as new features, and a 1-D CNN is then used to classify them; a large and complex radar emitter dataset is used to evaluate the performance. In [55], instead, 2-D CNNs are adopted on the time-frequency image of the signal obtained via the STFT. The Hough transform is used to detect pulses in such images, with each pulse represented by a single line; then, CNNs are used for pulse classification. In doing so, an end-to-end framework for automatic radar pulse detection and intrapulse modulation classification is obtained. As for the case of communication signal and modulation classification, beginners may start practicing with radar waveform classification through a dedicated MATLAB toolbox.11 A few well-known real radar datasets are also available, in particular land clutter data from the Phase One radar of the MIT Lincoln Laboratory [56] and lake clutter data from the McMaster IPIX radar.12 To cope with such data, however, proper features need to be selected. For instance, the feature space used for spectrum sensing and communication signal classification in Section 3.2.2 (and Figure 3.2) is not suitable for colored (possibly non-Gaussian) and heterogeneous noise, which is, however, the more typical case in real-world radar data, as will be shown in Section 3.3.5. Moreover, a serious issue is how to control PFA. The next section will discuss different approaches for the design of data-driven detectors specifically tailored to the adaptive radar detection problem.

9. For instance, waveform classification into eight classes can be performed to distinguish among linear frequency modulation (LFM), discrete frequency codes (Costas codes), binary phase, and Frank, P1, P2, P3, and P4 polyphase codes, by considering a supervised classification system with features extracted from the collected radar signal. A large set of engineered features is investigated in [48], including those based on Wigner and Choi-Williams time-frequency distributions. To discard redundant features, an information-theoretic feature selection algorithm is adopted.
10. Radar signal classification and source identification using NN, SVM, and RF have been investigated in [49, 50] using a large dataset of pulse train characteristics such as signal frequencies, type of modulation, pulse repetition intervals, scanning type, and scan period, represented as a mixture of continuous, discrete, and categorical data. Discriminant analysis has been used instead in [51].

3.3 Detection Based on Supervised Machine Learning

Traditional radar target detection in complex scenarios must cope with, in addition to thermal noise, clutter and possibly interference and/or jamming signals. Such disturbances render the noise colored and must be removed or significantly suppressed through filtering prior to target detection, in order to have a large enough signal-to-interference-plus-noise ratio (SINR) for reliable decisions on the possible presence of a target echo. MTI radars use a filter to remove stationary clutter around zero Doppler frequency. Pulse-Doppler processing in airborne radars has the effect of a stopband filter that removes ground clutter from specific Doppler frequency bands, while the STAP approach uses space-time adaptive filtering for this purpose. In any case, estimation of the characteristics of the disturbance, in particular its covariance matrix, is needed. Machine learning tools aim instead at recognizing and detecting targets based on distinguishable features of target vs disturbance, without removing the latter from the radar receiver. In this respect, they have the potential to be less sensitive to the issue of inaccurate disturbance estimation, which may cause significant processing loss in traditional adaptive approaches [57]. On the other hand, they are informed only by the data in the training set, whose selection plays a fundamental role in shaping the behavior of the resulting detector.

11. See in particular https://it.mathworks.com/help/phased/ug/modulation-classification-of-radar-and-communication-waveforms-using-deep-learning.html.
12. http://soma.mcmaster.ca/ipix.php.


Early methods investigated for data-driven radar detection are based on engineered features, exploiting the fact that radar targets are typically concentrated (or, conversely, extended) in some domain (time, space, Doppler frequency) in a way that differs from clutter and jammers. Simple approaches can be adopted based on feature-vector-distance detectors or decision trees [57], but SVM and KNN are also very appealing techniques. More recently, neural networks and deep learning approaches have been attempted as well. In the following we will discuss the most effective techniques, with particular regard to those that allow one to control PFA and, moreover, to be adaptive to colored and/or nonhomogeneous (possibly non-Gaussian) disturbance.

3.3.1 SVM-Based Detection with Controlled PFA

As discussed in Section 2.1.3, radar detection is not just binary classification, and it is of utmost importance to have control over the value of PFA produced by the detector. In classical detection theory, this is a design parameter set according to a maximum tolerable level; the design goal is then typically to maximize the probability of detection (PD), but the actual value of this performance metric will ultimately depend on the chosen PFA, based on which the detector threshold is set. This is a simple way to control PFA, since it only requires raising or lowering the threshold to fulfill the constraint. Conversely, in a typical machine learning approach, once the feature vector is computed for a given training set (or automatically learned via a fully data-driven method, namely a deep NN), the algorithm parameters are frozen (kept fixed) and should be considered an integral part of the detection statistic. This will directly yield a (PFA, PD) pair, with no control over either metric, irrespective of the delivered performance.
There is no general fix for this issue, and solutions must be sought on a case-by-case basis. What is needed is an equivalent of threshold setting for data-driven approaches. In the case of SVM, the optimal weights and bias uniquely identify the separating hyperplane used to classify new data; consequently, as mentioned, PD and PFA are determined only by the specific training set, and hence cannot be easily tuned. However, thanks to the intrinsic geometric meaning of the decision statistic as a hyperplane in a multidimensional space, a simple workaround is to consider the sheaf of (hyper)planes parallel to the optimal hyperplane. This is tantamount to modifying the bias (the affine term in the hyperplane equation) after the training process (i.e., keeping only the learned weights and adjusting the bias). With this heuristic bias-shifting procedure, one can expect that shifting the hyperplane toward the H1 region (away from the origin) will decrease PFA, while moving it toward the H0 region (close to the origin) will increase it. Clearly, PD will simultaneously decrease or increase as well,


Radar Applications of Machine Learning


respectively, as is well known from detection theory (see Section 1.2.2), but control over PFA is retrieved as in model-based detection (i.e., a procedure equivalent to threshold setting is found).

An alternative approach is to introduce additional parameters to control the relative weight given to each class. In [58] it is shown that, as one might expect, introducing an additional parameter in the cost function can outperform shifting of the offset parameter (bias) after training, but optimizing over this additional parameter significantly increases the training time. Improvements are proposed to reduce the complexity as well as the variance of the error estimates, which are crucial for NP classification. Two parameters β0 and β1 are used in [59] to weight the H0- and H1-related costs, respectively, and this approach is demonstrated on IPIX sea clutter data. Increasing β0 reduces the false alarm rate for a given β1, since the resulting hyperplane tilts toward the target signals and thus fewer sea clutter signals are misclassified. Conversely, enlarging β1 increases the false alarm rate for a given β0, as the hyperplane then leans toward the sea clutter signals. A large-margin classifier that directly maximizes PD at a desired PFA, via the inclusion of an optimization constraint on the estimated PFA, is discussed in [60]. Unlike NP or minimax classifiers (see Section 3.3.3), this approach can adopt an arbitrary loss function and is applicable to datasets that exhibit a high degree of class imbalance. It has been demonstrated on real sonar data from the Malta Plateau Clutter Track Database [61].
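Returning to the bias-shifting workaround, a minimal numerical sketch is the following (synthetic data and a fixed stand-in weight vector, not the setup of the cited works): the learned weights are frozen, and the bias is recomputed as an empirical quantile of the H0 scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained linear SVM: the weights w are assumed already
# learned and are kept frozen; only the bias b is adjusted.
w = np.array([1.0, 1.0])

# H0 (noise-only) calibration data and the desired false alarm rate.
x_h0 = rng.normal(0.0, 1.0, size=(100_000, 2))
scores_h0 = x_h0 @ w
p_fa = 1e-2

# Bias-shifting: place the hyperplane so that only a fraction p_fa of
# the H0 scores falls on the H1 side (equivalent to threshold setting).
b = -np.quantile(scores_h0, 1.0 - p_fa)

empirical_pfa = np.mean(scores_h0 + b > 0)
print(f"empirical P_FA = {empirical_pfa:.4f}")
```

Shifting b further toward the H1 region lowers both PFA and PD, mirroring the usual threshold trade-off of Section 1.2.2.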
A similar idea in a different context is the asymmetric extension of the SVM proposed in [62], which employs an objective function modeling the imbalance between the costs of false predictions from the two classes, such that a tolerance on PFA can be explicitly specified without much degradation of the prediction accuracy or increase in training time. Bias-shifting has been applied in [46] for SVM-based spectrum sensing, which is the communication signal classification example of Section 3.2.2 (and Figure 3.2). This basic SVM-based detector, however, cannot be directly applied to the problem of adaptive radar detection. In particular, it is not CFAR, since H0 data exhibit heterogeneous noise powers and plain energy is not an adaptive statistic; in contrast, the CFAR property is of fundamental importance in radar. Ways to address this issue are discussed in Sections 3.3.4 and 3.3.5.

3.3.2 Decision Tree-Based Detection with Controlled PFA

Decision trees (see Section 2.3.4) are a simple yet powerful binary classification method that also has the advantage of better interpretability compared to other machine learning tools. Direct application of this approach to radar detection, however, cannot meet the typical requirement of working at a chosen false alarm rate. As for the SVM case discussed in Section 3.3.1, it is necessary to modify the algorithm to account for the PFA constraint.


A decision tree is constructed from labeled feature vectors using the well-known classification and regression tree (CART) algorithm. It works by starting from the root and successively partitioning the dataset into child nodes using a quality indicator that increases node purity (e.g., minimum Gini index). This procedure is repeated at each child node until the size of the nodes falls below the minimum leaf size TLS, which is the number of data points (observations) in the final nodes of the tree (leaves). If the tree is overgrown, as happens when TLS is set too small, overfitting will occur; conversely, a too-large TLS may lead to poor generalization performance. An optimal TLS minimizes the cross-validated error of the detector. Thus, a possible way to control PFA in a decision tree stems from the observation that increasing or decreasing TLS away from its optimum will increase the misclassification rate (since, by definition, the cross-validated error is larger at any point other than the minimum) and hence will change the value of PFA [63]. More in detail, the approach proposed in [63] foresees two stages. In the first stage, a tree is constructed by the CART algorithm; then the optimal TLS is searched for by simulated annealing, such that the cross-validated error is minimized. The optimal value of TLS is used to update the tree and calculate the minimum achievable false alarm rate, namely the fraction of misclassified data out of the total number of data in the training set. If this minimum achievable rate is larger than the required PFA, the problem is unfeasible (i.e., the PFA requirement cannot be satisfied). Otherwise, a second algorithmic stage is run, iteratively incrementing TLS (by a small step size) and recomputing the (empirical) false alarm rate on the updated decision tree.
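The second stage can be sketched as follows, where `empirical_pfa` is a purely illustrative stand-in for retraining the tree and measuring its false alarm rate on the training set (it is not the CART algorithm of [63]):

```python
def empirical_pfa(tls: int) -> float:
    # Toy stand-in: here the false alarm rate simply grows linearly as
    # T_LS moves past its optimum (in reality it comes from retraining
    # the tree and counting misclassified H0 data).
    return 0.002 * tls

def tune_leaf_size(tls_opt: int, desired_pfa: float,
                   step: int = 1, tol: float = 1e-3,
                   max_iter: int = 10_000) -> int:
    # Feasibility check: the minimum achievable rate (at the optimal
    # T_LS) must not exceed the required P_FA.
    if empirical_pfa(tls_opt) > desired_pfa:
        raise ValueError("infeasible: minimum achievable rate > desired P_FA")
    tls = tls_opt
    for _ in range(max_iter):
        if abs(empirical_pfa(tls) - desired_pfa) < tol:
            return tls
        tls += step
    raise RuntimeError("tolerance not met within max_iter")

print(tune_leaf_size(tls_opt=5, desired_pfa=0.02))
```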
The procedure stops when the difference between the computed (empirical) false alarm rate and the desired PFA value is smaller than a tolerance. Some results on IPIX data are reported in [63].

3.3.3 Revisiting the Neyman–Pearson Approach

There has been some work in different domains trying to adapt principled decision theory, as adopted in model-based detection approaches, to the new data-driven paradigm. The starting point is that binary decision problems involve two types of errors (Type I and Type II), namely PFA and Pmiss = 1 − PD (probability of missing the target). In general, it is desirable to minimize both errors (i.e., the overall probability of error given by PFA + Pmiss). As discussed in Chapter 2, particularly in Section 2.1.2 and Sections 2.3.1–2.3.2, the ideal solution would be to optimize the expected misclassification (Bayes) cost, but this is typically impractical because the involved prior and conditional probabilities are mostly unknown, thus motivating the adoption of data-driven approaches. However, in some applications such as target detection, machine monitoring, or disease diagnosis, one type of error is more important than the other. In this respect,


the NP approach based on the likelihood ratio test is the gold standard, since it provably minimizes Pmiss (i.e., maximizes PD) under a constraint on PFA (see Section 1.2.2) by exploiting the statistical models (pdfs) under the two hypotheses. The corresponding optimization problem is

$$f_{\mathrm{NP}} = \arg \min_{f:\, P(f \geq \eta) \leq P_{FA}} P_{\mathrm{miss}}(f). \tag{3.1}$$

An alternative approach is to solve the minimax (MM) problem:

$$f_{\mathrm{MM}} = \arg \min_f \max\{P_{FA}(f), P_{\mathrm{miss}}(f)\}. \tag{3.2}$$

Both NP and MM can be shown to be obtainable, under mild assumptions, as solutions of the more general problem [64]

$$\min_f \left[\, \delta P_{FA}(f) + (1-\delta) P_{\mathrm{miss}}(f) \,\right] \tag{3.3}$$

for appropriate δ ∈ (0, 1). In this respect, it is possible to train a variant of the SVM for minimax and NP classification using a cost-sensitive SVM, tuning the parameter δ to achieve the desired error constraints [65]. Application of this approach to radar detection has been demonstrated in [66], with the aim of approximating the gold-standard NP detector for noncoherent scenarios, in which the signal model is unstructured (see Section 1.2; compare with Section 1.3). In particular, a comparison is performed for cases in which the NP detector can be analytically formulated, namely the detection of Swerling I-II targets in white Gaussian noise, and of Swerling V targets with unknown Doppler shift in non-Gaussian (K-distributed) clutter. Some results on real data are also shown, with the clutter parameters estimated from target-free cells. The conclusion is that this class of SVM yields a quite good approximation of the NP detector for the unstructured signal model at a given point of the ROC curve. Approximation of the NP detector can also be obtained by means of different learning machines, namely a neural network (multilayer perceptron) [67], and other links with statistical learning can be established [68, 69].

The approach discussed above requires deciding on a convenient definition of the training set under H1. In fact, while data under H0 are important to let the machine learn the latent characteristics of the noise/clutter, it is more difficult to precisely define the H1 dataset. In particular, one must decide which targets to include and at which SNR, possibly considering a mixed training with several types of targets and/or SNRs. While this is already a limitation in practice, since it tailors the detector to the imagined scenarios (and, as discussed in Section 3.2.2, it is difficult to consider all possible situations, or conversely too-heterogeneous data might be included), further difficulties arise in the case


of coherent detection of a structured signal in correlated clutter, namely in STAP contexts. In that case, many unknown parameters, changing with time and space, prevent the adoption of the NP approach even under the model-based paradigm; indeed, we recall that, since the NP approach is inapplicable, the GLRT has become the dominant approach. In the following we discuss other ways to tackle the adaptive radar detection problem through data-driven methods.

3.3.4 SVM and NN for CFAR Processing

When the detection statistic is nonadaptive, the detection threshold is typically adjusted by a CFAR processor. This approach is quite common in many radar applications, since it offers control over the false alarm rate as well as some adaptiveness to the environment, in particular for large SNR. In Section 1.4.1 we have however discussed that this processing is suboptimal, as adaptive detection statistics have the potential to outperform decoupled detection and CFAR processing, and to promote specific properties in the obtained detector (more on this point in Section 3.3.5). Still, the decoupled approach is widely used in many contexts due to its simplicity, modularity, and effectiveness, especially for large-SNR targets. Unfortunately, for low SNR the CFAR processor may set the detection threshold too high, thereby missing targets. An SVM can be adopted to learn more accurately how to discriminate between clutter and target [70]. In the training phase, target and nontarget training data are fed to a linear SVM with weights w (and zero bias): w is optimized to best discriminate target cases from nontarget cases, with the former simulated synthetically. As shown in Figure 3.3, this detector has the same structure as the CA-CFAR, but the data from the leading and lagging windows are used directly to classify the CUT (see Figure 1.5), somewhat like the joint rationale of more advanced model-based detectors (see Section 1.4.2).
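A toy sketch of this scheme follows (synthetic exponential clutter and, for brevity, a crude class-mean linear discriminant in place of a full SVM solver; not the setup of [70]):

```python
import numpy as np

rng = np.random.default_rng(1)
n_ref = 16            # leading + lagging reference cells
snr = 10.0            # synthetic target level (square-law units)

def windows(target: bool, n: int) -> np.ndarray:
    """n feature vectors: square-law amplitudes of CUT + reference cells."""
    cells = rng.exponential(1.0, size=(n, n_ref + 1))  # unit-power clutter
    if target:
        cells[:, n_ref // 2] += snr                    # target in the CUT
    return cells

# "Training": a class-mean difference stands in for the SVM weight
# optimization; the bias is kept at zero as in the scheme above.
x0, x1 = windows(False, 2000), windows(True, 2000)
w = x1.mean(axis=0) - x0.mean(axis=0)

# Classify fresh windows with the linear statistic w^T x.
s0, s1 = windows(False, 5000) @ w, windows(True, 5000) @ w
thr = np.quantile(s0, 0.99)                            # empirical P_FA = 1e-2
pd = np.mean(s1 > thr)
print(f"P_D at P_FA = 1e-2: {pd:.2f}")
```

Unlike the CA-CFAR, the learned w can weight the CUT and reference cells jointly; with real clutter, the periodic retraining discussed below would be needed.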
Figure 3.3  CFAR scheme with SVM-based detection.

An interesting aspect is that the pdf of the decision statistic wᵀx used by the SVM, where x is the feature vector collecting the amplitudes of the data (from square-law envelope detection) in


leading and lagging cells, has clearly disjoint supports under the two hypotheses H0 and H1, meaning that the SVM-induced mapping makes them more distinguishable (some figures are reported in [70]). This approach can work at low SNR, with PD and PFA degrading more gracefully than for the CA-CFAR, which instead stops detecting rather abruptly. The SVM has very low run-time complexity, but training is required beforehand. In favorable environments, that is, where Gaussian noise is dominant, this may be done offline; in more dynamic environments where clutter returns dominate, periodic retraining is required. We will discuss in Section 3.3.5 an alternative approach based on the use of feature spaces guaranteeing the (generalized) CFAR property.

The CA-CFAR suffers performance degradation in nonhomogeneous environments due to clutter edges (transitions from one noise power level to another) or interfering targets. As discussed in Section 1.2.3, several other variants of the CFAR processor have been introduced over the years. Unfortunately, there is no one-size-fits-all solution to this problem, motivating the idea of switching between different processors based on the type of clutter at hand; to this end, data-driven clutter classifiers have been proposed based on SVM [71] but also on NN (multilayer perceptron) [72]. To accommodate the possibility of various environments, the variability index (VI-)CFAR is proposed in [71] as a feature to train the SVM: the threshold is selected according to the classification results of the SVM, achieving low loss in homogeneous backgrounds while being robust to nonhomogeneous environments.13

When moving to more complex radar scenarios, additional issues must be addressed. Although clutter parameters are usually estimated from radar measurements, target parameters are usually unknown.
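Before moving on, it is worth recalling that the baseline CA-CFAR threshold admits a classical closed form in the homogeneous exponential (square-law, Gaussian-envelope) case: with K reference cells and the threshold set to α times the reference-window mean, PFA = (1 + α/K)^(−K). A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(2)

K, p_fa = 16, 1e-2
# Invert P_FA = (1 + alpha/K)^(-K) for the threshold multiplier alpha.
alpha = K * (p_fa ** (-1.0 / K) - 1.0)

# Monte Carlo verification under H0 (noise-only cell under test).
n = 200_000
ref = rng.exponential(1.0, size=(n, K))   # reference window, square law
cut = rng.exponential(1.0, size=n)        # cell under test
pfa_hat = np.mean(cut > alpha * ref.mean(axis=1))
print(f"alpha = {alpha:.2f}, empirical P_FA = {pfa_hat:.4f}")
```

The variants mentioned in footnote 13 (GO-, SO-, OS-, ACCA-CFAR) replace the reference-window mean with other statistics, trading false alarm control against multiple-target robustness.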
A well-established approach is to model them as random variables: the resulting composite hypothesis testing problem can then be solved by using the ALR, which is the NP detector where the unknown target parameter(s) have been averaged out (see Section 1.3.1, in particular the example in (1.28)). Unfortunately, the ALR usually involves integrals without analytical solution, and alternative approaches are proposed, a prominent one being the GLRT, where such parameter(s) are maximized over instead of averaged out (see Section 1.3.2). In the case of unknown Doppler, the use of a Doppler filterbank (MTI) with independent CFAR processors on each branch, whose decisions are finally combined through some logical operation (e.g., OR), is a simpler alternative to the GLRT approach of Section 1.3.4 [73, 74]. Such a scheme, illustrated in Figure 3.4, is also

13. In fact, greatest-of (GO-)CFAR enables better control of PFA in the case of clutter edges, but incurs dramatic PD loss if multiple targets are present; smallest-of (SO-)CFAR exhibits better performance in the latter case, but has higher PFA than CA-CFAR when clutter edges occur. On the other hand, order statistics (OS-)CFAR is a more robust approach, but leads to higher PFA in clutter edges. Finally, automatic censored cell averaging (ACCA-)CFAR detectors are similar to CA-CFAR in homogeneous environments and robust to multiple targets, but exhibit higher PFA than OS-CFAR in clutter edges.


Figure 3.4  Detection scheme based on MTI Doppler processing, square-law detection, CFAR processing, and binary integration.

practically applied under non-Gaussian disturbance, because the clutter is correlated and hence most of it is expected to be rejected by the Doppler filter looking for a particular frequency shift. This, together with the fact that linear filtering tends to transform the input random processes into Gaussian ones according to the central limit theorem (the filter output is a linear combination of inputs), justifies the assumption that the output of each filter is the sum of thermal noise and clutter residuals, matching the hypothesis of independent Gaussian samples underlying conventional CFAR processors [75]. Therefore, incoherent CFAR techniques are applied to estimate the statistics of the clutter residuals plus thermal noise, and to adapt the detection threshold so as to guarantee the desired PFA while coping with the varying power of the samples induced by the residuals. In most cases envelope (square-law) detectors are assumed, since the design of CFAR processors for NP-, ALR-, or GLRT-based approaches is a complicated task for which analytical solutions are typically not available. Indeed, the maximum output of the envelope detectors to be thresholded (compare with Figure 1.8) generates samples that are not expected to be Gaussian, due to the nonlinear maximum function. As a consequence, the assumptions of CFAR processors are violated, which may introduce performance degradation.

The use of NNs can overcome this difficulty in some circumstances. The general idea is to design NN-based CFAR techniques that adapt the detection threshold according to clutter statistics variations, without knowledge of the clutter model [75]. Although this is in principle agnostic to the detection strategy, the problem of detecting a signal with unknown Doppler in colored clutter plus white Gaussian noise has been addressed in particular by adopting the GLRT-based


Doppler processing with the maximum function and a single CFAR processor as final stage [75]. A multilayer perceptron is applied to the output of the maximum function to learn the statistics of the input samples and provide the threshold required to maintain the PFA under variable clutter parameters. No statistical constraint is imposed on the input space of the CFAR processor, thanks to the ability of the NN to approximate functions from input-output pairs. Nonetheless, other NN-based CFAR detection schemes also exist [76–78].

NNs have also been used as a direct approach to radar signal detection in K-distributed [79, 80], Weibull-distributed [81], and other non-Gaussian clutter models [82, 83].14 As a representative example of the data-driven paradigm, NNs have generally attracted interest for signal detection since they may be trained to operate in environments for which the optimum detector is not available. However, training a NN to work at a preassigned PFA is not straightforward. A simplified approach is to train the network for best performance based on the adopted loss function (minimizing the probability of error),15 then to alter its operating point by acting on the bias term of the output neuron (see Section 2.4.1) to approximately meet the PFA constraint. This has an effect similar to bias-shifting of the SVM hyperplane (see Section 3.3.1). An alternative approach is to modify the detection threshold used to compute the error during training, as discussed in [82], where however it is also shown that the simplified bias-shifting approach is sufficient to outperform both the matched filter and locally optimum detectors.

3.3.5 Feature Spaces with (Generalized) CFAR Property

The scenarios discussed so far all adopt the simpler decoupled approach, in which Doppler and CFAR processing are disjoint, while, as previously noted, the optimal adaptive approach to coherent target detection would entail joint processing.
14. For more details on non-Gaussian clutter models and related detectors, see [84, Ch. 7].

15. More in detail, during each training epoch the network is fed with one noise-only and one signal-plus-noise vector and required to produce an output close to zero and one, respectively. The network weights are updated twice during each training epoch based on the corresponding errors, which should direct the NN to converge to a point giving the smallest probability of error for the given training dataset.

In general, in colored noise settings this requires estimating the (time, space, or STAP) covariance matrix of the clutter plus interference and thermal noise, which complicates the problem. In fact, if the training is performed under certain disturbance conditions, the detector will not work properly under a completely different background. This is due to the fact that, although machine learning tools all have some degree of generalization, their worldview is solely shaped by the training set. On the other hand, including too many heterogeneous conditions in the training set will produce an average learning that may lose the adaptive ability to discriminate, for a given scenario at hand, targets from


background. Indeed, the adaptive detection rationale is to focus the detector on revealing what (if anything) deviates from the current disturbance, which is a relative concept: the same statistical variation may be interpreted as physiological, or rather declared a target, according to the background characteristics. This focusing capability is lost with too-heterogeneous training, but at the same time a too-narrow set of conditions is not representative of operational environments and hence increases the false detection rate. Periodic retraining at runtime is needed to counteract this issue, but for many data-driven approaches this is typically computationally intensive and time-consuming. Furthermore, it does not completely solve the issue, since between a change point and the next retraining the performance will severely degrade.

While the problem remains open for general data-driven approaches, especially deep learning (as will be discussed in Section 3.3.6), an alternative approach is to seek special feature spaces that can provide full adaptiveness (i.e., the generalized CFAR property). Let us consider first the decoupled approach and, as an example, take again the spectrum sensing feature space already considered in Section 3.2.2; that is, energy E, ratio of maximum and minimum eigenvalues λmax/λmin, and the T1/T2 statistic. In order to cope with colored (possibly non-Gaussian and heterogeneous) noise, which is typically found in real-world radar contexts, a prewhitening transformation is needed, based on the estimated covariance matrix of the disturbance. Assuming the general correlated model z = s + n, where s and n have covariance matrices $R_s$ and $R_n$, respectively, in a two-step approach $\hat{R}_n$ is estimated from neighboring range cells (secondary data, see Section 1.4.1); then, after prewhitening ($\hat{R}_n^{-1/2} z$), the problem is approximately reduced to detection in white noise (see Chapter 1 for the general radar observation model).
In fact, if $\hat{R}_n \approx R_n$, roughly

$$\mathrm{E}\Big[\underbrace{\hat{R}_n^{-1/2} z \big(\hat{R}_n^{-1/2} z\big)^{\dagger}}_{\tilde{z}\tilde{z}^{\dagger}}\Big] \approx \underbrace{R_n^{-1/2} R_s R_n^{-1/2} + I_N}_{R}. \tag{3.4}$$
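A toy numpy sketch of this two-step procedure follows (synthetic Gaussian disturbance rather than IPIX data, and the plain sample covariance in place of the normalized one of [85]):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, n_obs = 4, 8, 8                 # pulses, secondary cells, observations

def cgauss(cov: np.ndarray, n: int) -> np.ndarray:
    """n rows of zero-mean complex Gaussian vectors with covariance cov."""
    L = np.linalg.cholesky(cov)
    w = rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))
    return (L @ w / np.sqrt(2)).T

Rn = np.array([[0.9 ** abs(i - j) for j in range(N)] for i in range(N)],
              dtype=complex)          # AR(1)-like disturbance covariance

# Step 1: estimate R_n from K secondary (target-free) range cells.
sec = cgauss(Rn, K)
Rhat = sec.conj().T @ sec / K

# Step 2: prewhiten the primary observations with Rhat^{-1/2}.
lam, U = np.linalg.eigh(Rhat)
W = U @ np.diag(lam ** -0.5) @ U.conj().T
zt = (W @ cgauss(Rn, n_obs).T).T      # whitened vectors z~_i (rows)

# Spectrum sensing features from the whitened data.
S = zt.conj().T @ zt / n_obs          # sample covariance of z~
ev = np.linalg.eigvalsh(S)
E = float(np.sum(np.abs(zt) ** 2))    # incoherent (whitened) energy
print({"E": round(E, 2), "eig_ratio": round(float(ev[-1] / ev[0]), 2)})
```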

If $R$ were approximately the same over $N_{obs}$ consecutive observations of $N$-dimensional vectors $z_i$, $i = 1, \ldots, N_{obs}$, this second-order statistic could be estimated via the sample covariance $\frac{1}{N_{obs}} \sum_{i=1}^{N_{obs}} \tilde{z}_i \tilde{z}_i^{\dagger}$, from which the statistics λmax/λmin and T1/T2 are readily obtained, and conventional incoherent integration $E = \sum_{i=1}^{N_{obs}} \|\tilde{z}_i\|^2$ can be performed to compute the (whitened) energy statistic. On IPIX data, since the clutter is heterogeneous, the normalized sample covariance matrix defined in [85] is a suitable choice for $\hat{R}_n$ to gain adaptiveness to the power of the disturbance while decorrelating the $N$ samples. Figure 3.5 shows 1,000 feature vectors obtained by processing N = 4 pulses at each time t and using K = 2N range cells for covariance matrix estimation, with and without a synthetic random (Gaussian) target added. The signal covariance R


Figure 3.5  Spectrum sensing features after adaptive prewhitening, using IPIX data (cell 30 from file D85, polarization VV).

is estimated using Nobs = 8 observations. As can be seen, even though the modeling assumptions may not strictly hold, the two clusters H0 and H1 appear sufficiently separated even after adaptiveness is recovered.

The exercise above is just a first example of possible ways to overcome the lack of CFARness of certain feature spaces through a two-step approach. In general, it is possible to conceive detectors that exploit, instead of the generic spectrum sensing features adopted in this example, other features better suited to capturing the specific characteristics of the data at hand: for instance, sea clutter peculiarities are better captured by features such as relative amplitude, relative Doppler peak height, relative entropy of the Doppler amplitude spectrum, ridge integration of the normalized time-frequency distribution, normalized Hurst exponent, and others [86–89]. However, making such feature vectors fully adaptive is not a trivial task. On the other hand, if one aims at obtaining detectors with a generalized CFAR property comparable to fully adaptive model-based detectors, statistics inspired by the approaches in Chapter 1 can be expected to play a role. While this is discussed in detail in Chapter 4, here we anticipate the basic idea. Considering the structured signal model s = αv, with steering vector v, and joint coherent processing, it is known for instance that Kelly's detector and the AMF are two very good adaptive detectors, with different behavior in terms of selectivity vs robustness. Following the machine learning paradigm, they can


thus be considered as a 2-D feature space capturing two different views of the signal. Such statistics possess the generalized CFAR property with respect to the whole covariance matrix (in colored Gaussian noise); hence the resulting detector will inherit this property no matter what type of classifier is used to design it. Alternatively, it is possible to reparametrize the two detectors in terms of statistics with a more profound theoretical meaning, namely their maximal invariant statistics $(\beta, \tilde{t})$. Figure 3.6 shows an example of colored Gaussian data with the same SNR but different mismatch level cos²θ; the cluster of dots at the bottom corresponds to H0 data (no target). Based on this feature space, several detectors can be conceived. Both approaches, using either well-known CFAR detectors or maximal invariant statistics as features, will be discussed in detail in Chapter 4, where it will also become clear why they should be considered hybrid methods.

3.3.6 Deep Learning Based Detection

We now turn our attention to fully data-driven approaches, which do not rely on engineered features but rather learn them directly from the data. Moving from features handcrafted (engineered) by the domain-expert designer to automatically extracted features is a major paradigm shift, which is understandably not

Figure 3.6  Example of synthetic correlated Gaussian data represented in a maximal invariant feature space.


completely well-received in some communities. Radar is indeed one of those fields in which control and interpretability are key aspects, which calls for some prudence in letting the data speak for itself (we will come back to such aspects in Chapter 5). Indeed, while deep learning is more directly applicable in several radar applications, as discussed in Sections 3.1 and 3.2, for target detection the requirements of control over PFA and CFARness entail serious additional complications. Nonetheless, neural networks and deep learning have the potential to perform target detection very accurately, especially in complex scenarios, so research efforts along this direction are worthwhile.

3.3.6.1 Detection as Image-Like Classification Task

A main trend is to recast the target detection problem as an image-like classification task. In this respect, the use of a CNN architecture is an obvious choice, given its characteristic receptive fields (partially connected layers and convolutional blocks) able to extract local information while sliding over 2-D data. Indeed, radar signal processing is often performed in 2-D domains. Early applications of CNNs to target detection considered time-frequency maps based on the STFT or the Wigner-Ville distribution (WVD) [90]. Since the STFT is complex-valued, the input image needs to be split into two maps, for the real and imaginary parts, respectively. Results on IPIX data [91] show the potential power of CNNs for detection in different types of sea clutter and polarizations; moreover, the use of a normalized WVD image can lead to better detection performance than the STFT, with the CFAR property with respect to the clutter power [92]. This approach can be extended to include moving target detection capabilities [93]. Moreover, an ensemble of NNs, consisting, for example, of a CNN and an autoencoder whose outputs are integrated to reach the final decision, can be used for improved performance [94].

Another type of 2-D map arises in pulse-Doppler radars, where coherent processing foresees matched filtering, Doppler processing, and a CFAR detector: the latter, in fact, takes as input the (2-D) range-Doppler map, where a target possibly present in one or more neighboring cells emerges as a narrow peak or a more extended mountain-like shape. While the CFAR detector processes the range-Doppler spectrum cell by cell, a CNN-based detector operates in a sliding-window manner over the range-Doppler map (with window size slightly larger than the expected target extent) [95–97].
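The sliding-window evaluation can be sketched as follows; here `score` is a placeholder energy statistic standing in for a trained CNN, and the map is synthetic:

```python
import numpy as np

def sliding_windows(rd_map: np.ndarray, win: int, step: int = 1):
    """Yield (row, col, patch) for every win x win patch of the map."""
    n_r, n_d = rd_map.shape
    for r in range(0, n_r - win + 1, step):
        for d in range(0, n_d - win + 1, step):
            yield r, d, rd_map[r:r + win, d:d + win]

def score(patch: np.ndarray) -> float:
    # CNN stand-in: a trained network would output a detection score.
    return float(np.mean(patch))

rd = np.zeros((32, 64))               # synthetic range-Doppler map
rd[10:13, 40:43] = 5.0                # 3 x 3 target blob
best = max(sliding_windows(rd, win=5), key=lambda t: score(t[2]))
print("peak window at", best[:2])
```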
To improve the performance, binary integration can be introduced: the detection is repeated several times for a given range-Doppler cell and the final decision is taken if the number of detections exceeds a certain value (e.g., majority vote). The architecture of the CNN can be easily modified to incorporate binary integration [95]. An alternative heuristic is to exploit the local spatiotemporal information in the detection process by stacking a few range-Doppler images on top of each other as separate channels,


then combining them so that persistent detections pop up [98].16 Recent advances show that CNN-based detectors have the potential to replace traditional CFAR algorithms in certain multitarget scenarios with nonhomogeneous interference, and their complexity can be reduced by suitable modifications, thus suggesting a possible adoption in real-time radar applications [97]. Still, coherent joint processing of multiple pulses in colored noise via deep learning techniques, while ensuring the generalized CFAR property, is an open problem.
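The binary integration mentioned above admits a simple sketch: with an m-of-n rule over independent looks, the integrated false alarm rate is a binomial tail probability.

```python
import numpy as np
from math import comb

def integrated_pfa(p_single: float, n: int, m: int) -> float:
    # Probability that at least m of n independent looks fire under H0.
    return sum(comb(n, k) * p_single**k * (1 - p_single)**(n - k)
               for k in range(m, n + 1))

def binary_integrate(decisions: np.ndarray, m: int) -> np.ndarray:
    """decisions: (n_looks, ...) boolean array of single-look detections."""
    return decisions.sum(axis=0) >= m

p = integrated_pfa(1e-2, n=5, m=3)
print(f"integrated P_FA = {p:.2e}")
```

With a 3-of-5 rule, a single-look PFA of 1e-2 shrinks to roughly 1e-5, which is why binary integration is an effective false alarm reducer.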

3.3.6.2 Detection on Raw Data

Several other architectures are clearly possible, and only a small fraction have been experimented with so far. For instance, the adoption of Faster R-CNN (region CNN) brings some computational advantages and has been validated on calibration data17 from a 77-GHz automotive radar [99]. With this approach, complex I/Q raw radar data can be fed directly to the deep neural network, which enables detection and localization in the 4-D space of range, Doppler, azimuth, and elevation. The aim is to replace the entire signal processing chain found in certain types of radar (including automotive ones), which, although suboptimal, is widely used for the sake of simplicity and practicality [100]: detection on the range-Doppler map, typically using a (square-law) detector followed by a CFAR processor, and, upon detection, beamforming to estimate the DOA. Whether deep learning based detection on raw data is computationally achievable on an embedded microcontroller, especially in reduced-complexity radars, remains to be ascertained. Additionally, while a standard CFAR method is computationally very efficient and easily interpretable (for instance, the user might adapt the threshold to adjust the performance in a very intuitive way), how to control and explain the behavior of deep learning based detection remains an issue, as discussed more deeply in Chapter 5.

16. This deep temporal detection approach is inspired by the way a trained radar operator recognizes a target in a range-Doppler display; that is, based on its persistence, not just its instantaneous power, while false detections are observed as flickering noise. This is not possible in a conventional CFAR detector, as it does not evaluate from dwell to dwell [98]. At the same time, both separate CFAR processing and binary integration are suboptimal compared to joint coherent (multipulse, multiantenna) processing, for which, however, suitable fully data-driven approaches are still unavailable.

17. Radar sensor array responses to a known point target (corner reflector) located at a variety of angles, measured in an anechoic chamber, are used.

To augment the dataset for training the deep network on raw complex data and improve the limited diversity of calibration data (target echoes come from one or a few static targets at fixed range and Doppler bins), in [99] the recorded chirp signals are multiplied by a complex exponential term to introduce a desired

Coluccia: “chapter3_v2” — 2022/10/7 — 17:30 — page 122 — #26

Radar Applications of Machine Learning

123

phase shift, thereby obtaining radar targets at any range and Doppler.18 Moreover, since each radar data frame typically contains significantly larger amount of nondetection bins than targets, class-balanced cross-entropy loss is used in the training. Several other improvements are possible, and again binary integration of multiple frames can be introduced to reduce the false alarm rate [101]. Unfortunately, as anticipated, the computational feasibility and effectiveness of such approaches in more sophisticated radar detection setups and operational scenarios is still not fully demonstrated. In particular, the fundamental issue of controlling PFA while performing adaptive coherent processing remains only partially addressed, typically by the use of a softmax layer at the output (which produces a single value in [0, 1], to be thresholded in accordance with the desired PFA ) or an additional false alarm controllable SVM classifier ([59], discussed earlier in Section 3.3.4) [102]. Moreover, dataset collection for different backgrounds and SNRs is challenging, unless simulations are used instead of real measurements (i.e., synthetic datasets are adopted, which however need some modeling assumptions to be generated [103]). This is just one of the many hybridizations between traditional signal processing and deep learning for radar detection, in this case to partially overcome the bottleneck of lack of large-labeled datasets. Other examples are preprocessing techniques such as pulse compression, Doppler processing, and time-frequency transforms, which can help to improve the detection performance. Indeed, while NNs are very powerful as classifiers, they need to be combined with more principled approaches to guarantee control on the PFA and the CFAR property. 
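The range/Doppler relocation trick can be sketched in a minimal numpy example (sizes and frequencies are illustrative, not those of [99]; only the fast-time shift that relocates the range bin is shown, and the same idea applies along slow time for Doppler):

```python
import numpy as np

N = 256                                  # fast-time samples per chirp (illustrative)
n = np.arange(N)
raw = np.exp(2j * np.pi * 25 / N * n)    # recorded beat signal: target in range bin 25

# Multiplying by a complex exponential adds a linear phase, i.e., a frequency
# shift, which relocates the target echo by a chosen number of range bins.
shift_bins = 20
augmented = raw * np.exp(2j * np.pi * shift_bins / N * n)

peak_raw = int(np.argmax(np.abs(np.fft.fft(raw))))
peak_aug = int(np.argmax(np.abs(np.fft.fft(augmented))))
print(peak_raw, peak_aug)   # 25 45
```

In this way a handful of static calibration targets can populate arbitrary range (and, by the slow-time analogue, Doppler) bins of the training set.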
However, the available literature on NNs and deep learning for adaptive radar detection still remains relatively sparse and lacks high-level maturity: mainly shallow NNs and CNNs have been investigated, despite the great variety of architectures and deep learning techniques proposed in other fields [103]. Given the complexity and challenges of the radar detection problem, further research is needed; at the same time, hybridization can be advanced significantly by combining model-based detection, grounded in statistically principled methods, with data-driven approaches more amenable to mathematical tractability, as will be discussed in Chapter 4.

3.4 Other Approaches

3.4.1 Unsupervised Learning and Anomaly Detection

We have seen that the availability of large datasets is a limitation for the application of machine (especially deep) learning tools to radar detection. However, it is comparatively easier to collect H0 data than H1 data, given the great variety of the latter in real-world scenarios in terms of type of target, SNR (or SINR/SCR), Doppler, DOA, and correlation statistics. The use of multiple features, possibly even high-dimensional—by construction, kernel trick, or automatic feature extraction—can help to better discriminate between the two hypotheses, since they provide target vs disturbance views from many different perspectives; still, how best to characterize and obtain H1 datasets is a very challenging task. To overcome this difficulty, unsupervised learning techniques have been considered as an alternative to the more common supervised learning, in particular by reformulating the detection problem as an anomaly detection problem. While traditional anomaly detection tools can be used as well, a recent trend has shown the potential of machine learning approaches for this purpose.

18. The chirp is in fact an LFM signal whose complex baseband representation is e^{jπSt²}, with S = B/T_chirp, where B is the signal bandwidth and T_chirp the chirp duration.

Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

3.4.1.1 PCA-Based Detector

A basic solution is to consider a dimensionality reduction technique such as PCA, from which reconstruction errors are computed and used as a detection statistic (an approach known as the PCA-based detector [104]), with a threshold that can be set to guarantee a desired PFA [86]. This approach is not CFAR but can be applied to feature vectors of any size, overcoming the limitation of convex-hull-based anomaly detectors, which construct a decision region by enclosing the H0 data cluster in a convex hull. Such a computation is in fact feasible only up to 3-D feature spaces [87] and is sensitive to the geometric shape of the H0 data cluster (which may not be well suited to convex hull enclosing).

3.4.1.2 One-Class Classifiers

A more sophisticated approach is based on one-class classifiers, whose goal is to determine whether or not test data belong to the same (unique) class as the training data. To this end, training is performed using only the normal data representative of the large majority of cases. For instance, in one-class SVM the decision region is obtained by computing the hyperplane (in feature space, possibly kernelized) that maximally separates all points from the origin [105]. An alternative approach, called support vector data description (SVDD), is to identify the smallest hypersphere that encloses all the training data in feature space [106]: by minimizing the volume, the risk of incorporating outliers in the solution is also minimized. For the problem of target detection, SVM can thus be used either in a two-class (supervised, as in [59]; see Section 3.3.1) or a one-class (unsupervised, anomaly detection) configuration [88].

3.4.1.3 Isolation Forest

As mentioned, high-dimensional spaces containing a great variety of features may improve the discrimination performance. Besides the PCA-based detector and one-class SVM, high-dimensional feature spaces can also be handled via the isolation forest [107], an unsupervised version of the random forest specifically tailored to anomaly detection. This technique has the advantage of linear computational complexity with a low constant, together with a low memory requirement. The forest is obtained by recursive binary partitioning of the feature space, specifically by random selection of features and split values (between the minimum and maximum values of the selected feature), which produces a large number of binary trees. In the case of radar detection, since H1 samples are far away from H0 samples compared to the intraclass dispersion of the latter, they are typically closer to the root node in every tree; conversely, H0 samples will reach deeper leaf nodes after a longer sequence of partitionings. As a consequence, the average path length over the forest can be used as the test statistic. Moreover, PFA can be controlled by computing the average path lengths of all H0 training data and selecting a certain percentile as the detection threshold [89].

3.4.2 Reinforcement Learning

A third way to approach radar detection via fully data-driven techniques is to consider, instead of supervised or unsupervised learning, reinforcement learning (Section 2.3.5). Research on this topic is still at a very early stage, but a few interesting directions have already been identified. A first idea is to exploit reinforcement learning for the joint design of waveform and detector in complex scenarios. In fact, it is known that for best performance radar waveforms and detectors should be designed jointly [108]. This is traditionally performed by relying on mathematical models of targets, clutter, and noise.
While optimal solutions based on the NP criterion can be found in some cases, in more complex scenarios (e.g., involving non-Gaussian models) the structure of optimal detectors generally involves intractable numerical integrations, leading to various suboptimal approaches. Additionally, operational constraints are often posed, such as a peak-to-average-power ratio (PAR) limit or the coexistence of radar and communication systems, which further complicate the problem. This is thus a case where data-driven approaches based on end-to-end learning may help. In [109], a radar system architecture based on the joint training of the detector and the transmitted waveform, both implemented as feedforward multilayer NNs, is proposed. Two design approaches are considered. In the first, detector and transmitted waveform are trained alternately: for a fixed waveform, the detector is trained using supervised learning to approximate the NP detector; for a fixed detector, the transmitted waveform is trained via policy gradient-based reinforcement learning. In the second, the detector and transmitter are trained simultaneously. Such approaches are extended to incorporate waveform constraints, specifically PAR and spectral compatibility. Theoretical as well as simulation results are discussed, which show promising capabilities to learn the optimal detector, although the CFAR property cannot currently be guaranteed by this framework.
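Purely as a toy illustration of the policy-gradient ingredient (not the actual architecture of [109], whose detector and waveform are neural networks), the following sketch uses a REINFORCE-style update to learn which spectral bin a transmitter should favor when one bin is corrupted by strong interference; all names, sizes, and the reward model are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
noise_var = np.ones(N)
noise_var[2] = 25.0        # one spectral bin is occupied by strong interference

theta = np.zeros(N)        # logits of a softmax policy over transmit bins
lr = 0.5
for _ in range(2000):
    p = np.exp(theta) / np.exp(theta).sum()
    k = rng.choice(N, p=p)          # action: transmit in bin k
    r = 1.0 / noise_var[k]          # reward: SNR-like payoff of that bin
    grad = -p.copy()
    grad[k] += 1.0                  # REINFORCE: gradient of log softmax prob
    theta += lr * r * grad

p = np.exp(theta) / np.exp(theta).sum()
print(round(float(p[2]), 3))        # learned policy avoids the interfered bin
```

The same principle, with the waveform parameterized by a network and the reward supplied by the detector, underlies the alternating training scheme described above.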


Reinforcement learning is more generally regarded as a way to let the radar detector learn through experience, choosing the best action to achieve its goal while interacting with an unknown environment, without any preassigned control policy, a concept similar to the cognitive radar iterative feedback control system envisioned by Haykin [110, 111]; see also [112, 113]. Within this rationale, waveform design for colocated MIMO has been investigated in [114] for white Gaussian noise with known statistics (power level). More complex and dynamic scenarios can be handled under the special setup of colocated massive MIMO radars [115, 116]. In fact, considering a single transmitted pulse (single-snapshot scenario) with multiple transmit and receive antennas provides an additional degree of freedom that can be used to steer the beampattern toward specific directions and perform per-pulse detection under spatially colored disturbance with unknown covariance matrix, which can change in time and space. The only assumption is that the disturbance is a realization of a discrete-time, circular, complex random process with a polynomial decay of its autocorrelation function, an assumption weak enough to encompass the most common practical disturbance models, such as autoregressive (AR), autoregressive moving average (ARMA), or general correlated non-Gaussian models [115]. While model-based approaches such as the GLRT and the Wald test cannot be applied without knowledge of the functional form of the disturbance’s pdf, thanks to the massive number of antennas a robust Wald-type detector can be derived whose asymptotic distribution is independent of the unknown parameters under both H0 and H1. Such a property allows setting the threshold that guarantees a chosen PFA irrespective of the unknown pdf of the disturbance; moreover, a constraint is imposed to guarantee the CFAR property. The resulting reinforcement learning based Wald-type waveform selection scheme is able to provide the detector with a remarkable increase in PD while keeping the CFAR property with respect to a wide range of (unknown) disturbance models [116].

3.5 Summary

This chapter laid out a rather broad presentation of the possibilities of data-driven tools in radar contexts. References to several applications were provided; then, more emphasis was given to a problem that shares similarities with target detection, namely signal classification. A number of approaches were reviewed, paving the way for the core topic (target detection). Significantly more space was devoted to the latter, considering in particular several supervised learning approaches. A first class, based on engineered features, includes SVM, decision tree, and other approaches for which the PFA can be controlled. The use of such tools for approximating NP detectors in certain contexts, and for various CFAR processing under different radar detection schemes and philosophies, was also discussed. Examples using real-world data were provided to clarify the most important aspects of the problem. A second class of techniques was also examined, concerned with fully data-driven methods, in particular based on deep neural networks. Finally, alternative recent trends based on unsupervised learning, in particular anomaly detection techniques such as one-class classifiers, and on reinforcement learning, were noted.

Several critical aspects involved in the application of data-driven methods to the problem of target detection were discussed throughout, from data collection, annotation, and augmentation, up to experimentation and other issues encountered when bringing a novel paradigm to an old but complex problem (especially considering more advanced joint detection schemes and challenging environments). In particular, ways to control PFA, to guarantee the CFAR property, and to gain adaptiveness were illustrated. Hybridization between traditional signal processing and data-driven radar detection was touched on at a few points, motivated in particular by the need to partially overcome the bottleneck of the lack of large labeled datasets, as well as to promote desirable properties in the designed solution. Indeed, while data-driven tools are very powerful, they need to be combined with more principled approaches to guarantee control on the PFA and the CFAR property. The topic is still at an early stage, and despite the great variety of machine learning methods, architectures, and deep learning techniques that have been proposed in other fields, only a small fraction of them has been investigated in the context of radar detection, given the complexity and challenges of the problem. Combining model-based detection grounded in statistically principled methods with data-driven approaches more amenable to mathematical tractability is an interesting option, which is discussed next in Chapter 4.

References

[1] Mason, E., B. Yonel, and B. Yazici, “Deep Learning for Radar,” in 2017 IEEE Radar Conference (RadarConf), 2017, pp. 1703–1708.

[2] Zhu, X. X., D. Tuia, L. Mou, et al., “Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources,” IEEE Geoscience and Remote Sensing Magazine, Vol. 5, No. 4, 2017, pp. 8–36.

[3] Geng, Z., H. Yan, J. Zhang, and D. Zhu, “Deep-Learning for Radar: A Survey,” IEEE Access, Vol. 9, 2021, pp. 141800–141818.

[4] Zhu, X., S. Montazeri, M. Ali, et al., “Deep Learning Meets SAR: Concepts, Models, Pitfalls, and Perspectives,” IEEE Geoscience and Remote Sensing Magazine, 2021.


[5] Wagner, S. A., “SAR ATR by a Combination of Convolutional Neural Network and Support Vector Machines,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 52, No. 6, 2016, pp. 2861–2872.

[6] Majumder, U., E. Blasch, and D. Garren, Deep Learning for Radar and Communications Automatic Target Recognition, Norwood, MA: Artech House, 2020.

[7] Patole, S. M., M. Torlak, D. Wang, and M. Ali, “Automotive Radars: A Review of Signal Processing Techniques,” IEEE Signal Processing Magazine, Vol. 34, No. 2, 2017, pp. 22–35.

[8] Khomchuk, P., I. Stainvas, and I. Bilik, “Pedestrian Motion Direction Estimation Using Simulated Automotive MIMO Radar,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 52, No. 3, 2016, pp. 1132–1145.

[9] Greco, M. S., J. Li, T. Long, and A. Zoubir, “Advances in Radar Systems for Modern Civilian and Commercial Applications: Part 2 [From the Guest Editors],” IEEE Signal Processing Magazine, Vol. 36, No. 5, 2019, pp. 16–18.

[10] Heidenreich, P., A. Zoubir, I. Bilik, M. Greco, and M. Torlak, “Editorial: Introduction to the Issue on Recent Advances in Automotive Radar Signal Processing,” IEEE Journal of Selected Topics in Signal Processing, Vol. 15, No. 4, 2021, pp. 861–864.

[11] Mercuri, M., G. Sacco, R. Hornung, et al., “2-D Localization, Angular Separation and Vital Signs Monitoring Using a SISO FMCW Radar for Smart Long-Term Health Monitoring Environments,” IEEE Internet of Things Journal, Vol. 8, No. 14, 2021, pp. 11065–11077.

[12] Ding, C., H. Hong, Y. Zou, et al., “Continuous Human Motion Recognition with a Dynamic Range-Doppler Trajectory Method Based on FMCW Radar,” IEEE Transactions on Geoscience and Remote Sensing, Vol. 57, No. 9, 2019, pp. 6821–6831.

[13] Amin, M. G., Y. D. Zhang, F. Ahmad, and K. D. Ho, “Radar Signal Processing for Elderly Fall Detection: The Future for In-Home Monitoring,” IEEE Signal Processing Magazine, Vol. 33, No. 2, 2016, pp. 71–80.

[14] Le Kernec, J., F. Fioranelli, C. Ding, et al., “Radar Signal Processing for Sensing in Assisted Living: The Challenges Associated with Real-Time Implementation of Emerging Algorithms,” IEEE Signal Processing Magazine, Vol. 36, No. 4, 2019, pp. 29–41.

[15] Abdu, F. J., Y. Zhang, M. Fu, Y. Li, and Z. Deng, “Application of Deep Learning on Millimeter-Wave Radar Signals: A Review,” Sensors, Vol. 21, No. 6, 2021, https://www.mdpi.com/1424-8220/21/6/1951.

[16] Chen, V., The Micro-Doppler Effect in Radar, Second Edition, Norwood, MA: Artech House, 2019.

[17] Fioranelli, F., H. Griffiths, M. Ritchie, and A. Balleri (eds.), Micro-Doppler Radar and Its Applications, The Institution of Engineering and Technology, 2020.

[18] Fioranelli, F., H. Griffiths, M. Ritchie, and A. Balleri (eds.), Micro-Doppler Radar and Its Applications, The Institution of Engineering and Technology, 2020.

[19] Coluccia, A., G. Parisi, and A. Fascista, “Detection and Classification of Multirotor Drones in Radar Sensor Networks: A Review,” Sensors, Vol. 20, No. 15, 2020.

[20] Coluccia, A., A. Fascista, A. Schumann, et al., “Drone vs Bird Detection: Deep Learning Algorithms and Results from a Grand Challenge,” Sensors, Vol. 21, No. 8, 2021.


[21] Taha, B., and A. Shoufan, “Machine Learning-Based Drone Detection and Classification: State-of-the-Art in Research,” IEEE Access, Vol. 7, 2019, pp. 138669–138682.

[22] Shi, X., C. Yang, W. Xie, C. Liang, Z. Shi, and J. Chen, “Anti-Drone System with Multiple Surveillance Technologies: Architecture, Implementation, and Challenges,” IEEE Communications Magazine, Vol. 56, No. 4, 2018, pp. 68–74.

[23] Santra, A., and S. Hazra, Deep Learning Applications of Short-Range Radars, Norwood, MA: Artech House, 2020.

[24] Elbir, A. M., K. V. Mishra, and Y. C. Eldar, “Cognitive Radar Antenna Selection via Deep Learning,” IET Radar, Sonar & Navigation, Vol. 13, No. 6, 2019, pp. 871–880.

[25] Davoli, A., G. Guerzoni, and G. M. Vitetta, “Machine Learning and Deep Learning Techniques for Colocated MIMO Radars: A Tutorial Overview,” IEEE Access, Vol. 9, 2021, pp. 33704–33755.

[26] Kerbaa, T. H., A. Mezache, F. Gini, and M. S. Greco, “CNN-LSTM Based Approach for Parameter Estimation of K-Clutter Plus Noise,” in 2020 IEEE Radar Conference (RadarConf20), 2020, pp. 1–6.

[27] Wagner, S., and W. Johannes, “Target Detection Using Autoencoders in a Radar Surveillance System,” in 2019 International Radar Conference (RADAR), 2019, pp. 1–5.

[28] Dong, G., G. Liao, H. Liu, and G. Kuang, “A Review of the Autoencoder and Its Variants: A Comparative Perspective from Target Recognition in Synthetic-Aperture Radar Images,” IEEE Geoscience and Remote Sensing Magazine, Vol. 6, No. 3, 2018, pp. 44–68.

[29] O’Shea, T. J., J. Corgan, and T. C. Clancy, “Convolutional Radio Modulation Recognition Networks,” in Engineering Applications of Neural Networks, C. Jayne and L. Iliadis (eds.), Cham, Switzerland: Springer International Publishing, 2016, pp. 213–226.

[30] Su, W., “Survey of Automatic Modulation Classification Techniques: Classical Approaches and New Trends,” IET Communications, Vol. 1, April 2007, pp. 137–156.

[31] O’Shea, T. J., T. Roy, and T. C. Clancy, “Over-the-Air Deep Learning Based Radio Signal Classification,” IEEE Journal of Selected Topics in Signal Processing, Vol. 12, No. 1, 2018, pp. 168–179.

[32] Peng, S., S. Sun, and Y.-D. Yao, “A Survey of Modulation Classification Using Deep Learning: Signal Representation and Data Preprocessing,” IEEE Transactions on Neural Networks and Learning Systems, 2021, pp. 1–19.

[33] Huynh-The, T., Q.-V. Pham, T.-V. Nguyen, et al., “Automatic Modulation Classification: A Deep Architecture Survey,” IEEE Access, Vol. 9, 2021, pp. 142950–142971.

[34] Chen, M., U. Challita, W. Saad, C. Yin, and M. Debbah, “Artificial Neural Networks-Based Machine Learning for Wireless Networks: A Tutorial,” IEEE Communications Surveys & Tutorials, Vol. 21, No. 4, 2019, pp. 3039–3071.

[35] Simeone, O., “A Very Brief Introduction to Machine Learning with Applications to Communication Systems,” IEEE Transactions on Cognitive Communications and Networking, Vol. 4, No. 4, 2018, pp. 648–664.

[36] Gunduz, D., P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. Murthy, and M. van der Schaar, “Machine Learning in the Air,” arXiv:1904.12385, Tech. Rep., 2019.


[37] O’Shea, T. J., T. Roy, and N. West, “Approximating the Void: Learning Stochastic Channel Models from Observation with Variational Generative Adversarial Networks,” in 2019 International Conference on Computing, Networking and Communications (ICNC), 2019, pp. 681–686.

[38] O’Shea, T., and J. Hoydis, “An Introduction to Deep Learning for the Physical Layer,” IEEE Transactions on Cognitive Communications and Networking, Vol. 3, No. 4, 2017, pp. 563–575.

[39] Ribeiro, M. T., S. Singh, and C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” arXiv:1602.04938, 2016.

[40] Stone, M., “Cross-Validatory Choice and Assessment of Statistical Predictions,” Journal of the Royal Statistical Society, Series B (Methodological), Vol. 36, No. 2, 1974, pp. 111–147, http://www.jstor.org/stable/2984809.

[41] Gogineni, S., J. R. Guerci, H. K. Nguyen, J. S. Bergin, B. C. Watson, and M. Rangaswamy, “Demonstration of High-Fidelity Modeling & Simulation for Cognitive Radar,” in 2020 IEEE Radar Conference (RadarConf20), 2020, pp. 1–3.

[42] Sankhe, K., M. Belgiovine, F. Zhou, S. Riyaz, S. Ioannidis, and K. Chowdhury, “ORACLE: Optimized Radio Classification through Convolutional Neural Networks,” in IEEE INFOCOM 2019 IEEE Conference on Computer Communications, 2019, pp. 370–378.

[43] Sankhe, K., M. Belgiovine, F. Zhou, et al., “No Radio Left Behind: Radio Fingerprinting Through Deep Learning of Physical-Layer Hardware Impairments,” IEEE Transactions on Cognitive Communications and Networking, Vol. 6, No. 1, 2020, pp. 165–178.

[44] Stewart, R. W., K. W. Barlee, and D. S. Atkinson, Software Defined Radio Using MATLAB & Simulink and the RTL-SDR, Strathclyde Academic Media, 2015.

[45] Zeng, Y., and Y.-C. Liang, “Spectrum-Sensing Algorithms for Cognitive Radio Based on Statistical Covariances,” IEEE Transactions on Vehicular Technology, Vol. 58, No. 4, May 2009.

[46] Coluccia, A., A. Fascista, and G. Ricci, “Spectrum Sensing by Higher-Order SVM-Based Detection,” in EUSIPCO, 2019.

[47] Lang, P., X. Fu, M. Martorella, J. Dong, R. Qin, X. Meng, and M. Xie, “A Comprehensive Survey of Machine Learning Applied to Radar Signal Processing,” arXiv:2009.13702, 2020.

[48] Lunden, J., and V. Koivunen, “Automatic Radar Waveform Recognition,” IEEE Journal of Selected Topics in Signal Processing, Vol. 1, No. 1, 2007, pp. 124–136.

[49] Jordanov, I., N. Petrov, and A. Petrozziello, “Supervised Radar Signal Classification,” in 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 1464–1471.

[50] Jordanov, I., N. Petrov, and A. Petrozziello, “Classifiers Accuracy Improvement Based on Missing Data Imputation,” Journal of Artificial Intelligence and Soft Computing Research, Vol. 8, No. 1, 2018, pp. 31–48.

[51] Guo, S., and H. Tracey, “Discriminant Analysis for Radar Signal Classification,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 56, No. 4, 2020, pp. 3134–3148.


[52] Lang, P., X. Fu, M. Martorella, et al., “RRSARNet: A Novel Network for Radar Radio Sources Adaptive Recognition,” IEEE Transactions on Vehicular Technology, Vol. 70, No. 11, 2021, pp. 11483–11498.

[53] Guo, S., S. Akhtar, and A. Mella, “A Method for Radar Model Identification Using Time-Domain Transient Signals,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 57, No. 5, 2021, pp. 3132–3149.

[54] Sun, J., G. Xu, W. Ren, and Z. Yan, “Radar Emitter Classification Based on Unidimensional Convolutional Neural Network,” IET Radar, Sonar & Navigation, Vol. 12, No. 8, 2018, pp. 862–867.

[55] Yar, E., M. B. Kocamis, A. Orduyilmaz, M. Serin, and M. Efe, “A Complete Framework of Radar Pulse Detection and Modulation Classification for Cognitive EW,” in 2019 27th European Signal Processing Conference (EUSIPCO), 2019, pp. 1–5.

[56] Billingsley, J., Low-Angle Radar Land Clutter: Measurements and Empirical Models, Norwich, NY: William Andrew Publishing, 2002, https://books.google.it/books?id=FEkn0-h7sz0C.

[57] Deng, H., Z. Geng, and B. Himed, “Radar Target Detection Using Target Features and Artificial Intelligence,” in 2018 International Conference on Radar (RADAR), 2018, pp. 1–4.

[58] Davenport, M., R. Baraniuk, and C. Scott, “Controlling False Alarms with Support Vector Machines,” in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, Vol. 5, 2006.

[59] Li, Y., P. Xie, Z. Tang, T. Jiang, and P. Qi, “SVM-Based Sea-Surface Small Target Detection: A False-Alarm-Rate-Controllable Approach,” IEEE Geoscience and Remote Sensing Letters, Vol. 16, No. 8, 2019, pp. 1225–1229.

[60] Broadwater, J. B., C. Carmen, and A. J. Llorens, “False Alarm Constrained Classification,” in Proceedings of the 4th International Conference and Exhibition on “Underwater Acoustic Measurements: Technologies & Results,” Kos Island, Greece, June 20–24, 2011.

[61] Peterson, W. V., and W. J. Comeau, “User Manual for Malta Plateau Clutter Track Database,” Naval Undersea Warfare Center Division, Tech. Rep., 2007.

[62] Wu, S.-H., K.-P. Lin, H.-H. Chien, C.-M. Chen, and M.-S. Chen, “On Generalizable Low False-Positive Learning Using Asymmetric Support Vector Machines,” IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 5, 2013, pp. 1083–1096.

[63] Zhou, H., and T. Jiang, “Decision Tree Based Sea-Surface Weak Target Detection with False Alarm Rate Controllable,” IEEE Signal Processing Letters, Vol. 26, No. 6, 2019, pp. 793–797.

[64] Scharf, L., and C. Demeure, Statistical Signal Processing: Detection, Estimation, and Time Series Analysis, Addison-Wesley, 1991.

[65] Davenport, M. A., R. G. Baraniuk, and C. D. Scott, “Tuning Support Vector Machines for Minimax and Neyman–Pearson Classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

[66] de la Mata-Moya, D., M. P. Jarabo-Amores, J. Martín de Nicolas, and M. Rosa-Zurera, “Approximating the Neyman–Pearson Detector with 2C-SVMs: Application to Radar Detection,” Signal Processing, Vol. 131, 2017, pp. 364–375.


[67] Jarabo-Amores, M.-P., M. Rosa-Zurera, R. Gil-Pita, and F. Lopez-Ferreras, “Study of Two Error Functions to Approximate the Neyman–Pearson Detector Using Supervised Learning Machines,” IEEE Transactions on Signal Processing, Vol. 57, No. 11, 2009, pp. 4175–4181.

[68] Cannon, A., J. Howse, D. Hush, and C. Scovel, “Learning with the Neyman–Pearson and Min-Max Criteria,” Los Alamos National Laboratory, Tech. Rep., 2002.

[69] Scott, C., and R. Nowak, “A Neyman–Pearson Approach to Statistical Learning,” IEEE Transactions on Information Theory, Vol. 51, No. 11, 2005, pp. 3806–3819.

[70] Ball, J. E., “Low Signal-to-Noise Ratio Radar Target Detection Using Linear Support Vector Machines (L-SVM),” in IEEE Radar Conference, 2014.

[71] Wang, L., D. Wang, and C. Hao, “Intelligent CFAR Detector Based on Support Vector Machine,” IEEE Access, Vol. 5, 2017, pp. 26965–26972.

[72] Rohman, B. P., D. Kurniawan, and M. T. Miftahushudur, “Switching CA/OS CFAR Using Neural Network for Radar Target Detection in Non-Homogeneous Environment,” in 2015 International Electronics Symposium (IES), 2015, pp. 280–283.

[73] Skolnik, M. I., Radar Handbook, Third Edition, New York: McGraw-Hill, 2008.

[74] Schleher, D. C., MTI and Pulsed Doppler Radar, Norwood, MA: Artech House, 1991.

[75] Mata-Moya, D., N. del Rey-Maestre, V. M. Pelaez-Sánchez, M.-P. Jarabo-Amores, and J. M. de Nicolas, “MLP-CFAR for Improving Coherent Radar Detectors Robustness in Variable Scenarios,” Expert Systems with Applications, Vol. 42, No. 11, 2015, pp. 4878–4891, https://www.sciencedirect.com/science/article/pii/S0957417415000056.

[76] Akhtar, J., and K. E. Olsen, “A Neural Network Target Detector with Partial CA-CFAR Supervised Training,” in 2018 International Conference on Radar (RADAR), 2018, pp. 1–6.

[77] Vizcarro i Carretero, M., R. I. A. Harmanny, and R. P. Trommel, “Smart-CFAR, a Machine Learning Approach to Floating Level Detection in Radar,” in 2019 16th European Radar Conference (EuRAD), 2019, pp. 161–164.

[78] Akhtar, J., “Training of Neural Network Target Detectors Mentored by SO-CFAR,” in 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 1522–1526.

[79] Cheikh, K., and S. Faozi, “Application of Neural Networks to Radar Signal Detection in K-Distributed Clutter,” in First International Symposium on Control, Communications and Signal Processing, 2004, pp. 295–298.

[80] Cheikh, K., and F. Soltani, “Application of Neural Networks to Radar Signal Detection in K-Distributed Clutter,” IEE Proceedings–Radar, Sonar and Navigation, Vol. 153, October 2006, pp. 460–466.

[81] Vicen-Bueno, R., M. Jarabo-Amores, M. Rosa-Zurera, R. Gil-Pita, and D. Mata-Moya, “Detection of Known Targets in Weibull Clutter Based on Neural Networks: Robustness Study Against Target Parameters Changes,” in 2008 IEEE Radar Conference, 2008, pp. 1–6.

[82] Gandhi, P., and V. Ramamurti, “Neural Networks for Signal Detection in Non-Gaussian Noise,” IEEE Transactions on Signal Processing, Vol. 45, No. 11, 1997, pp. 2846–2851.

[83] Bhattacharya, T., and S. Haykin, “Neural Network-Based Radar Detection for an Ocean Environment,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 33, No. 2, 1997, pp. 408–420.


[84] Greco, M., and A. De Maio (eds.), Modern Radar Detection Theory, Edison, NJ: SciTech, 2015.

[85] Conte, E., M. Lops, and G. Ricci, “Adaptive Radar Detection in Compound-Gaussian Clutter,” in Seventh European Signal Processing Conference (EUSIPCO), Edinburgh, Scotland, 1994.

[86] Gu, T., “Detection of Small Floating Targets on the Sea Surface Based on Multi-Features and Principal Component Analysis,” IEEE Geoscience and Remote Sensing Letters, Vol. 17, No. 5, 2020, pp. 809–813.

[87] Shi, S.-N., and P.-L. Shui, “Sea-Surface Floating Small Target Detection by One-Class Classifier in Time-Frequency Feature Space,” IEEE Transactions on Geoscience and Remote Sensing, Vol. 56, No. 11, 2018, pp. 6395–6411.

[88] Xu, S., J. Zhu, P. Shui, and X. Xia, “Floating Small Target Detection in Sea Clutter by One-Class SVM Based on Three Detection Features,” in 2019 International Applied Computational Electromagnetics Society Symposium–China (ACES), Vol. 1, 2019, pp. 1–2.

[89] Xu, S., J. Zhu, J. Jiang, and P. Shui, “Sea-Surface Floating Small Target Detection by Multi-Feature Detector Based on Isolation Forest,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 14, 2021, pp. 704–715.

[90] López-Risueño, G., J. Grajal, S. Haykin, and R. Díaz-Oliver, “Convolutional Neural Networks for Radar Detection,” in Artificial Neural Networks—ICANN 2002, J. R. Dorronsoro (ed.), Berlin: Springer, 2002, pp. 1150–1155.

[91] Su, N., X. Chen, J. Guan, and Y. Li, “Deep CNN-Based Radar Detection for Real Maritime Target Under Different Sea States and Polarizations,” in Cognitive Systems and Signal Processing, F. Sun, H. Liu, and D. Hu (eds.), Singapore: Springer, 2019, pp. 321–331.

[92] Lopez-Risueno, G., J. Grajal, and R. Diaz-Oliver, “Target Detection in Sea Clutter Using Convolutional Neural Networks,” in Proceedings of the 2003 IEEE Radar Conference, 2003, pp. 321–328.

[93] Grajal, J., A. Quintas, and G. Lopez-Risueno, “MTD Detector Using Convolutional Neural Networks,” in IEEE International Radar Conference, 2005, pp. 827–831.

[94]

Jing, H., Y. Cheng, H. Wu, X. Cao, and H. Wang, “Radar Target Detection Method Based on Neural Network Ensemble,” in 2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), 2021, pp. 1–5.

[95] Wang, L., J. Tang, and Q. Liao, “A Study on Radar Target Detection Based on Deep Neural Networks,” IEEE Sensors Letters, Vol. 3, No. 3, 2019, pp. 1–4. [96]

Xie, Y., J.Tang, and L. Wang, “RadarTarget Detection Using Convolutional Neutral Network In Clutter,” in 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), 2019, pp. 1–6.

[97] Yavuz, F., “Radar Target Detection with CNN,” in 2021 29th European Signal Processing Conference (EUSIPCO), 2021, pp. 1581–1585. [98]

Gusland, D., S. Rolfsjord, and B. Torvik, “Deep Temporal Detection—A Machine Learning Approach to Multiple-Dwell Target Detection,” in 2020 IEEE International Radar Conference (RADAR), 2020, pp. 203–207.

Coluccia:

“chapter3_v2” — 2022/10/7 — 17:30 — page 133 — #37

134

Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

[99]

Brodeski, D., I. Bilik, and R. Giryes, “Deep Radar Detector,” in 2019 IEEE Radar Conference (RadarConf ), 2019, pp. 1–6.

[100] Sun, S., A. P. Petropulu, and H. V. Poor, “MIMO Radar for Advanced Driver-Assistance Systems and Autonomous Driving: Advantages and Challenges,” IEEE Signal Processing Magazine, Vol. 37, No. 4, 2020, pp. 98–117. [101] Chen, X., J. Guan, X. Mu, Z. Wang, N. Liu, and G. Wang, “Multi-Dimensional Automatic Detection of Scanning Radar Images of Marine Targets Based on Radar PPINET,” Remote Sensing, Vol. 13, 2021, No. 19, https://www.mdpi.com/2072-4292/13/19/3856. [102] Chen, X., N. Su, Y. Huang, and J. Guan, “False-Alarm-Controllable Radar Detection for Marine Target Based on Multi Features Fusion via CNNs,” IEEE Sensors Journal, Vol. 21, No. 7, 2021, pp. 9099– 9111. [103] Jiang, W., Y. Ren, Y. Liu, and J. Leng, “A Method of Radar Target Detection Based on Convolutional Neural Network,” Neural Computing and Applications, Vol. 33, No. 16, 2021, pp. 9835–9847. [104] Saha, B. N., N. Ray, and H. Zhang, “Snake Validation: A PCA-Based Outlier Detection Method,” IEEE Signal Processing Letters, Vol. 16, No. 6, 2009, pp. 549–552. [105] Scholkopf, B., R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt, “Support Vector Method for Novelty Detection,” in Proceedings of the 12th International Conference on Neural Information Processing Systems, Cambridge, MA: MIT Press, 1999, pp. 582–588. [106] Tax, D. M. J., and R. P. W. Duin, “Support Vector Data Description,” Machine Learning, Vol. 54, No. 1, 2004, pp. 45–66, https://doi.org/10.1023/B: MACH.0000008084.60811.49. [107] Liu, F. T., K. M. Ting, and Z.- H. Zhou, “Isolation Forest,” in 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422. [108] Gini, F., A. De Maio, and L. Patton (eds.), Waveform Design and Diversity for Advanced Radar Systems, London: Institution of Engineering and Technology, 2012. [109] Jiang, W., A. M. Haimovich, and O. Simeone, “Joint Design of Radar Waveform And Detector via End-to-End Learning with Waveform Constraints,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 
58, No. 1, 2021, pp. 552–567. [110] Haykin, S., “Cognitive Radar: A Way of the Future,” IEEE Signal Processing Magazine, Vol. 23, No. 1, 2006, pp. 30–40. [111] Haykin, S., and J. M. Fuster, “On cognitive Dynamic Systems: Cognitive Neuroscience and Engineering Learning from Each Other,” Proceedings of the IEEE, Vol. 102, No. 4, 2014, pp. 608–628. [112] Guerci, J., Cognitive Radar: The Knowledge-Sided Fully Adaptive Approach, Norwood, MA: Artech House, 2010, https://books.google.it/books?id=8Mn\ C-iOzeEC. [113] Gurbuz, S. Z., H. D. Griffiths, A. Charlish, M. Rangaswamy, M. S. Greco, and K. Bell, “An Overview of Cognitive Radar: Past, Present, and Future,” IEEE Aerospace and Electronic Systems Magazine, Vol. 34, No. 12, 2019, pp. 6–18. [114] Wang, L,. S. Fortunati, M. S. Greco, and F. Gini, “Reinforcement Learning-Based Waveform Optimization for MIMO Multi-Target Detection,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 1329–1333.

Coluccia:

“chapter3_v2” — 2022/10/7 — 17:30 — page 134 — #38

Radar Applications of Machine Learning

135

[115] Ahmed, A. M., A. A. Ahmad, S. Fortunati, A. Sezgin, M. S. Greco, and F. Gini, “A Reinforcement Learning Based Approach for Multitarget Detection in Massive MIMO Radar,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 57, No. 5, 2021, pp. 2622–2636. [116] Ahmed, A. M., S. Fortunati, A. Sezgin, M. S. Greco, and F. Gini, “Robust Reinforcement Learning-Based Wald-Type Detector for Massive MIMO Radar,” in 2021 29th European Signal Processing Conference (EUSIPCO), 2021, pp. 846–850.

Coluccia: “chapter3_v2” — 2022/10/7 — 17:30 — page 135 — #39

Coluccia: “chapter3_v2” — 2022/10/7 — 17:30 — page 136 — #40

4
Hybrid Model-Based and Data-Driven Detection

This chapter is devoted to various forms of hybridization between model-based and data-driven concepts for the task of adaptive radar detection. Hybrid approaches, in this context, refer to dataset acquisition/generation and algorithmic procedures to design radar detectors. This should not be confused with hybrid learning approaches such as semisupervised learning, which tries to reduce the need for large datasets by combining plain supervised learning with some unsupervised techniques. Specifically, this chapter discusses techniques for data-driven CFAR detection and a restricted network architecture for the analysis and design of CFAR detectors, trying to bridge the gap between the classical model-based approach and data-driven tools. Particular attention is paid to the problem of designing data-driven and hybrid detectors that are robust to signal mismatches or that can, conversely, reject unwanted signals (different from the one of interest).

4.1 Concept Drift, Retraining, and Adaptiveness

We anticipated in Section 3.3.5 that machine learning based detectors/classifiers, like model-based approaches, have to cope with time-varying operational conditions. In the context of radar this recalls the central role of adaptiveness, pioneered by the work of Brennan and Reed [1]. For data-driven approaches, a similar issue is identified under the term concept drift, which refers to different kinds of changes that may arise in the data distribution over time, making the trained detector inadequate to cope with the new data instances that are processed. A sudden shift occurs when the change is abrupt, while a gradual shift is more progressive.
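As a toy illustration of how such drift might be monitored in practice, the following sketch (entirely our own construction; the window length, false-alarm rates, and tolerance are arbitrary assumptions, not values from the book) tracks the empirical false-alarm rate of a detector over target-free cells and raises a retraining alarm when it drifts away from the design value:

```python
import numpy as np

def empirical_pfa(decisions: np.ndarray, window: int) -> np.ndarray:
    """Sliding-window estimate of the false-alarm rate from binary
    detector outputs recorded on target-free (H0) cells."""
    kernel = np.ones(window) / window
    return np.convolve(decisions, kernel, mode="valid")

def drift_alarm(decisions, window, nominal_pfa, tol):
    """Flag retraining when the running PFA estimate leaves the tolerance
    band around the design value (covers sudden and gradual shifts)."""
    est = empirical_pfa(decisions, window)
    return np.abs(est - nominal_pfa) > tol

rng = np.random.default_rng(0)
# First half: detector matched to the background (PFA around 1e-2).
# Second half: clutter statistics change and false alarms rise (around 1e-1).
d = np.concatenate([rng.random(5000) < 1e-2,
                    rng.random(5000) < 1e-1]).astype(float)
alarms = drift_alarm(d, window=1000, nominal_pfa=1e-2, tol=3e-2)
```

A sudden shift shows up as a step in the running estimate, while a gradual shift produces a slow crossing of the tolerance band; in both cases the alarm marks when the H0 training data has become unrepresentative.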

Coluccia: “chapter4_v2” — 2022/10/7 — 13:07 — page 137 — #1


In the context of radar, major shifts in the statistical distribution of the data occur in particular when the characteristics of the clutter background vary, causing the very notion of a target emerging from the disturbance to change. Therefore, if data for training are collected under certain clutter conditions, the performance of the trained detector will likely degrade under a background with significantly different statistical characteristics. In machine learning this is called hidden context and is a fundamental cause of concept drift. To cope with such issues, typical approaches are to select training data that are most relevant to a certain context, or to weight instances to create a bias toward more representative data [2, 3]. The use of an ensemble of models is also a possible workaround: multiple algorithms are used simultaneously and their predictions are combined into a final decision that can be more accurate than any individual algorithm (see Section 2.3.4 for ensemble learning, therein introduced in the context of random forests). In any case, periodic retraining of the classifier/detector(s) is required, especially in highly dynamic contexts.

It is worth making a more detailed comparison between the model-based and data-driven approaches regarding the use of training data. In the classical full-adaptive detectors discussed in Chapter 1, at each time instant a very limited amount of secondary data from neighboring cells is selected, so as to train the detector about what the H0 hypothesis currently is. In the general case this requires estimating the covariance matrix of the overall disturbance in order to keep the detector always matched to the current background. Such an estimate is part of the decision statistic, either because it is plugged into a nonadaptive detector to replace the known covariance matrix (as in two-step GLRTs) or because it enters in a somewhat more complex way into the functional relationship encoded in the decision statistic to provide the (generalized) CFAR property. The functional relationship between data and decision statistic, what we may informally call the shape of the detector, is instead fixed, being the result of the detector's derivation at the design stage. Fully data-driven detectors are instead totally dependent on the chosen training dataset, which is used to learn the whole functional form of the decision statistic (within a certain family dictated by the chosen learning machine). While making a model-based detector adaptive is computationally affordable in practice, the training of deep learning methods may take several weeks of processing on dedicated hardware (such as GPUs). Indeed, for data-driven approaches, great care should be taken during the training phase. A main issue, discussed in Section 3.2.2, is how broad the training dataset for H0 should be, and also how specialized it should be for H1. In addition, if either robustness or selectivity to mismatches is desired, data with a suitable mismatch should be included in the corresponding dataset to promote the specific behavior. In any case, however, given that periodic retraining is needed to cope with variations of the background over time (concept drift), a significant amount of time and resources is consumed, which represents a major weakness compared to the way adaptiveness is implemented in model-based approaches. Thus, finding proper hybrid approaches that can combine classical full-adaptive detection with the power of data-driven tools is a very appealing third way. This research direction has started to be investigated only recently, but some effective ideas have already been identified and will be discussed throughout this chapter.
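To make the model-based side of this comparison concrete, here is a minimal numpy sketch (our own illustration, not code from the book; the dimensions, steering vector, and amplitude are arbitrary assumptions) of the two-step idea: the covariance of the disturbance is estimated from K secondary cells and plugged into an AMF-style adaptive matched-filter statistic:

```python
import numpy as np

def amf_statistic(z, Z, v):
    """AMF-style two-step adaptive statistic: the sample covariance
    estimated from the K secondary-data columns of Z replaces the true
    (unknown) disturbance covariance in the matched-filter expression
    t = |v^H S^{-1} z|^2 / (v^H S^{-1} v)."""
    K = Z.shape[1]
    S = (Z @ Z.conj().T) / K          # sample covariance of secondary data
    Si = np.linalg.inv(S)
    num = np.abs(v.conj() @ Si @ z) ** 2
    den = np.real(v.conj() @ Si @ v)
    return float(num / den)

rng = np.random.default_rng(1)
N, K = 8, 32                                       # illustrative sizes
v = np.exp(1j * 2 * np.pi * 0.1 * np.arange(N))    # assumed nominal steering vector
noise = lambda: (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
Z = np.stack([noise() for _ in range(K)], axis=1)  # target-free secondary data
t_h0 = amf_statistic(noise(), Z, v)                # CUT under H0
t_h1 = amf_statistic(noise() + 3.0 * v, Z, v)      # CUT under H1
```

Note that only the covariance estimate changes with the data; the functional shape of the statistic is fixed at design time, which is exactly the point of contrast with fully data-driven detectors.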

4.2 Hybridization Approaches

4.2.1 Different Dimensions of Hybridization

Hybridization is the process of mixing two different solutions to gain advantage from both while somewhat compensating for their respective disadvantages. In the context of this book, this refers to the combination of model-based and data-driven approaches for radar detection. There are several different dimensions along which hybridization can be performed; the three main ones can be identified as follows:

1. The way training data is obtained;
2. The way the functional dependency between input (CUT) and output (decision) is shaped;
3. The way adaptiveness is implemented.

Regarding (1), a strict-sense definition of data-driven is that real data is collected and fed to an algorithm for processing without any modeling assumption. However, in a wider sense, such data may as well originate from a synthetic source, namely a simulator or a random generator. In the former case, a popular understanding of "data-driven" is that, implicitly, the role of the model is bypassed, with the data being the only source of information from which the detector is learned. However, a closer look reveals that the origin of the data and the role of the model in determining the shape of the detector are two distinct aspects. Indeed, one may retain the data-driven rationale, where the role of the model is loose, but use, at the same time, synthetic data; or, in the light of hybridization, one may perform some data augmentation where real and synthetic data coexist (see Section 3.2.2).

This brings us to (2), which concerns the way the input is mapped to the output. In this respect, the main difference between the model-based and data-driven paradigms is that, when a model is available for the data, some optimal detection criterion can be adopted to obtain an (exact or approximate) closed-form expression for the decision statistic (as discussed in Chapter 1). In this way, the functional dependency between CUT and decision is crystallized into an analytical relationship, and at run-time adaptiveness is gained from the current observations (from secondary data, whose role was described in Section 4.1). Under the data-driven paradigm, instead, training data under both hypotheses are used to learn the parameters of the detector, which, according to the type of adopted learning machine, can range from a few coefficients to a very large number of weights (especially in deep neural networks). In such a case, no statistical assumption about the data distribution is required and the model is bypassed, although the choice of the learning machine unavoidably implies a certain family of function approximants, albeit a much broader and vaguer one compared to model-based approaches. Thus, even considering a synthetic data source, the way the decision statistic is shaped is fundamentally different under the two paradigms. Again, it is desirable to identify hybrid solutions that can retain some of the functional characteristics of data-driven approaches while exploiting theoretical properties of a data model, in particular to gain special qualities in the designed detector, as discussed in more depth later.

As to (3), we already touched on it in Section 4.1. The key challenge for hybridization in this respect is to combine the full adaptiveness of classical model-based detectors with the additional layer provided by the selection of a training dataset for H0 but also, importantly, for H1, which is a main difference between the two paradigms. The selection of the training dataset introduces a bias that can be beneficial to the performance in operational conditions that are hard to model, but at the same time it should provide generalization capabilities to cope with unseen scenarios (see the discussion in Section 2.2.2). To this end, the dataset should be enlarged with more diverse data, also considering possible mismatches of interest under one of the two hypotheses.
If, however, the background changes significantly, the training dataset might anyway become unrepresentative of the new scenario, and retraining is needed; conversely, training a detector with a very heterogeneous dataset, aiming at incorporating a vast assortment of operational conditions, may weaken its detection performance because the two hypotheses become less separable. Hybridization in this respect bears the challenging aim of finding a satisfactory balance.

4.2.2 Hybrid Model-Based and Data-Driven Ideas in Signal Processing and Communications

Hybridization or contamination between model-based and data-driven ideas is a trend that has been progressively growing in recent years. The ultimate aim is to combine the interpretability of models with the power of data-driven learning. This touches on many different facets of signal processing and data mining, with applications to communications and, clearly, also to radar. Before dealing specifically with the latter, let us look at a few examples in other fields.

A plain general-purpose example of hybridization is the NeuralProphet time-series forecasting method, a successor to Facebook Prophet (https://neuralprophet.com/html/index.html), which was developed for explainable, scalable, and user-friendly forecasting. According to its developers, NeuralProphet is a hybrid forecasting framework where standard deep learning methods for training are combined with autoregression and covariate modules, which can be configured as classical linear regression or as neural networks to introduce context. Hybrid solutions are needed in this respect to bridge the gap between interpretable classical methods and scalable deep learning models [4].

Many signal processing, communications, and control problems have been traditionally addressed via classical statistical modeling techniques, but recently data-driven approaches have attracted significant interest. As is known, the advantage of model-based methods is that they utilize mathematical formulations representing the underlying physics, prior information, and additional domain knowledge, but (deep) learning based solutions have shown great performance in practice. Hybrid techniques that combine principled mathematical models with data-driven systems to benefit from the advantages of both approaches are reviewed in [5].

Narrowing down to the communications field, recent work is exploring different possibilities for hybridization. In fact, the design of symbol detectors in digital communication systems, via, for example, the maximum a posteriori probability (MAP) rule, has classically relied on statistical channel models that describe the relation between the transmitted symbols and the observed signal at the receiver [6]. Unlike model-based receivers, which implement a specified detection rule, machine learning based receivers learn how to map the channel outputs into the transmitted symbols from training (in a data-driven manner). In the most direct approach, the receiver processing is replaced by a DNN treated as a black box ignorant of the underlying channel model.
This leads to reliable results provided that the network is properly trained for the specific setup using a large dataset, which limits applicability in the dynamic environments commonly encountered in communications. On the other hand, machine learning schemes are independent of the underlying stochastic model and can operate efficiently in scenarios where this model is unknown or its parameters cannot be accurately estimated. Recent work [6], however, has explored a hybrid framework that combines data-driven machine learning and model-based algorithms. In this hybrid approach, well-known channel-model-based algorithms such as the Viterbi method and MIMO interference cancellation are augmented with machine learning based algorithms to remove their channel-model dependence. The resulting data-driven receivers are most suitable for systems where the underlying channel models are poorly understood, highly complex, or do not capture the underlying physics well. The key aspect is that channel-model-based computations are replaced by dedicated neural networks that can be trained from a small amount of data while keeping the general algorithm intact. Results demonstrate that these techniques can yield near-optimal performance of model-based algorithms without knowing the exact channel input-output statistical relationship and in the presence of channel state information uncertainty.

Deep unfolding, also referred to as deep unrolling, is another common strategy to combine deep learning with model-based algorithms. Unlike conventional deep neural networks, which utilize established architectures, in unfolded receivers the network structure is designed following a model-based algorithm. In other words, model-based methods are utilized as a form of domain knowledge in designing the architecture of the learning machine. The idea is to design the network to imitate the operation of a model-based iterative optimization algorithm corresponding to the considered problem. In particular, each iteration of the model-based algorithm is replaced with a dedicated layer with trainable parameters whose structure is based on the operations carried out during that iteration [7]. Compared to conventional DNNs, unfolded networks are typically more interpretable and tend to have a smaller number of parameters, and can thus be trained more quickly [8]. (More discussion on deep unfolding/unrolling is provided in Section 5.2.)
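The unfolding idea can be illustrated on a toy problem: below, gradient descent on a least-squares objective is unrolled into a fixed number of "layers," with the per-layer step sizes playing the role of the trainable parameters (here simply fixed by hand; in a real unfolded network they would be learned from data). This is a generic sketch of the principle, not the receiver architectures of [7, 8]:

```python
import numpy as np

def unfolded_solver(A, y, step_sizes):
    """Each 'layer' mirrors one gradient-descent iteration on
    ||y - A x||^2; the step sizes are the per-layer parameters that an
    unfolded network would train, while the layer structure itself is
    dictated by the model-based algorithm."""
    x = np.zeros(A.shape[1])
    for mu in step_sizes:              # one layer per unrolled iteration
        x = x - mu * A.T @ (A @ x - y)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))       # assumed measurement model
x_true = rng.standard_normal(5)
y = A @ x_true
x_hat = unfolded_solver(A, y, step_sizes=[0.02] * 100)
```

Because the number of layers is fixed and each layer reuses the model-based update, the resulting "network" has very few parameters (here, 100 scalars) compared to a generic DNN, which is the source of the interpretability and training-speed advantages mentioned above.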

4.3 Feature Spaces Based on Well-Known Statistics or Raw Data

In Section 3.3.5 we discussed how it is possible to build special feature spaces that can provide full adaptiveness (i.e., the generalized CFAR property). The idea is to use either well-known CFAR detectors, such as Kelly's and the AMF, or general maximal invariant statistics as features. The two approaches are discussed in detail below, and represent other possible forms of hybridization between the model-based and data-driven paradigms. While the considered feature spaces can be used with any classifier, in the following we focus on the KNN approach discussed in Section 2.3.1. The latter is in fact one of the simplest machine learning classifiers, since it basically performs a computation of distances with respect to a training set, followed by a count-based decision rule (e.g., majority). Moreover, it has some desirable theoretical characteristics, as discussed below.

4.3.1 Nonparametric Learning: k-Nearest Neighbors

The general detection problem consists of determining whether the feature vector x, which can be built via feature engineering or directly be (a transformed version of) raw data (see Sections 2.1.2 and 3.1), is distributed in a certain way under the H0 hypothesis or in a certain alternative way under the H1 hypothesis, which in general will depend on unknown parameters. To derive the KNN classifier, we need a training set containing representative examples of the observed data under both the H0 and H1 hypotheses. For simplicity and without loss of generality, we assume that independent observations under both H0 and H1 are available to construct the corresponding feature vectors x_i^0 = [x_i^0[1] · · · x_i^0[m]]^T, i = 1, . . . , N_T0, and x_i^1 = [x_i^1[1] · · · x_i^1[m]]^T, i = 1, . . . , N_T1. We denote by t_i = (x_i, ℓ_i), i = 1, . . . , N_T0 + N_T1, the elements of the training set T = T_0 ∪ T_1, where the "label" ℓ_i is either 0 or 1 depending on whether t_i belongs to T_0 or T_1, respectively. The following statistic is computed:

    Λ = (1/k) Σ_{i : x_i ∈ N_k(x)} ℓ_i                                (4.1)

with N_k(x) the set of the k vectors x_i closest to the test vector x according, for example, to the Euclidean norm (the k nearest neighbors of x). Finally, H0 or H1 is selected according to the decision rule

    Λ > η  ⇒  decide H1,    Λ ≤ η  ⇒  decide H0                       (4.2)

where η ∈ [0, 1) is a chosen detection threshold. Notice that Λ ∈ {0, 1/k, . . . , (k−1)/k, 1} since the ℓ_i are binary digits, and the test can be equivalently stated as "decide for H0 if the count of x_i^1 ∈ N_k(x) is smaller than a certain number (between 0 and k)." Note also that Λ is a discrete random variable; hence, different thresholds η do not always lead to different values of P_FA.

KNN is a nonparametric approach; that is, strictly speaking, there is no training phase: indeed, distances between the training set and the data under test are all computed at run-time, as they depend on the current observation even when a prerecorded or pregenerated training set is used. This makes it more computationally demanding at run-time compared to some supervised learning tools that perform the training phase offline and then implement a simple decision function at run-time. However, the advantage of KNN compared to most data-driven methods is its simple decision rule, which enables some theoretical performance assessment. In particular, it is possible to compute analytically the P_FA and the P_D in terms of the two probabilities p_0 and p_1 that are related to elementary events in the feature space. This result is fully general and does not depend on the distribution of the data (features); furthermore, it is useful to predict the achievable classification performance and offers some insights to interpret the classification process [9]. In this respect, one may choose to condition the analysis on a given training set, or also take into account the randomness of the latter and compute the average performance by marginalization over the training set distribution [9]. Nonetheless, a fundamental result [10] guarantees that, for large datasets, the misclassification rate is never more than twice that of the Bayesian


classifier (which is optimal if its underlying assumptions are met); thus, any other decision rule can cut the probability of error by at most one half. In a sense, half of the available information in an infinite collection of samples is contained in the nearest neighbor.

KNN has been successfully adopted to design different types of radar detectors. Detection of small targets on the sea surface, which usually leads to a large P_FA, can be attacked by using a modified KNN method with a controlled false alarm rate, an approach validated also on the IPIX data [11]. A similar approach is followed in [12], where the detection is realized by an anomaly detection algorithm whose decision region is determined by the hyperspherical coverage of the training set of sea clutter, followed by a KNN-based classifier with a controllable false alarm rate. In the following we develop the KNN approach for different choices of the feature vector, based on the works [9, 13–15].

4.3.2 Quasi-Whitened Raw Data as Feature Vector

For the adaptive detection problem, the input data vector is composed of both primary and secondary data, namely z and z_1, . . . , z_K. For convenience, secondary data is arranged as columns of a matrix Z (see Section 1.4.2). We recall that the general observation model for radar data is given in Section 1.1.1 and consists of N-dimensional vectors containing samples from multiple temporal pulses and/or antennas, which are coherently processed (involving a temporal, spatial, or space-time steering vector v; see also Section 1.4.2). Thus, a possibility is to use the quasi-whitened data under test as feature vector x; that is,

    x = S^(−1/2) z                                                    (4.3)

where S = Σ_{i=1}^{K} z_i z_i^† = Z Z^† is the scatter matrix introduced in Chapter 1 (K times the sample covariance matrix based on secondary data). Data can be obtained by experiments or simulations; in the latter case, the whole training set T is constructed artificially without requiring any preliminary measurement collection phase. We discuss this design procedure, whose scheme is depicted in Figure 4.1, in particular since it represents a hybrid approach where modeling assumptions are required at the design stage, but the detector operates in a data-driven fashion and can also deal with real-world data (despite being trained on synthetic data). If a Gaussian environment is expected, the data of the training set under H0 (i.e., x_i^0, i = 1, . . . , N_T0) are pseudorandomly generated assuming z, z_1, . . . , z_K are CN(0, R_design). Similarly, the data of the training set under H1 (i.e., x_i^1, i = 1, . . . , N_T1) are pseudorandomly generated in the same way, but z has mean αv based on a preassigned value of the nominal SNR, defined as usual as SNR = |α|^2 v^† R_design^(−1) v (e.g., 15 dB).

Figure 4.1 General scheme for KNN-based detectors based on (whitened) raw data in Gaussian and non-Gaussian environments.

As to the preassigned (i.e., design) covariance matrix R_design, any choice can be made.4 In general, the performance assessment has to be conducted by operating the detector at input SNRs different from the design value adopted in the synthetic training set.5

The KNN-based detector discussed above can achieve a very high P_D even for reduced SNR; that is, it is much more powerful than classical detectors, including Kelly's detector, which requires more than 4 dB of extra SNR to achieve the same performance [9]. The behavior under mismatched conditions is also interesting, since the KNN-based detector can be more robust than Kelly's receiver but less robust than the AMF. The price for this excellent performance is the loss of the (generalized) CFAR property with respect to the overall covariance matrix of the disturbance. However, it can be shown that P(Λ > η) under H0 is independent of the power of z; that is, the KNN detector possesses the CFAR property with respect to the power of the data under test [9]. Moreover, the P_FA is not very sensitive to variations of the one-lag correlation coefficient of the covariance matrix of the actual data under test.

4. For instance, it can be chosen as the sum of a thermal noise component plus a clutter covariance matrix (any model can be adopted), with the clutter-to-noise power ratio set equal to a reasonable value according to the expected conditions, typically a few tens of decibels.
5. This is in general a good practice that should be adopted for any data-driven detector, as done, for example, in [16] to highlight the robustness of neural networks to SNR deviations.
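The overall design scheme of Figure 4.1 can be sketched in a few lines of numpy. This is our own toy rendition under explicit assumptions (an exponentially shaped design covariance with one-lag coefficient 0.9, a constant steering vector, a 15-dB design SNR, and Euclidean-distance KNN per (4.1)-(4.2)); it is not the exact implementation evaluated in [9]:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 24        # data dimension and number of secondary cells (assumed)
k, eta = 9, 0.5     # KNN parameters: neighbors and threshold in [0, 1)

# Assumed design quantities: exponentially shaped clutter covariance
# (one-lag coefficient 0.9) and a constant steering vector.
R = 0.9 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
L = np.linalg.cholesky(R)
v = np.ones(N, dtype=complex)

def draw():
    """One CN(0, R_design) disturbance realization."""
    return L @ (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

def feature(z, Z):
    """Eq. (4.3): quasi-whitened CUT x = S^(-1/2) z, with S = Z Z^H."""
    S = Z @ Z.conj().T
    w, U = np.linalg.eigh(S)          # S is Hermitian positive definite
    return (U @ np.diag(w ** -0.5) @ U.conj().T) @ z

def alpha_for_snr(snr_db):
    """Amplitude alpha such that SNR = |alpha|^2 v^H R_design^{-1} v."""
    q = np.real(v.conj() @ np.linalg.solve(R, v))
    return np.sqrt(10.0 ** (snr_db / 10.0) / q)

a = alpha_for_snr(15.0)               # 15-dB design SNR, as in the text

def example(h1):
    """One synthetic training feature under H1 (h1=True) or H0."""
    Z = np.stack([draw() for _ in range(K)], axis=1)
    z = draw() + (a * v if h1 else 0.0)
    return feature(z, Z)

feats = np.stack([example(False) for _ in range(200)]
                 + [example(True) for _ in range(200)])
labels = np.concatenate([np.zeros(200), np.ones(200)])

def knn_decide(x):
    """Eqs. (4.1)-(4.2): mean label of the k nearest neighbors vs eta."""
    d = np.linalg.norm(feats - x, axis=1)
    return labels[np.argsort(d)[:k]].mean() > eta
```

A new CUT is then classified by forming its quasi-whitened feature via (4.3) and calling knn_decide; the threshold eta trades P_FA against P_D along the discrete values of the statistic.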


Results on IPIX data show that the advantage of the KNN-based detector also remains when it is tested on real-world radar data, even data deviating from the Gaussian assumption. This remarkably means that, despite the fact that the detector is obtained through synthetic data, its functional shape captures properties that are important for improving the performance in a way that may not be possible through a strictly model-based CFAR design; thus, it appears as a good prototype for what a hybrid design approach should deliver. The loss of the generalized CFAR property is a major theoretical drawback, but the amount of loss is what matters from a practical (industrial) point of view. Although it has been observed that the P_FA of the KNN-based detector is not very sensitive to parameter variations (i.e., the loss of theoretical CFARness is tolerable in practice), this should be checked on a case-by-case basis. In the next section, we will show that such a property can be recovered exactly by using a different feature vector; in the latter case, the same power as Kelly's detector can be obtained under matched conditions (but not higher, as was instead seen with raw data), while the level of robustness can be controlled by tuning a design parameter. To conclude this section, we mention how the (quasi-)whitened raw data approach can be modified to cope with non-Gaussian scenarios.
As before, training data may also be obtained synthetically by following a hybrid design approach; to this end, any of the non-Gaussian clutter models can be adopted, such as the K-distributed SIRV where the clutter is modeled as s·n, with n ∼ CN(0, R_design) and the squared texture values s² ruled by a gamma distribution with unit mean and shape parameter ν_design > 0 [17] (this encompasses the Gaussian model for large values of the shape parameter).6 Such a hybrid approach has a practical advantage, since it avoids real-world data collection and enables control of the characteristics of the training set to promote certain properties in the detector. To obtain an adaptive receiver for this non-Gaussian model, it is possible to use the transformed raw data as a feature vector [13]:

x = S_norm^(−1/2) z / √( (1/(N−1)) ‖P_v^⊥ z‖² )    (4.4)

where P_v^⊥ = I_N − vv†/‖v‖² is the orthogonal projector onto the orthogonal complement of the one-dimensional space spanned by the steering vector v and

S_norm = (N/K) Σ_{i=1}^{K} (z_i z_i†)/(z_i† z_i)    (4.5)

6. For more details on non-Gaussian clutter modeling and related detectors, see [18, Ch. 7].
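The two key properties of this construction can be checked numerically. The sketch below (dimensions and steering vector are illustrative assumptions) implements (4.5) and (4.4) in NumPy and verifies that S_norm is insensitive to per-cell power (texture) rescalings of the secondary data, and that the feature vector x is insensitive to the power of the cell under test.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 24   # illustrative dimensions

def snorm(Z):
    """Normalized sample covariance matrix (4.5)."""
    n, k = Z.shape
    S = sum(np.outer(Z[:, i], Z[:, i].conj()) / np.real(Z[:, i].conj() @ Z[:, i])
            for i in range(k))
    return (n / k) * S

Z = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
S = snorm(Z)

# rescaling each secondary vector by a random texture leaves S_norm unchanged
tex = rng.gamma(2.0, size=K)
assert np.allclose(S, snorm(Z * np.sqrt(tex)))

v = np.ones(N) / np.sqrt(N)                    # toy steering vector (assumption)
Pperp = np.eye(N) - np.outer(v, v.conj()) / np.real(v.conj() @ v)

def feature(z, S):
    """Transformed raw data (4.4): quasi-whitening plus power normalization."""
    w, U = np.linalg.eigh(S)                   # S_norm is Hermitian positive definite
    S_isqrt = U @ np.diag(w ** -0.5) @ U.conj().T
    return S_isqrt @ z / np.sqrt(np.real(z.conj() @ Pperp @ z) / (N - 1))

z = rng.standard_normal(N) + 1j * rng.standard_normal(N)
# the feature vector is insensitive to the power of the data under test
assert np.allclose(feature(z, S), feature(5.0 * z, S))
```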

Coluccia: “chapter4_v2” — 2022/10/7 — 13:07 — page 146 — #10

Hybrid Model-Based and Data-Driven Detection

147

is the normalized sample covariance matrix [19]. The main idea consists in whitening the raw data z with S_norm, up to a proper normalization factor introduced to make the resulting statistic insensitive to the power of the underlying disturbance (under H0). As for the Gaussian case, the resulting KNN-based detector does not strictly possess the CFAR property, but it has been shown that the detector is quite robust to actual values of ν and R different from the design values ν_design and R_design, and can provide a significant gain compared to classical detectors [13].

4.3.3 Well-Known CFAR Statistics as a Feature Vector

Instead of using transformed raw data, a different approach is to adopt a feature vector x obtained by stacking (compressed) statistics of some well-known CFAR detectors. This yields, by design, a data-driven detector that also exhibits the generalized CFAR property. There are several ways to implement this idea. The most basic one is to just pick the statistics of some well-known detectors to compose the feature vector. This yields a specific data representation in the corresponding feature space: an example is shown in Figure 4.2 for a Gaussian background, using Kelly's, AMF, and adaptive energy detector statistics, thus obtaining a 3-D space. Data are shown under different conditions, namely H0 (dots), H1 under matched conditions (crosses), and H1 under mismatched conditions (triangles and circles for two different values of cos²θ). It is clearly visible that the H0 points cluster together and are well separated from the H1 ones. This means that even a linear classifier such as an SVM can discriminate the two hypotheses. However, it is also apparent that the clusters under mismatched conditions largely overlap, which makes it problematic to obtain a desired robust or selective behavior.
The reason is that such features are correlated and both the AMF and the ED are robust detectors, hence they might not be well suited to discriminate the different mismatched conditions (although a projection onto a larger-dimensional space, e.g., by means of a kernel, could lead to some improvement). To gain more flexibility, it is possible to introduce a set of parameters that provide additional degrees of freedom. Consider for instance Kelly's detector and the ACE, with the goal of obtaining a selective receiver with reduced loss under matched conditions; a feature vector can be defined as

x = [ d₁ t_ACE(z, S)   d₂ t_Kelly(z, S) ]ᵀ    (4.6)

or other equivalent forms, with d₁ and d₂ arbitrary (nonnegative) parameters. A proper choice of d₁, d₂ allows one to obtain PD performance (under matched conditions) comparable to Kelly's detector for high SNR values, but better performance for weak signals (i.e., a reduced loss in comparison to the ACE
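The statistics stacked into such feature vectors are computed directly from the cell under test and the secondary data. A minimal sketch follows, using the standard (textbook) forms of the Kelly, AMF, and ACE statistics; note that normalization conventions vary across references, and the steering vector, dimensions, and weights d₁, d₂ below are illustrative assumptions.

```python
import numpy as np

def cfar_statistics(z, Z, v):
    """Kelly, AMF, and ACE statistics in standard form (normalization conventions vary)."""
    S = Z @ Z.conj().T                 # K times the sample covariance of the secondary data
    Si = np.linalg.inv(S)
    s1 = np.real(z.conj() @ Si @ z)                                   # adaptive energy
    s2 = np.abs(z.conj() @ Si @ v) ** 2 / np.real(v.conj() @ Si @ v)  # AMF statistic
    return {"kelly": s2 / (1 + s1), "amf": s2, "ace": s2 / s1}

rng = np.random.default_rng(2)
N, K = 8, 24
v = np.ones(N) / np.sqrt(N)           # toy steering vector (assumption)
z = rng.standard_normal(N) + 1j * rng.standard_normal(N)
Z = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))

t = cfar_statistics(z, Z, v)
d1, d2 = 1.0, 0.5                     # illustrative nonnegative weights
x = np.array([d1 * t["ace"], d2 * t["kelly"]])   # feature vector in the spirit of (4.6)
```

By the Cauchy–Schwarz inequality, s₂ ≤ s₁, so the ACE statistic never exceeds 1 and Kelly's statistic never exceeds the AMF one.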

Coluccia: “chapter4_v2” — 2022/10/7 — 13:07 — page 147 — #11


Figure 4.2 Data clusters in a 3-D feature space obtained from well-known CFAR detectors, using N = 16, K = 2N, PFA = 10⁻⁴, a first-order autoregressive covariance matrix with one-lag correlation coefficient ρ = 0.95, and v a temporal steering vector with normalized Doppler frequency νT = 0.08.

if d₁ > d₂). It is apparent that one can set d₂ = 1 without loss of generality; then, considering in general a non-Gaussian background, the optimal choice of d₁ depends on the clutter shape parameter ν [15]. A similar approach can be followed to obtain a robust detector by defining the feature vector as composed of Kelly's detector and the AMF. Several alternative formulations are possible: for instance, the choice in [14] was

x = [ d₁ t̃(z, S)   d₂ t_AMF(z, S) ]ᵀ    (4.7)

where (hereafter omitting the notational dependency on z and S)

t̃ = t_Kelly / (1 − t_Kelly)    (4.8)

is a reparametrization of Kelly's statistic (1.54). Results in [14] for a Gaussian background show that for d₁ = 1 and d₂ = 0.7 the detector achieves the same



benchmark performance of Kelly's detector under matched conditions and is almost as robust as the AMF. The level of robustness can be adjusted by tuning d₂, while keeping the same PD as Kelly's detector under matched conditions. It is important to notice that all the above statistics can be decomposed in terms of the two statistics t̃ (given in (4.8)) and [20, 21]

β = 1 / ( 1 + z†S⁻¹z − |z†S⁻¹v|² / (v†S⁻¹v) ).    (4.9)

For instance,

t_AMF = t̃/β,    t_ACE = t̃/(1 − β).    (4.10)
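These identities can be checked numerically. In the sketch below (illustrative dimensions and steering vector), t̃/β recovers the AMF statistic exactly, while t̃/(1 − β) equals s₂/(s₁ − s₂), a monotonically equivalent form of the usual ACE ratio s₂/s₁, which is the convention behind (4.10).

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 8, 24
v = np.ones(N) / np.sqrt(N)           # toy steering vector (assumption)
z = rng.standard_normal(N) + 1j * rng.standard_normal(N)
Z = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
Si = np.linalg.inv(Z @ Z.conj().T)

s1 = np.real(z.conj() @ Si @ z)
s2 = np.abs(z.conj() @ Si @ v) ** 2 / np.real(v.conj() @ Si @ v)

t_kelly = s2 / (1 + s1)
t_tilde = t_kelly / (1 - t_kelly)     # reparametrization (4.8)
beta = 1 / (1 + s1 - s2)              # maximal invariant coordinate (4.9)

assert np.isclose(t_tilde / beta, s2)                     # (4.10): the AMF statistic
assert np.isclose(t_tilde / (1 - beta), s2 / (s1 - s2))   # monotone form of the ACE
```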

This important property stems from the general theory of invariant detection [22, 23], which seeks detection statistics that are invariant under a group of suitable transformations desirable for the problem at hand. This leads to the special property that any group-invariant function of the data depends on the latter through some specific compressed statistics called maximal invariant statistics [24–27].7 A typical property of the detectors derived under such a framework is the (generalized) CFAR property (with respect to the unknown covariance matrix of the data). Figure 4.3 illustrates this alternative data representation in terms of maximal invariant statistics as points in a 2-D space given by (β, t̃) coordinates. It can be seen that the H0 and H1 clusters are differently spread over this plane, with the mismatched cluster better separated from the matched one. Using (4.10), it is possible to draw the curve corresponding to the AMF detector, which turns out to be a line with positive slope; Kelly's detector is instead a horizontal line. Since dependency on t̃ and β is a general property of most well-known CFAR detectors for Gaussian environments, decision boundaries can be easily derived and interpreted as linear or nonlinear classifiers, an aspect that will be discussed in more detail in Section 4.4; for the moment we can notice that the decision boundary of the KNN-based detector above has a nonlinear shape somewhere between Kelly's and AMF lines, which explains its interesting win-win behavior in terms of PD (under matched conditions) and robustness under mismatched conditions.

7. As discussed in Section 1.3.2, Bose and Steinhardt proved, under an invariant-theoretic framework, that no UMP invariant test exists for the general adaptive radar detection problem [24], which calls for the development of more or less specialized model-based detectors for the different scenarios at hand; see, for example, [18, 25, 28, 29].
This likewise motivates the search for data-driven or hybrid detectors that can provide better performance.



Figure 4.3 Data clusters (as in Figure 4.2) represented in a two-dimensional feature space given by the t̃ and β statistics in (4.8) and (4.9), and decision boundaries of Kelly's, AMF, and KNN-based detectors. PFA = 0.0084, while the other parameters are as in Figure 4.2.

The previous design strategy can be formalized in a more general setting by considering a feature vector x with the following structure:

x = [ d₁ t̃ b[1]   d₂ t̃ b[2]   · · ·   d_n t̃ b[n] ]ᵀ    (4.11)

where b[j] = f_j(β), j = 1, . . . , n, denotes an arbitrary (nonlinear) function of the β statistic, hence depending on the primary and secondary data z and Z (or S), and {d₁, . . . , d_n} is an arbitrary set of nonnegative numbers. Considering thus the general structure of x in (4.11), it turns out that the probabilities p₀ and p₁ mentioned earlier can be expressed in closed form, which allows computation of PFA and PD for the resulting KNN-based detectors [9]. Based on that, as for model-based CFAR detectors, PFA depends upon the SNR value used to generate the training set under H1, but is otherwise independent of the actual covariance matrix R of z (possibly different from the design value R_design); that is, the detector is CFAR. In addition, PD depends upon the design value of the SNR as well as on the actual SNR (using the possibly mismatched steering vector) and the cosine squared of the angle between the whitened versions of the nominal and actual steering vectors; that is:

γ = |α|² p†R⁻¹p   and   cos²θ = |p†R⁻¹v|² / ( (p†R⁻¹p)(v†R⁻¹v) ).    (4.12)
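The two parameters in (4.12) are straightforward to compute given the actual signature p, the nominal steering vector v, and the disturbance covariance R. In the sketch below the AR(1) covariance and the two temporal signatures are illustrative assumptions; the checks confirm that cos²θ = 1 under matched conditions and lies in (0, 1] in general (Cauchy–Schwarz in the R⁻¹ inner product).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 8
rho = 0.9                            # illustrative AR(1) disturbance covariance
R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
Ri = np.linalg.inv(R)

def gamma_cos2(alpha, p, v):
    """Actual SNR and whitened steering-vector mismatch, as in (4.12)."""
    g = np.abs(alpha) ** 2 * np.real(p.conj() @ Ri @ p)
    c2 = np.abs(p.conj() @ Ri @ v) ** 2 / (
        np.real(p.conj() @ Ri @ p) * np.real(v.conj() @ Ri @ v))
    return g, c2

v = np.exp(1j * np.pi * 0.20 * np.arange(N)) / np.sqrt(N)  # nominal steering vector
p = np.exp(1j * np.pi * 0.25 * np.arange(N)) / np.sqrt(N)  # actual (mismatched) signature

g, c2 = gamma_cos2(1.0, p, v)
_, c2_matched = gamma_cos2(1.0, v, v)
```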



This shows that, although the hybrid design strategy (discussed below) is different from the traditional model-based approach to radar detection (as well as from a purely data-driven approach), it is still possible to obtain a receiver with desirable properties, in particular CFARness; moreover, the performance in terms of PD depends on the classical parameters γ and cos²θ, as in most well-known model-based detectors discussed in Chapter 1.

4.4 Rethinking Model-Based Detection in a CFAR Feature Space

4.4.1 Maximal Invariant Feature Space

The approach discussed in the previous section has profound relationships with the general theory of maximal invariant statistics. As mentioned, in fact, it is possible to reparametrize the most widely used CFAR detectors in terms of the two variables β and t̃ given in (4.8) and (4.9), which are the maximal invariant statistics for Gaussian disturbance. This leads to a natural two-dimensional feature space for CFAR detection. By following this rationale, we can take a fresh look at the problem of designing and analyzing CFAR detectors with desired robust or selective behavior [30]. Observation data (primary data z from the CUT, and secondary data z₁, . . . , z_K from neighboring cells) compressed through the maximal invariant statistics result in point clusters in the β-t̃ plane, referred to as the CFAR feature plane (CFAR-FP). Then, the detection problem is to separate the H0 cluster from the union of the matched and mismatched H1 clusters, if robustness is of interest, or the union of the H0 and mismatched-H1 clusters from the matched-H1 cluster, if selectivity is desired. In the CFAR-FP, a reinterpretation of model-based detectors is possible, since decision boundaries in such a space can be easily derived and interpreted as linear or nonlinear classifiers. This is covered later in Section 4.4.2. Beforehand, we further develop the rethinking of invariant detectors as learning machines that guarantee the CFAR property through suitable processing by analyzing the mapping chain

{z, Z = [z₁ · · · z_K]} → (z, S⁻¹) → (s₁, s₂) → (β, t̃)    (4.13)

where (s₁, s₂) = (z†S⁻¹z, |z†S⁻¹v|²/(v†S⁻¹v)) = (t_ED, t_AMF) (ref. (1.48) and (1.44)), and

(β, t̃) = ( 1/(1 + s₁ − s₂),  s₂/(1 + s₁ − s₂) ) ∈ (0, 1) × (0, +∞).    (4.14)
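The CFAR-inducing nature of the chain can be verified directly: scaling all of the data (primary and secondary) by a common factor, i.e., changing the disturbance power, leaves (β, t̃) untouched. The sketch below uses illustrative dimensions and a toy steering vector.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 8, 24
v = np.ones(N) / np.sqrt(N)          # toy steering vector (assumption)

def beta_ttilde(z, Z):
    """Mapping chain (4.13)-(4.14): raw data -> (s1, s2) -> (beta, t_tilde)."""
    Si = np.linalg.inv(Z @ Z.conj().T)
    s1 = np.real(z.conj() @ Si @ z)
    s2 = np.abs(z.conj() @ Si @ v) ** 2 / np.real(v.conj() @ Si @ v)
    return 1 / (1 + s1 - s2), s2 / (1 + s1 - s2)

z = rng.standard_normal(N) + 1j * rng.standard_normal(N)
Z = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))

b1, t1 = beta_ttilde(z, Z)
b2, t2 = beta_ttilde(10.0 * z, 10.0 * Z)   # common power scaling of all the data
assert np.allclose([b1, t1], [b2, t2])     # features are scale-invariant
```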

The mapping chain (4.13) can be interpreted as layers of a learning machine with a peculiar structure, in the spirit of [31]. This provides a general multilayer scheme that guarantees the CFAR property, as depicted in Figure 4.4. After the input layer with the raw data {z, z1 , . . . , zK }, the first two hidden layers serve as



an invariant feature extractor [31]: to guarantee the CFAR constraint, the first hidden layer rules out any use of the secondary data other than the construction of the sample matrix inverse S⁻¹, while the second hidden layer rules out any data compression function different from a specific transformation involving the two statistics s₁ and s₂.8 The third hidden layer is just an invertible mapping to the (β, t̃) variables, which are mathematically convenient to capture invariant properties of the data; they represent the features, in machine learning terminology.9 Note that the use of restricted network architectures is a leitmotif in machine learning, aimed at promoting certain specific properties starting from the very generic fully connected architecture. For instance, a CNN makes use of convolutional layers that act locally, and then implements max pooling to progressively reduce the number of distinct connections as the data progress through the layers toward the output. Local receptive fields are in general introduced in neural networks to specialize the processing [34]. Other examples are RNNs, feature pyramid networks (FPNs), and residual neural networks (ResNets). Weight sharing is also a possible strategy to restrict the architecture of a learning machine, reducing the number of degrees of freedom. Returning to the architecture of Figure 4.4 implementing (4.13), the fourth layer implements the actual decision function determining the peculiar behavior of the detector. Classically, the design of this layer is unsupervised and typically based on hypothesis testing tools such as the GLRT, which yields the decision statistic. As seen in Chapter 3, however, machine learning techniques provide alternative (often supervised) ways to design radar detectors. Both approaches are subsumed in the fourth layer, hence the CFAR-FP represents a unified framework.
As a final note, although only the Gaussian case is addressed here (2-D feature space β-t̃), different setups can be considered. For instance, under the persymmetric assumption, a 4-D feature space can be obtained by considering a maximal invariant statistic based on Kelly's GLRT for the persymmetric case, the persymmetric AMF, and the two eigenvalues of a suitably transformed data submatrix [35]. For the case of a partially homogeneous environment, again under the persymmetric assumption, a 3-D feature space can be obtained by considering

8. These can be interpreted as the energy (norm) of the received signal vector (s₁) and its scalar product with the energy-normalized steering vector (s₂).
9. This is essentially a reparametrization of the relationship between the maximal invariants (t_Kelly, t_AMF) identified in [24] and the two statistics (β, t̃). Although other equivalent maximal invariants might be adopted as features, such as (t_Kelly, t_AMF) themselves, (s₁, s₂), or the one in [25] (for more details, see [30]), the choice (β, t̃) is a convenient one. Such a statistical characterization of the maximal invariant has been used over the years to derive PFA and PD formulas for several detectors, also under mismatched conditions. In Section 4.4.2 it will be shown that it can also be very useful for both analysis and design purposes, thanks to the fact that the two statistics (β, t̃) are independent under H0, with well-known distributions [21, 32, 33], and one of them is equivalent to Kelly's detector, which is considered the benchmark for matched conditions.



Figure 4.4 Layered representation, from raw data to feature space (β, t̃), of the CFAR processing chain of a detector X with statistic tX and threshold ηX.

a maximal invariant statistic based on three normalized eigenvalues extracted from two suitably transformed data submatrices [36]. Higher-dimensional spaces are obtained for range-spread targets and/or non-Gaussian clutter [37].10 Finally, for the problem of detecting an unknown rank-one signal in white noise, which can model uncertainty in the steering vector, the maximal invariant is given by l − 1 scaled eigenvalues, where l corresponds to the number of beams or array sensors (in the particular case l = 2, a 1-D feature space consisting of a single line is obtained, divided into two parts by the detection threshold) [38]. Thus, the CFAR-FP framework is general and can be extended to maximal invariant feature spaces of arbitrary dimensions.

4.4.2 Characterizing Model-Based Detectors in CFAR-FP

In Chapter 1 we discussed several model-based design ideas based on statistical tests with modified hypotheses, asymptotic arguments, approximations, and ad hoc strategies to promote robustness or selectivity in radar detectors. In Chapter 3 we added data-driven approaches, and in this chapter we have presented different hybrid strategies. The discussed CFAR-FP framework represents a further hybrid

10. As already mentioned, the reader is referred to [18, Ch. 9] and [18, Ch. 7] for range-spread targets and non-Gaussian clutter modeling and related detectors, respectively.



Table 4.1
Classification Region Boundaries in the CFAR-FP (β-t̃ Plane)

Kelly's detector [32]:   t̃ = ηK
AMF [39]:                t̃ = ηAMF β
ACE [40]:                t̃ = −(ηACE/(1 − ηACE)) β + ηACE/(1 − ηACE)
ED:                      t̃ = (ηED + 1) β − 1
Kalson's detector [41]:  t̃ = ((1 − εKalson) ηKalson/(1 − ηKalson)) β + εKalson ηKalson/(1 − ηKalson)   (0 ≤ εKalson ≤ 1)
ABORT [42]:              t̃ = −β + ηA/(1 − ηA)
W-ABORT [43]:            t̃ = ηWA/β − 1
Rao's test [44]:         t̃ = ηRao/(β − ηRao)
ROB [45]:                nonlinear in β for β ∈ (0, 1 − 1/ζ], saturating to t̃ = ηROB − 1 for β ∈ [1 − 1/ζ, 1), where ζ = ((K + 1)/N)(1 + εROB) with εROB ≥ 0 (see [45] for the full expression of the nonlinear branch)
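The boundaries in Table 4.1 can be evaluated directly as classifiers over the (β, t̃) plane. The sketch below (illustrative, uncalibrated thresholds; in practice each η is set to meet a chosen PFA) shows how a matched-looking point is accepted by all detectors, while a mismatched-looking point with small β is kept by the robust AMF but rejected by the selective ACE.

```python
import numpy as np

def above_boundary(beta, t_tilde, detector, eta):
    """H1 iff the (beta, t_tilde) point lies above the detector's curve from Table 4.1."""
    if detector == "kelly":
        curve = eta                                  # horizontal line
    elif detector == "amf":
        curve = eta * beta                           # positive slope: robust
    elif detector == "ace":
        curve = (eta / (1 - eta)) * (1 - beta)       # negative slope: selective
    elif detector == "wabort":
        curve = eta / beta - 1                       # hyperbolic
    else:
        raise ValueError(detector)
    return t_tilde > curve

# a matched-looking point (large beta, large t_tilde) is detected by all of them
assert above_boundary(0.9, 2.0, "kelly", 0.7)
assert above_boundary(0.9, 2.0, "amf", 1.5)
assert above_boundary(0.9, 2.0, "wabort", 1.0)   # curve = 1/0.9 - 1 ~ 0.11
# a mismatched-looking point (small beta): kept by the AMF, rejected by the ACE
assert above_boundary(0.3, 0.6, "amf", 1.5)      # 0.6 > 1.5 * 0.3 = 0.45
assert not above_boundary(0.3, 0.6, "ace", 0.5)  # 0.6 < (0.5/0.5) * (1 - 0.3) = 0.7
```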

design approach that combines model-based features with a data-driven detection approach to obtain CFAR detectors with prescribed characteristics in terms of robustness or selectivity. While design strategies in the CFAR-FP will be discussed in detail in Section 4.4.3, this framework also provides a new look at classical model-based detectors, as illustrated in the following. Given a well-known CFAR detector X, because of the maximal invariant theory, its detection statistic tX(z, S) can often be rewritten in terms of β and t̃ only (assuming a Gaussian background); then, by studying the equation tX(β, t̃) = ηX, where ηX is the threshold for a chosen PFA, a decision region in the CFAR-FP can be identified, possibly depending on some tunable parameter(s) εX. Several cases are discussed in [30], and some are listed in Table 4.1. For instance, the decision boundaries associated with tKelly = ηKelly, tAMF = ηAMF, and tACE = ηACE are straight lines with zero (horizontal line), positive, and negative slopes, respectively. Insights can be gained by studying such decision boundaries.11 Figure 4.5 shows the point clusters obtained by generating the random variable z under three

11. For instance, recalling (4.10), the crossing point between the Kelly and AMF lines can be easily obtained by solving the system

t̃ = ηKelly,   t̃ = ηAMF β    (4.15)


Figure 4.5 H0 or H1 data in the CFAR-FP under both matched and mismatched conditions. The parameters are those in Figures 4.2 and 4.3 but for PFA = 10⁻⁴, and two different mismatches cos²θ = 0.83 and cos²θ = 0.65 are considered.

different conditions—H0, H1 under matched conditions, and H1 under mismatched conditions, according to the values of the relevant parameters for the Gaussian environment, which as said are the actual SNR and mismatch level given in (4.12)—and Z as usual under the null hypothesis, then mapping the raw data to the β-t̃ plane. More generally, Table 4.1 shows that Kelly's, AMF, ACE, ED, Kalson's, and ABORT detectors are all linear classifiers in the CFAR-FP: robust detectors have a positive slope, selective ones have a negative slope, and Kelly's detector is a horizontal line (zero slope). The corresponding decision boundaries are depicted in Figure 4.6 using the same parameters as for Figure 4.5. Note in Figure 4.5 that the position, shape, and orientation of the clusters change with γ and cos²θ; in this respect, the positive or negative slope appearing in Figure 4.6 is suitable to classify mismatched points as H1 or H0, respectively, trading off to some extent the achievable PD (under matched conditions) for the same SNR. Similarly, a horizontal line can effectively separate the H0 cluster, which looks horizontally spread, from any H1 cluster (under matched conditions), which lies in the upper part of the CFAR-FP. A precise analytical characterization of the trajectories and

which returns β = ηKelly/ηAMF; as an example, after setting the thresholds for PFA = 10⁻⁴ (see Figure 4.5) one gets ηKelly ≈ 0.41, hence an intercept η̃ = ηKelly/(1 − ηKelly) ≈ 0.69, a slope ηAMF ≈ 1.48, and finally an abscissa of the intersection point ≈ 0.69/1.48 ≈ 0.46. By comparison, in Figure 4.3 the crossing point is slightly higher, ≈ 0.32/0.63 ≈ 0.51, due to the higher PFA.



Figure 4.6 Decision boundaries of some well-known detectors in the CFAR-FP, from Table 4.1.

shapes of the clusters parameterized by γ and cos²θ is possible [30] by exploiting the statistical distributions of the variables t̃ and β; this turns out to be very useful for both analysis and design purposes.12 While design strategies will be discussed in Section 4.4.3, we summarize in the following a few insights from the analytical characterization provided in [30]. Point clusters in the CFAR-FP exhibit in general a slanted elliptical shape whose axes can be obtained from the mean and covariance matrix of the vector [β t̃]ᵀ, thereby enabling understanding of the cluster positions as a function of γ and cos²θ (besides N and K). The cluster under H0 has a major axis parallel to the abscissas and a quite compressed minor axis (for typical values of K and N); the resulting shape is thus a horizontally stretched ellipse, as visible in Figure 4.7

12. Such a statistical characterization has often been exploited to derive analytical formulas for the performance of CFAR detectors. For instance, the PFA of Kelly's detector can be evaluated as

PFA(ηKelly) = P[tKelly > ηKelly | H0] = P[t̃ > η̃ | H0] = 1/(1 + η̃)^(K−N+1)    (4.16)

where η̃ = ηKelly/(1 − ηKelly). The general characterization provided in [46] (see also [20, 21, 32, 33, 47]) is that t̃ given β follows a complex noncentral F distribution with 1 and K − N + 1 complex degrees of freedom and noncentrality parameter γβ cos²θ, while β follows a complex noncentral beta distribution with K − N + 2 and N − 1 complex degrees of freedom and noncentrality parameter γ(1 − cos²θ). This characterization, parameterized in γ and cos²θ, also encompasses H0 (for γ = 0, i.e., SNR = −∞ dB) and H1 under matched conditions (for cos²θ = 1).
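Formula (4.16) can be inverted in closed form to set the threshold for a desired false alarm rate; the threshold in the original tKelly domain then follows as ηKelly = η̃/(1 + η̃). A minimal sketch:

```python
def pfa_kelly(eta_tilde, N, K):
    """(4.16): PFA of Kelly's detector with the threshold expressed in the t_tilde domain."""
    return (1.0 + eta_tilde) ** (-(K - N + 1))

def ttilde_threshold(pfa, N, K):
    """Invert (4.16) to obtain the t_tilde threshold for a desired PFA."""
    return pfa ** (-1.0 / (K - N + 1)) - 1.0

N, K = 16, 32                       # parameters used in the chapter's examples
eta = ttilde_threshold(1e-4, N, K)
assert abs(pfa_kelly(eta, N, K) - 1e-4) < 1e-12   # exact round trip
```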


Figure 4.7 Shapes and iso-SNR trajectories of the cluster ellipses for different SNR and cos²θ, for the same clusters of Figure 4.5.

(as well as Figures 4.5 and 4.6). By increasing the number of secondary data K, the H0 cluster shrinks and its center migrates toward the bottom-right corner of the CFAR-FP. This obviously reflects the greater available information that makes the H0 and H1 clusters better separated. The position and shape of the clusters in the CFAR-FP depend on the actual SNR γ and level of mismatch cos²θ. Specifically, iso-SNR trajectories (fixed γ, as a function of cos²θ) can be drawn, which describe how the center of the cluster migrates under different conditions, while the shape changes (see solid lines in Figure 4.7). Iso-mismatch curves (fixed cos²θ, as a function of γ) can be defined analogously. For increasing γ, the H1 cluster under matched conditions migrates vertically, and its minor axis expands while rotating counterclockwise, as visible in Figure 4.7 (see [30] for an analytical proof). Other receivers have a nonlinear decision boundary in the CFAR-FP, which can be useful to achieve a different trade-off between PD (under matched conditions) and behavior under mismatched conditions. The W-ABORT, for instance (see Table 4.1), has a hyperbolic boundary t̃ ∝ 1/β, as does Rao's test [44]. Indeed, a nonlinear curve can follow more closely the cluster shapes, hence it may better isolate the H0 cluster from the union of the matched and mismatched H1 clusters if robustness is of interest, or the union of the H0 and mismatched-H1 clusters from the matched-H1 cluster if selectivity is desired.



As discussed in Chapter 1, enhancing either robustness or selectivity often comes at the price of a certain PD loss under matched conditions compared to Kelly's receiver, with different trade-offs. W-ABORT, for instance, has strong selectivity, which however comes at the price of a reduced detection power under matched conditions. Some detectors can achieve a diversified trade-off through a tuning parameter, which controls the decision curve (e.g., the slope for a linear boundary or the curvature for a hyperbola-like one). The case of the random-signal robustified (ROB) detector is interesting, since its decision region is nonlinear until β = 1 − (N/(K + 1))(1 + εROB)⁻¹, then saturates to a constant (horizontal line).13 This two-region behavior explains the capability (observed in [45]) to achieve the same PD of Kelly's detector and at the same time achieve strong robustness.14 Finally, the CFAR-FP offers a direct interpretation of the locally most powerful invariant (MPI) detector discussed in [25], and allows for its parametric generalization, which unlike the MPI detector is implementable and can be tuned to steer its behavior toward selectivity or robustness [30].

13. For example, the transition point is located at β ≈ 0.6 for the parameters in the example of Figure 4.6 (where εROB = 0.2).
14. In fact, the benchmark performance of Kelly's detector can be explained by its horizontal decision boundary, a peculiarity among detectors with linear decision boundary. This tends to well separate the H0 cluster from any H1 cluster (under matched conditions). On the other hand, Kelly's detector follows a GLRT design that does not take into account the possibility of a mismatch in its hypotheses, as opposed to other model-based detectors, which in fact exhibit an oblique-linear or even nonlinear boundary. From the analysis of the clusters' positions in [30], it follows that a large enough value of β indicates that the data under test is unlikely to belong to any mismatched H1 cluster (those in the left-most half of the CFAR-FP). Thus, the two-region behavior of the ROB detector is in agreement with its robustness capability with almost no loss of detection power.

4.4.3 Design Strategies in the CFAR-FP

From the discussion above, it should be clear that the CFAR-FP framework can be exploited as a hybrid model-based/data-driven design tool. We discuss now different criteria to obtain a radar detector with a desired behavior. Ad hoc detectors can be designed by empirically drawing the desired decision boundary in the CFAR-FP; then the equation corresponding to the decision statistic can be obtained through curve fitting. Unfortunately, such a naive strategy lacks a mechanism to control the PFA: indeed, to guarantee a chosen value, one has to iteratively modify the boundary by trial and error until the PFA constraint is fulfilled. This time-consuming procedure must be repeated if any of the parameters are changed. A more methodological approach is desirable. To this end, consider that any detector admitting an explicit function for its decision boundary can be equivalently expressed in terms of the statistic

t(z, S) = t̃ − f(β, ε)    (4.17)



with f(·) a known function of β parameterized by a vector ε, including tunable parameters in addition to the detection threshold η. A possible strategy is to choose a model for f depending on a small number of parameters set upfront, leaving η as the only one to be optimized to satisfy a PFA = ψ constraint. So, for a detector X, the final test can always be written as

t̃ − f(β, εX) ≷(H1/H0) ηX.

Simple choices for f are a polynomial f = Σ_{i=1}^{p} εᵢ βⁱ with low order p, or the Gaussian function f = ε₁ e^(−(β−ε₂)²/ε₃) with tunable height ε₁, location ε₂, and width ε₃ parameters. For simplicity, special cases of such functions with only one degree of freedom are preferable (i.e., in which εX is a scalar). In such cases only εX must be manually set, while ηX is obtained by inverting the PFA formula. The simplest option is f = εLIN β, with decision boundary

t̃ − εLIN β = ηLIN,   εLIN ∈ ℝ.

A criterion to guide the setting of the slope εLIN can make use of reference directions in the CFAR-FP. In particular, in [30] closed-form expressions for the iso-SNR lines are provided,15 which have an intuitive interpretation: specifically, knowing how the H1 cluster migrates for increasing mismatch (cos²θ) allows one to try to keep it above threshold, thereby obtaining a robust detector. Conversely, setting the decision boundary orthogonal to an iso-SNR line will produce an H0 decision for mismatched signals, thus yielding a selective behavior. In conclusion, εLIN can be set equal to either the slope of an iso-SNR line (for a chosen cos²θ) to obtain a robust detector, or its negative reciprocal (perpendicular direction) to obtain a selective detector. More details can be found in [30]. Through the hybrid design strategies discussed above, it is possible to achieve the same performance of model-based receivers as well as remarkable behaviors that have no counterpart in classical detectors. Recent work has further developed the design possibilities of the CFAR-FP by formulating an optimal design problem and providing a low-complexity approximate algorithm for its resolution, which can yield customized detectors with desired specifications [48].
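The full design loop, choosing f upfront and then calibrating only the threshold η, can be sketched as follows. The snippet is a minimal illustration (toy steering vector, illustrative dimensions, and a loose target PFA so that a modest Monte Carlo run suffices); thanks to the CFAR property, the threshold calibrated under white noise remains valid for any disturbance covariance.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 8, 24
v = np.ones(N) / np.sqrt(N)          # toy steering vector (assumption)

def beta_ttilde(z, Z):
    Si = np.linalg.inv(Z @ Z.conj().T)
    s1 = np.real(z.conj() @ Si @ z)
    s2 = np.abs(z.conj() @ Si @ v) ** 2 / np.real(v.conj() @ Si @ v)
    return 1 / (1 + s1 - s2), s2 / (1 + s1 - s2)

def statistic(eps_lin=0.5):
    """Draw H0 data and evaluate the generic CFAR-FP statistic (4.17) with f = eps_lin * beta."""
    z = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    Z = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
    b, t = beta_ttilde(z, Z)
    return t - eps_lin * b

# Monte Carlo threshold setting for a target PFA
pfa = 0.05
t0 = np.array([statistic() for _ in range(4000)])
eta = np.quantile(t0, 1 - pfa)

# empirical check on fresh H0 data: exceedance rate should be close to the target PFA
hits = int(np.sum(np.array([statistic() for _ in range(1000)]) > eta))
```

For a robust design, eps_lin would instead be set to the slope of an iso-SNR line from [30]; only the calibration step changes through the resulting f.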

15. These expressions are parametric in N, K, so the resulting detection statistic is general.

4.5 Summary

This chapter discussed various forms of hybridization between model-based and data-driven concepts for the task of adaptive radar detection by mixing the two paradigms to gain advantage from both, while still compensating for their respective disadvantages. Aspects related to dataset acquisition/generation and



algorithmic procedures for detector design were analyzed, with emphasis on the way the functional dependency between the input (CUT) and output (decision) is shaped, as well as on the way adaptiveness is implemented. After reviewing hybrid model-based and data-driven ideas in signal processing and communications, the problem of designing data-driven and hybrid detectors that are robust to signal mismatches, or that can conversely reject unwanted signals, was specifically addressed. Several strategies were analyzed that make use of raw data, well-known CFAR detectors, or maximal invariant statistics. These were practically demonstrated by adopting one of the simplest data-driven classifiers as a detection tool, namely k-nearest neighbors. Finally, a reinterpretation of model-based detectors in a suitable CFAR feature space was discussed, which enables a more intuitive interpretation under matched/mismatched conditions using general concepts adopted in machine learning, such as data clusters, decision boundaries, and linear/nonlinear classifiers. Such a framework was then exploited as a hybrid model-based/data-driven design approach.

References

[1] Brennan, L. E., and I. S. Reed, “Theory of Adaptive Radars,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 9, No. 2, March 1973, pp. 237–252.

[2] Tsymbal, A., “The Problem of Concept Drift: Definitions and Related Work,” The University of Dublin, Trinity College, Department of Computer Science, Dublin, Ireland, Tech. Rep. TCD-CS-2004-15, 2004.

[3] Widmer, G., and M. Kubat, “Learning in the Presence of Concept Drift and Hidden Contexts,” Machine Learning, Vol. 23, 1996, pp. 69–101.

[4] Triebe, O., H. Hewamalage, P. Pilyugina, N. Laptev, C. Bergmeir, and R. Rajagopal, “NeuralProphet: Explainable Forecasting at Scale,” arXiv:2111.15397, November 2021.

[5] Shlezinger, N., J. Whang, Y. C. Eldar, and A. G. Dimakis, “Model-Based Deep Learning,” arXiv:2012.08405, 2020.

[6] Farsad, N., N. Shlezinger, A. J. Goldsmith, and Y. C. Eldar, “Data-Driven Symbol Detection via Model-Based Machine Learning,” arXiv:2002.07806, 2020.

[7] Shlezinger, N., N. Farsad, Y. C. Eldar, and A. J. Goldsmith, “Model-Based Machine Learning for Communications,” arXiv:2101.04726, 2021.

[8] Balatsoukas-Stimming, A., and C. Studer, “Deep Unfolding for Communications Systems: A Survey and Some New Directions,” arXiv:1906.05774, 2019.

[9] Coluccia, A., A. Fascista, and G. Ricci, “A k-Nearest Neighbors Approach to the Design of Radar Detectors,” Signal Processing, Vol. 174, 2020, p. 107609.

[10] Cover, T., and P. Hart, “Nearest Neighbor Pattern Classification,” IEEE Transactions on Information Theory, Vol. 13, No. 1, 1967, pp. 21–27.


[11] Guo, Z., P. Shui, X. Bai, S. Xu, and D. Li, “Sea-Surface Small Target Detection Based on K-NN with Controlled False Alarm Rate in Sea Clutter,” Journal of Radars, Vol. 9, No. R20055, 2020.

[12] Guo, Z.-X., and P.-L. Shui, “Anomaly Based Sea-Surface Small Target Detection Using K-Nearest Neighbor Classification,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 56, No. 6, 2020, pp. 4947–4964.

[13] Coluccia, A., A. Fascista, and G. Ricci, “A KNN-Based Radar Detector for Coherent Targets in Non-Gaussian Noise,” IEEE Signal Processing Letters, Vol. 28, 2021, pp. 778–782.

[14] Coluccia, A., A. Fascista, and G. Ricci, “Robust CFAR Radar Detection Using a K-Nearest Neighbors Rule,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), 2020, pp. 4692–4696.

[15] Coluccia, A., and G. Ricci, “Radar Detection in K-Distributed Clutter Plus Thermal Noise Based on KNN Methods,” in 2019 IEEE Radar Conference (RadarConf), 2019, pp. 1–5.

[16] Gandhi, P., and V. Ramamurti, “Neural Networks for Signal Detection in Non-Gaussian Noise,” IEEE Transactions on Signal Processing, Vol. 45, No. 11, 1997, pp. 2846–2851.

[17] Conte, E., M. Longo, and M. Lops, “Modelling and Simulation of Non-Rayleigh Radar Clutter,” IEE Proceedings F–Radar and Signal Processing, Vol. 138, No. 2, April 1991, pp. 121–130.

[18] Greco, M., and A. De Maio (eds.), Modern Radar Detection Theory, Edison, NJ: SciTech Publishing, 2015.

[19] Conte, E., M. Lops, and G. Ricci, “Adaptive Detection Schemes in Compound-Gaussian Clutter,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 34, No. 4, October 1998, pp. 1058–1069.

[20] Richmond, C. D., “Performance of the Adaptive Sidelobe Blanker Detection Algorithm in Homogeneous Clutter,” IEEE Transactions on Signal Processing, Vol. 48, No. 5, May 2000, pp. 1235–1247.

[21] Bandiera, F., D. Orlando, and G. Ricci, Advanced Radar Detection Schemes Under Mismatched Signal Models, Synthesis Lectures on Signal Processing No. 8, Morgan & Claypool Publishers, 2009.

[22] Lehmann, E. L., Testing Statistical Hypotheses, Second Edition, Springer-Verlag, 1986.

[23] Scharf, L. L., Statistical Signal Processing: Detection, Estimation and Time Series Analysis, Reading, MA: Addison-Wesley, 1991.

[24] Bose, S., and A. O. Steinhardt, “Maximal Invariant Framework for Adaptive Detection with Structured and Unstructured Covariance Matrices,” IEEE Transactions on Signal Processing, Vol. 43, September 1995, pp. 2164–2175.

[25] Bose, S., and A. O. Steinhardt, “Optimum Array Detector for a Weak Signal in Unknown Noise,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 32, July 1996, pp. 911–922.

[26] Conte, E., A. De Maio, and C. Galdi, “CFAR Detection of Multidimensional Signals: An Invariant Approach,” IEEE Transactions on Signal Processing, Vol. 51, No. 1, January 2003, pp. 142–151.


[27] Kraut, S., L. L. Scharf, and R. W. Butler, “The Adaptive Coherence Estimator: A Uniformly Most Powerful Invariant Adaptive Detection Statistic,” IEEE Transactions on Signal Processing, Vol. 53, No. 2, February 2005, pp. 427–438.

[28] Liu, J., Z.-J. Zhang, Y. Yang, and H. Liu, “A CFAR Adaptive Subspace Detector for First-Order or Second-Order Gaussian Signals Based on a Single Observation,” IEEE Transactions on Signal Processing, Vol. 59, No. 11, November 2011, pp. 5126–5140.

[29] De Maio, A., “Robust Adaptive Radar Detection in the Presence of Steering Vector Mismatches,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 41, No. 4, October 2005, pp. 1322–1337.

[30] Coluccia, A., A. Fascista, and G. Ricci, “CFAR Feature Plane: A Novel Framework for the Analysis and Design of Radar Detectors,” IEEE Transactions on Signal Processing, Vol. 68, 2020, pp. 3903–3916.

[31] Haykin, S., Neural Networks and Learning Machines, Third Edition, Upper Saddle River, NJ: Pearson Prentice Hall, 2009.

[32] Kelly, E. J., “An Adaptive Detection Algorithm,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 22, No. 2, 1986, pp. 115–127.

[33] Kelly, E. J., “Performance of an Adaptive Detection Algorithm; Rejection of Unwanted Signals,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 25, No. 2, 1989, pp. 122–133.

[34] Bhattacharya, T., and S. Haykin, “Neural Network-Based Radar Detection for an Ocean Environment,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 33, No. 2, 1997, pp. 408–420.

[35] De Maio, A., and D. Orlando, “An Invariant Approach to Adaptive Radar Detection under Covariance Persymmetry,” IEEE Transactions on Signal Processing, Vol. 63, March 2015, pp. 521–529.

[36] Ciuonzo, D., D. Orlando, and L. Pallotta, “On the Maximal Invariant Statistic for Adaptive Radar Detection in Partially Homogeneous Disturbance with Persymmetric Covariance,” IEEE Signal Processing Letters, Vol. 23, December 2016.

[37] Tang, M., Y. Rong, X. R. Li, and J. Zhou, “Invariance Theory for Adaptive Detection in Non-Gaussian Clutter,” IEEE Transactions on Signal Processing, Vol. 68, 2020.

[38] Besson, O., S. Kraut, and L. L. Scharf, “Detection of an Unknown Rank-One Component in White Noise,” IEEE Transactions on Signal Processing, Vol. 54, July 2006.

[39] Robey, F., D. Fuhrmann, E. Kelly, and R. Nitzberg, “A CFAR Adaptive Matched Filter Detector,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 28, No. 1, 1992, pp. 208–216.

[40] Conte, E., M. Lops, and G. Ricci, “Asymptotically Optimum Radar Detection in Compound Gaussian Noise,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 31, No. 2, 1995, pp. 617–625.

[41] Kalson, S., “An Adaptive Array Detector with Mismatched Signal Rejection,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 28, No. 1, 1992, pp. 195–207.


[42] Pulsone, N. B., and C. M. Rader, “Adaptive Beamformer Orthogonal Rejection Test,” IEEE Transactions on Signal Processing, Vol. 49, March 2001, pp. 521–529.

[43] Bandiera, F., O. Besson, and G. Ricci, “An ABORT-Like Detector with Improved Mismatched Signals Rejection Capabilities,” IEEE Transactions on Signal Processing, Vol. 56, No. 1, 2008, pp. 14–25.

[44] De Maio, A., “Rao Test for Adaptive Detection in Gaussian Interference with Unknown Covariance Matrix,” IEEE Transactions on Signal Processing, Vol. 55, No. 7, July 2007, pp. 3577–3584.

[45] Coluccia, A., G. Ricci, and O. Besson, “Design of Robust Radar Detectors through Random Perturbation of the Target Signature,” IEEE Transactions on Signal Processing, Vol. 67, No. 19, 2019, pp. 5118–5129.

[46] Kelly, E. J., and K. Forsythe, “Adaptive Detection and Parameter Estimation for Multidimensional Signal Models,” Lincoln Laboratory, MIT, Lexington, MA, Tech. Rep. 848, April 1989.

[47] Kelly, E. J., “Adaptive Detection in Non-Stationary Interference, Part III,” Lincoln Laboratory, MIT, Lexington, MA, Tech. Rep. 761, August 1987.

[48] Coluccia, A., A. Fascista, and G. Ricci, “Design of Customized Adaptive Radar Detectors in the CFAR Feature Plane,” arXiv:2203.12565, 2022.


5 Theories, Interpretability, and Other Open Issues

This final chapter is devoted to a detailed discussion of some important open issues of data-driven techniques, which should be carefully considered when making design choices, particularly for mission-critical tasks such as radar detection. The focus will be on cutting-edge studies aimed at understanding the generalization capabilities of deep learning and, more generally, at developing foundational theories for it, as well as on state-of-the-art results and open problems regarding explainability and interpretation. Other relevant aspects, such as adversarial attacks, stability, sustainability, marginal return, and patentability, are also outlined.

5.1 Challenges in Machine Learning

Deep learning and other machine learning approaches have proved very effective for several problems. The signal processing community, in particular, has witnessed the widespread adoption of deep neural networks (DNNs) as a disruptive approach to classical problems in speech and image recognition [1–3]. More generally, DNNs have shown impressive capabilities as classifiers and generative models, often outperforming shallow techniques such as linear methods and SVMs. As will be discussed in detail in the course of this chapter, however, a number of open issues remain to be solved; specifically, most advanced data-driven algorithms, and especially deep learning methods, are still unsatisfactory with respect to three broad issues:

• Interpretability/explainability;



• Data, computation, and energy requirements (hence cost and sustainability);

• Stability and robustness to adversarial attacks.

Indeed, DNNs, as opposed to some traditional machine learning techniques (in particular those based on linear models), lack interpretability. On the performance side, the accuracy of deep learning can be very high, but considerably more training data, power consumption, and computing resources are required compared to more traditional approaches and to humans [4] (more detail will be given in Section 5.3.4). Furthermore, there is an inherent lack of robustness against adversarial attacks, which may easily deceive sophisticated DNNs (this will be covered in Section 5.3.1). All such aspects are crucial for a context like radar, which mostly serves safety, security, and defense applications.

Radar systems have traditionally been an expensive technology mostly used in military applications, but technological advances have also brought about the development of low-cost (and typically reduced-range) civil and commercial radar applications [5]. In addition, the black box nature of deep learning enables the design of radar applications without requiring thorough radar knowledge. It goes without saying that a more accessible technology paves the way not only to civil applications, with more possibilities for low-funded organizations/companies, but also to new threats, since it widens the pool of those who can misuse radars. Clearly, in defense and security contexts, technological systems need to be robust, trustworthy, and interpretable (radars being no exception), but such requirements arise also in safety applications, such as the automotive sector, where the safety of the intended functionality (SOTIF) concept requires the absence of unreasonable risk due to hazards resulting from functional insufficiencies of the intended functionality (or from reasonably foreseeable misuse by persons).
Situational awareness of vehicles is in fact critical to safety, and when it is derived from complex sensors and processing algorithms, especially in emergency intervention systems (e.g., emergency braking systems) and advanced driver assistance systems (ADAS), it must fulfill a SOTIF standard.1 As radars are intensively used in ADAS, and machine learning and artificial intelligence at large are playing an increasingly important role in this and other sectors, robustness, trustworthiness, and interpretability are indeed important issues to be considered. This discussion does not apply to the detection task only: other radar functions, apparently closer to the kinds of problems well-suited to DNN-based classification, are also not exempt. For the task of automatic target recognition (ATR), for instance, model-based techniques have been adopted since the mid-1990s and are still used because of their strong electromagnetic, signal processing,

1. For instance, ISO/PAS 21448:2019. Risks are classified according to the automotive safety integrity level (ASIL) defined by the ISO 26262 standard.



and statistical decision theory principles [6]. Compared to such approaches, DNNs provide significant performance improvements, but huge amounts of data are required given the variability of targets, environments, and sensors; this is a major challenge, partially circumvented by the use of synthetic data to better cover the variety of conditions (see Section 3.2.2; additional aspects will be touched on in Section 5.3.2). Moreover, the need for accurate confidence predictions is very important in applications where the consequences of a wrong decision have a dramatic impact: for ATR too, then, combining model-based and data-driven tools is a promising path, as observed several times throughout this book, although several challenges need to be addressed [6].

However, the key point is that, currently, the reasons why DNNs work so well are still not completely clear. Several studies have reported evidence, but sometimes new results are only partially in agreement with previous beliefs, ultimately resulting in chasing a moving target. On the one hand, many applicative successes of DNNs are based on a mix of black box deep architectures [7], heuristics, and experience/intuition. On the other hand, many of the underlying ideas are reminiscent of well-established concepts such as signal decomposition, convolution/deconvolution, (matched) filtering, correlation, projection, low-rank and sparse representation, and many more. As will be discussed in detail in Section 5.2, CNNs have clear signal processing roots and can be explained in terms of sparse representation [8], hierarchical wavelet filtering [9], and graph signal processing [10]. The feedback loops appearing in RNNs and reinforcement learning are analogous to those found in adaptive filtering and control system theory (and also dynamic programming), and unsupervised learning tools such as autoencoders and generative adversarial networks (GANs) are closely linked to information theory (coding) and sparse representation.
It is thus not surprising that, in the recently started collective effort to “open the box” and give a theoretical foundation to DNNs, signal processing and information theory methodologies are among the most effective tools. But theoretical guarantees certainly lag behind practice in the field of artificial intelligence (AI) at large. Furthermore, practitioners are by far more numerous than scientists, so most of the solutions found in the real world may have legitimately followed a twist-the-knob rationale more than a fully aware design, based on the ultimately important fact (for a commercial product) that it works. It is necessary to close this gap; several cutting-edge directions toward such a goal are discussed in the following.

5.2 Theories for (Deep) Neural Networks

It has been observed that the grand task of attaching a theory to deep learning behaves like a magic mirror, in which every researcher sees themselves.2 This

2. These are words by Ron Kimmel (Computer Science Department at the Technion, Israel), strengthened by David Donoho (Department of Statistics at Stanford University), as reported in [8].


explains why many different interpretations have been attempted. In this section we review several of the most promising ideas.

The concept of an (artificial) NN is usually introduced and justified by brain-inspired arguments, which more specifically suggest the need for deep architectures for extracting complex structure and building internal representations from rich sensory inputs (natural images and their motion, speech, etc.) [1]. In fact, for instance, the NN-based models called GANs have been purposely introduced to answer the quest for modeling the distribution of complex data in high dimensions. For such a task, parametric models (e.g., a mixture of Gaussians) fail to capture the complexity of the data, while nonparametric models (e.g., kernel density estimation, nearest-neighbor) fail to be effective due to the high dimensionality [11]. More intriguingly, neuromorphic algorithms and hardware platforms are emerging that are more realistically inspired by biological neural networks (the brain); in particular, in spiking neural networks (SNNs) the space-time computing capabilities of spiking neurons are exploited for more sophisticated yet energy-efficient processing [12, 13]. At the same time, SNNs also pose new challenges in terms of understanding the behavior and interpreting the decisions of artificial machines. This is already a major issue in (non-spiking) DNNs, which exhibit problematic behaviors such as confidently classifying unrecognizable images [14] and misclassifying imperceptible perturbations of correctly classified images, which can be exploited for adversarial attacks [15, 16] (we will discuss this issue in more detail in Section 5.3.1).
Also, the way such networks are trained seems far from how humans learn: huge datasets are needed for supervised learning, whose processing requires a lot of energy and computational resources (this is covered in Section 5.3.2), and in many applications such a massive amount of data is very difficult to obtain. It has been pointed out [4] that, indeed, the most challenging problem in machine learning is currently unsupervised learning, whose development is significantly behind that of supervised and reinforcement learning. It is nevertheless apparent that, in general, a full understanding of some basic building blocks of NNs (and DNNs) is pivotal toward comprehending the mysteries of data-driven learning. In the following, several aspects along this challenging path are discussed.

5.2.1 Network Structures and Unrolling

NNs can be considered as particular cases of network structures. The latter have a long history in many fields of science and engineering, and they are regularly found in radar and related contexts; for example, in digital filters (tapped delay lines), beamformers, and feedback control systems and trackers. Actually, any processing that involves parallelizable subtasks admits some network architecture. At the same time, even machine learning algorithms that are not of the NN type


can be recast as a suitable network; for instance, KNN admits direct [17] or clocked (more stable) [18] network implementations.3 As discussed in Section 4.4.1, the use of network structures with proper restrictions (architecture) is a leitmotif in machine learning, aimed at promoting certain specific properties starting from the very generic fully connected architecture. For instance, convolutional layers in CNNs act locally and then progressively reduce the number of connections through the layers, a more specialized processing typical of local receptive fields [19]. Similar considerations apply to weight sharing and to the special structures appearing in RNNs, FPNs, and ResNets. Also the mapping chain in (4.13) of maximal invariant based CFAR detectors can be interpreted, as discussed, as the layers of a learning machine with a peculiar structure. Haykin [20] has provided a unified view under the concept of neural network, where the type of neurons, the structure of the connections, and the presence/absence of a nonlinearity can accommodate several signal processing and learning tools.

Other relationships exist between NNs and more general concepts in stochastic processes and system theory. As an example, it has been shown [21] that if the neuron activation function is the hyperbolic tangent tanh(·), every first-order nonlinear autoregressive exogenous (NARX) model can be transformed into an RNN and vice versa. Moreover, if the neuron activation function is piecewise linear, every NARX model (irrespective of its order) can also be transformed into an RNN and vice versa. On the other hand, generally speaking, NNs can be deemed just rather naive feedforward schemes. Adopting the words of François Chollet4,5:

Neural networks are a sad misnomer. They’re neither neural nor even networks. They’re chains of differentiable, parameterized geometric functions, trained with gradient descent (with gradients obtained via the chain rule). A small set of high-school-level ideas put together.

One of the consequences of this relatively simple conceptual structure is that NNs can be described algorithmically in a quite straightforward way: this provides an alternative interpretation of NNs as the unrolling (or unfolding) of iterative algorithms [22, 23] (see also Section 4.2.2). The idea of unrolling is illustrated in Figure 5.1. Since iterative algorithms generally involve far fewer parameters than NNs, unrolled networks are highly parameter-efficient and require fewer training data;

3. In more detail, the network architecture of the KNN has a first layer of conventional neurons taking feedback from a second layer (latched logic AND neurons, which tessellate the feature space by intersection of hyperplanes), while the third layer contains variable-threshold neurons.
4. AI engineer and deep learning specialist at Google, creator of Keras (one of the most widely used libraries for deep learning in Python), and a main contributor to the TensorFlow machine learning framework.
5. Posted on Twitter, January 12, 2018.


Figure 5.1 Conceptual scheme of algorithm unrolling: an iterative algorithm repeatedly implementing a certain transformation h(·; θ) parametrized in θ is equivalent to a sequence of transformations h1(·; θ1), h2(·; θ2), . . . , each mapping the input of one stage into the input of the next, where the parameters computed in the previous iteration become those of the current iteration.

moreover, they naturally inherit prior structures and domain knowledge instead of learning that information from intensive training data [23]. Unrolled networks can share parameters across all layers, which makes them more parameter-efficient; however, in doing so they resemble RNNs, and hence may similarly suffer from gradient explosion and vanishing problems, which complicates their training (see Section 2.4.3.2). An alternative approach is to use layer-specific parameters: in that case, the networks slightly deviate from the original iterative algorithm and may not completely inherit its theoretical benefits (e.g., convergence guarantees), but on the other hand higher representation power and easier training are gained compared to RNNs [23].

The reverse viewpoint may also be insightful; that is, to interpret network structures as a certain iterative algorithm. For instance, a standard NN, namely a multilayer perceptron (MLP) with q hidden layers (see Section 2.4.1), can be associated with the simple iteration of a linear transformation followed by a nonlinearity (activation function) ϕ(·), that is, the following generic pseudocode:

for i = 0 to q
    x^(i+1) ← ϕ(W^(i+1) x^(i) + b^(i+1))
end
y ← x^(q+1)
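A minimal runnable rendition of this abstract loop can be sketched as follows (the layer sizes, random weights, and tanh activation are illustrative placeholders, not tied to any specific detector in the book):

```python
import numpy as np

def mlp_forward(x, weights, biases, phi=np.tanh):
    # Generic iteration x^(i+1) = phi(W^(i+1) x^(i) + b^(i+1)), i = 0..q;
    # the value returned after the last layer is the network output y = x^(q+1).
    for W, b in zip(weights, biases):
        x = phi(W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # M = 4 input features, q = 2 hidden layers, 3 outputs (illustrative)
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
biases = [rng.normal(size=sizes[i + 1]) for i in range(3)]

y = mlp_forward(rng.normal(size=4), weights, biases)  # y.shape == (3,)
```

Written this way, the MLP is literally a fixed number of unrolled iterations of the same linear-then-nonlinear step, with layer-specific parameters.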


This is an iterative vector-valued form, for a general number q of hidden layers and M-dimensional feature vectors, of the recursive definition given in (2.30). It enables a better understanding of the network behavior and may ease a rigorous theoretical analysis. Furthermore, architectural enhancements and performance improvements of NNs may result from incorporating domain knowledge associated with iterative techniques [23]. Indeed, many traditional iterative algorithms have a fixed pattern in their iteration steps: a linear mapping followed by a nonlinear operation. Therefore, the abstract algorithm above represents a broad class of iterative algorithms which, in turn, can be identified as deep networks with a structure similar to MLPs. The same technique is applicable to other networks, such as CNNs and RNNs, by replacing the linear operations with convolutions and by adopting shared parameters across different layers [23].

The translation of such recent research ideas to the context of radar detection is still largely unexplored. For detection in the context of digital communications, a model-driven deep learning approach to MIMO detection is devised in [24] by unfolding an iterative algorithm and adding some trainable parameters. Since the number of trainable parameters is much smaller than for vanilla deep learning, the resulting detector can be rapidly trained with a much smaller dataset.

5.2.2 Information Theory, Coding, and Sparse Representation

Several explanations that have been attempted to grasp a fundamental understanding of DNNs rely on concepts from information theory, coding, and sparse representation. Nonlinear system theory and signal processing methodologies are also advocated as possible analytical tools to open the black box. For instance, deep learning can be interpreted as a parameter estimation problem, whose stability and well-posedness can be analyzed through the tools of nonlinear dynamical systems.
In particular, the exploding and vanishing gradient phenomenon (see Section 2.4.3.2) can be related to the stability of a discrete differential equation; based on that, stabilizing strategies can be conceived by restricting the architecture while keeping sufficiently general applicability [25]. More on the signal processing side, wavelets and invariances can play an important role in the understanding of DNNs [9]. Signals are in fact generally mapped into some intermediate representation used for constructing models, in a way that is often required to be time-shift-invariant.6 Indeed, several studies have pointed out that CNNs are computing progressively more powerful invariants as depth increases (e.g., [27–29]); in this respect, the analysis of wavelet scattering 6. The joint time-frequency scattering transform—a time-shift-invariant representation that characterizes the multiscale energy distribution of a signal in time and frequency—can be computed through wavelet convolutions and nonlinearities; as such, it may be implemented as a CNN whose filters are not learned but calculated from wavelets [26].


networks may explain some important properties of CNNs [30]. Still, there is certainly a difference between natural image processing and radar data processing (and target detection), so it is unclear to what extent such results can also directly apply to the latter domain. A closely related interpretation of CNNs makes use of sparse coding [31]. Hierarchical structures naturally found in many types of data (e.g., images) can be captured by nested sparse features; that is, convolutional dictionaries. More specifically, in [8, 32] the applicability of a multilayer convolutional sparse coding model is demonstrated for several applications in an unsupervised setting. This represents a bridge between matrix factorization, sparse dictionary learning, and sparse autoencoders. The idea is that a noisy signal y = x + e can be projected to the model by solving a (deep coding) pursuit problem to obtain x = Dc, with c a sparse vector. A solution to this problem, under certain assumptions, is the basis pursuit7 : cˆ = arg min c1 subject to Dc − y2 ≤ δ. c

(5.1)
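As a concrete (purely illustrative) sketch, problems like (5.1) are often attacked through the Lagrangian relative min_c ½‖Dc − y‖₂² + λ‖c‖₁, which the classical iterative soft-thresholding algorithm (ISTA) solves; the random dictionary, sparsity pattern, and λ below are arbitrary choices, not taken from [8, 32]:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=1000):
    # Iterative soft-thresholding for min_c 0.5*||D c - y||_2^2 + lam*||c||_1,
    # a Lagrangian relative of the constrained basis pursuit in (5.1).
    L = np.linalg.norm(D, 2) ** 2            # step size from the gradient's Lipschitz constant
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = soft_threshold(c + D.T @ (y - D @ c) / L, lam / L)
    return c

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 60))                # overcomplete dictionary (illustrative)
c_true = np.zeros(60)
c_true[[3, 17, 42]] = [1.5, -2.0, 1.0]       # sparse ground truth
y = D @ c_true + 0.01 * rng.normal(size=30)  # noisy signal y = x + e, with x = D c_true
c_hat = ista(D, y, lam=0.1)                  # sparse estimate of c
```

Notably, each ISTA iteration is again a linear map followed by a simple nonlinearity, which is precisely the pattern that the unrolling viewpoint of Section 5.2.1 identifies with a network layer.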

By iterating this problem over multiple dictionaries and (sparse) parameter vectors, a layered sparse representation is obtained (for details, refer to [8, 32]).

Several further links with information-theoretic concepts and tools have been attempted. The study of the dynamics of learning through the lens of information theory has received a fair amount of attention. A possible view is that deep learning is a matter of representation learning: each layer of a DNN can be seen as a set of summary statistics that contain some, but not all, of the information present in the input, while retaining as much information about the target output as possible [33, 34]. More specifically, the information bottleneck theory envisions that deep networks undergo two distinct phases: an initial fitting phase and a subsequent compression phase. Such a theory provides a fundamental bound on the amount of input compression and target output information that any representation can achieve, and may thus serve as a method-agnostic ideal to which different architectures and algorithms may be compared [35]. However, the influence of the activation function, the exact role of the diffusion-like behavior of stochastic gradient descent in the compression phase, and its causal relation with the excellent generalization performance of DNNs are still not fully understood, and some controversies remain (for a discussion on this topic see [36–39]).

5.2.3 Universal Mapping, Expressiveness, and Generalization

Among the aspects of DNNs that still need a comprehensive theoretical explanation, a crucial one is the tension between expressiveness (capability to capture

See also Section 2.1.4 about signal representation and atomic norm decomposition.


the characteristics of the training data) and generalization power to unseen examples (see Section 2.2.1). The basic theory on this point relies on the classical results of Cybenko [40] and others [41, 42], collectively referred to as universal (myopic) mapping: sufficiently large NNs, in terms of number of neurons, even with a very shallow architecture in the number of layers, may approximate any function.8 Thus, in a sense, the existence of a NN that can work well in practice is guaranteed for any problem, and in principle there is no need for deepness. But universal mapping actually refers to theoretical expressiveness power, not actual generalization capability. Furthermore, in practice the complexity and performance of training and execution can be very diverse according to the type of network, which explains why certain types of (deep) NNs are preferred in applications (and others remain purely theoretical). This can be considered, more generally, as a consequence of the so-called no-free-lunch theorem. Such a theoretical result states that, since most machine learning algorithms need to solve an optimization problem for the training, fundamental limits of optimization and search apply to machine learning too, especially supervised learning: in particular, on average no learning algorithm can have lower test error (risk) than another one over all classes of problems. This partially limits the effectiveness of cross-validation (see again Section 2.2.1) and the quest for the ultimate algorithm, since no algorithm exists that can perform best on all problems. At the same time, it motivates the continuous development of new algorithms, as well as a better understanding of old ones, including testing them on different setups and considering their trade-offs in terms of accuracy and complexity [43].
Additionally, a general implication of the theorem is that a learning gain can be obtained only by incorporating some prior knowledge into the algorithm (i.e., data alone may be insufficient [44]). Designing an algorithm based on NNs requires a number of choices regarding the number of layers (shallow or deep network), the number of neurons per layer (narrow or wide network), and the structure of the connections (fully connected or restricted networks). Most settings are empirical and result from trial-and-error procedures over many solutions with different hyperparameters, although, as seen in Section 5.2.1, certain types of networks are more suitable for certain applications, being backed by well-established signal processing concepts (convolution, feedback, local filters, etc.). Deepness gives an exponential advantage in the number of neurons per layer [45, 46], which motivated the practical success of DNNs: the key point is that complicated processing can be obtained by a sequence of many simple steps. Considering small pieces and
8. This type of result can be traced back to the Weierstrass theorem (1885), later generalized by Stone (1937), which states that any continuous function over a closed interval on the real axis can be expressed as an absolutely and uniformly convergent series of polynomials. For a more detailed excursus on this point, please refer to [20, p. 219]. See also the Kolmogorov-Arnold representation theorem and the representer theorem in Section 5.2.4.


Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

combining them toward higher levels of abstraction, as opposed to attempting to capture all the information at once, can indeed bring remarkable expressive power, and is in agreement with the sparse representation interpretation discussed in Section 5.2.2. The drawback is that the presence of nonlinearities and a large number of parameters makes even very simple NNs hard to analyze theoretically. Besides expressiveness, another theoretical conundrum of (D)NNs is their impressive generalization capability. In classical statistical learning theory, one of the major issues to be considered (indeed, a concern also for the designer interested in data-driven methods for radar detection) is overfitting, since this typically reduces the generalization power (as discussed in Chapter 2). The case of overparametrized functions such as DNNs is, however, more complex, as discussed in the remainder of this section and also in Section 5.2.4. As seen in Section 2.3, local averaging techniques (e.g., nearest-neighbors or decision trees) can adapt to any (nonlinear) relationship between inputs and outputs, hence they are among the most widely adopted data-driven tools, especially for low-dimensional input data. Moving to high-dimensional settings, the classical theory [47] adopts the empirical risk as a proxy for the expected loss, minimizing the (possibly regularized) error on training data over a certain set of functions $\mathcal{H}$. This means that the learned input-output mapping can be expressed as

$$f_{\text{learned}} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{N_T} L(f(x_i, y_i)) \tag{5.2}$$

with $L(f(x_i, y_i))$ a loss function measuring the error in approximating $y_i$ through $\hat{y}_i = f(x_i)$ for the training dataset $\{x_i, y_i\}$ (input-output pairs), $i = 1, \ldots, N_T$.9 An important measure of quality for the solution is thus the generalization gap (i.e., the difference between the expected loss and the empirical loss). It can be shown that

$$\lim_{N_T \to \infty} \left[ E_{XY}[L(f(x, y))] - \frac{1}{N_T} \sum_{i=1}^{N_T} L(f(x_i, y_i)) \right] = 0 \tag{5.3}$$

and the convergence to zero is at a rate depending on the capacity (a measure of complexity, expressive power, richness, flexibility) of the function space $\mathcal{H}$, for example, measured by the Vapnik-Chervonenkis (VC) dimension [47]. The classical theory also highlights the importance of the bias-variance trade-off (see Sections 2.2.1 and 2.2.2); that is, the need to find the sweet spot during the training, which may lead to accepting a small bias in exchange for a lower variance (error), balancing underfitting and overfitting [48]. This is depicted in Figure 5.2, which is a more detailed version of Figure 2.14. The sweet spot can be sought via a
9. For example, squared loss $(\hat{y} - y)^2$ for regression or zero-one loss $\mathbb{I}\{\hat{y} \neq y\}$ for classification (see Section 2.1.2).


Figure 5.2 Typical U-shape curve of the test risk (and associated bias-variance trade-off), with overfitting/underfitting regions and sweet spot highlighted.

suitable choice of $\mathcal{H}$, for example, by selecting an appropriate NN architecture. Another typical way to avoid overfitting is early stopping during the training phase. Bias can also be introduced and controlled via regularization, as learning (especially in high dimension) is usually an ill-posed problem (see Section 2.4.3.2): regularization adopts a penalty function $J(f)$ to prevent overfitting and/or promote certain properties of the solution (e.g., smoothness or sparsity), so the empirical risk is replaced by

$$\frac{1}{N_T} \sum_{i=1}^{N_T} L(f(x_i, y_i)) + \lambda J(f). \tag{5.4}$$
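As a minimal numeric sketch of the regularized empirical risk in (5.4), consider a 1-D linear model $f(x) = wx$ with squared loss and ridge penalty $J(f) = w^2$: the objective is quadratic in $w$, so the minimizer is available in closed form. The data values below are made up for illustration.

```python
# Sketch of (5.4) for a 1-D linear model f(x) = w*x with squared loss and
# ridge penalty J(f) = w^2. Setting the derivative of
#   (1/N) * sum_i (w*x_i - y_i)^2 + lam * w^2
# to zero gives the closed-form minimizer
#   w = sum_i x_i*y_i / (sum_i x_i^2 + N*lam).
# Hypothetical data, for illustration only.

def ridge_1d(data, lam):
    n = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + n * lam)

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]  # noisy samples of y ~ x

w_unreg = ridge_1d(data, 0.0)   # pure empirical risk minimization, as in (5.2)
w_reg   = ridge_1d(data, 1.0)   # penalty shrinks the solution toward zero
```

Increasing `lam` shrinks the learned coefficient, trading a small bias for a lower variance, exactly the mechanism discussed above.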

Finally, notice that implicit biases may arise even in the absence of an explicit regularization [49]. Indeed, for the simplest algorithm used for training (gradient descent), a variety of behaviors can already be observed depending on the chosen loss, initialization, and step size [50].10 On the other hand, we have seen that unrolling can be used to infer which algorithm a (D)NN is actually implementing, hence analyzing unrolled networks from a functional approximation perspective can provide some insights. As observed in [23], by parameter tuning and customization, a traditional iterative algorithm spans a relatively small subset of the functions of interest and thus has limited representation power (expressiveness). In other words, it is a reasonable approximation that, however, underfits the data: this limits the
10. The analysis of the training dynamics is in fact a further, quite revealing research direction aimed at understanding how NNs actually work; to this end, the infinite-width limit is a great analytical tool, together with other asymptotic and mean-field approaches (see more details in Section 5.2.5).


performance, but provides good generalization in limited training scenarios. From a statistical learning perspective, iterative algorithms correspond to models with a high bias and a low variance. In contrast, unrolled networks usually better approximate the mapping at hand, thanks to their higher representation power (a consequence of the universal approximation capability of NNs); but a generic (D)NN typically consists of an enormous number of parameters (even billions), which span a large subset of the function space. As a consequence, the training becomes a major challenge requiring a large amount of training samples, and generalization also becomes an issue, as (D)NNs are essentially models with a high variance and a low bias [23]. Since iterative algorithms are developed based on domain knowledge and already provide a reasonably accurate approximation of the mapping underlying the training data, by extending them via training on real data, unrolled networks can often obtain a highly accurate approximation while spanning a relatively small subset of the function space. This alleviates the training burden and the requirement of large training sets. Being in between generic networks and iterative algorithms, unrolled networks typically have a relatively low bias and variance simultaneously, hence generalize better in practice, providing an attractive balance [23]. Finally, it should be highlighted that width also plays a role. In addition to information-theoretic bounds and the relative complexity needed to approximate example functions with different architectures, the topological constraints that the architecture of a NN imposes on the level sets of all the functions it is able to approximate should also be considered [51]. Basically, a NN will fail to build closed decision regions when the width of the layers is less than or equal to the number of inputs, no matter the number of layers (deepness), for a broad family of activation functions [51].
5.2.4 Overparametrized Interpolation, Reproducing Kernel Hilbert Spaces, and Double Descent

Resuming the discussion about overfitting from the previous section, a theoretical conundrum is brought by DNNs: they are highly overparametrized functions trained to overfit or even interpolate data (with very little or no regularization, and empirical loss close to zero) but yield excellent performance. In other words, they generalize well. So is it possible to reconcile this behavior with classical statistical learning theory? A first clue is that overparametrized models do not behave as underparametrized ones: while in the latter case the loss landscape of the solution space exhibits (one, or typically more) local minima (dips), in the overparametrized case the loss landscape may contain entire valleys of minima that are all globally optimal, as shown in Figure 5.3. Thus, the training algorithm may stop at any of such points, but the provided solution might exhibit different test error


Figure 5.3 Differences in the loss landscape of (a) underparametrized and (b) overparametrized models.

and hence generalization capability. It has been noted [52] that the effectiveness of DNNs might be explained by the fact that, thanks to the training of an overparametrized model, they build a spline approximation that provides a set of signal-dependent, class-specific templates against which the input is compared via inner product (i.e., a matched filter similar to model-based processing; see Chapter 1). Other remarkable links exist with related concepts in sampling theory [53] as well as interpolation, which might provide clues about the multifaceted behavior of DNNs. In particular, strong generalization of overfitted classifiers has also been identified not only in DNNs but also, for example, in kernel classifiers; this might thus point to a more profound explanation for the observed behaviors, as discussed in the following. Kernel classifiers are two-layer NNs with a fixed first layer; as such, they are an example of restricted network architecture (see Sections 4.4.1 and 5.2.1). The remarkable aspect of kernel classifiers is that they exhibit great generalization power, which again questions the exact role of deepness [54]. Kernel machines can be viewed as linear regression (interpolants) in an infinite-dimensional RKHS $\mathcal{H}_\infty$, which includes positive-definite kernels such as the Gaussian (also known as RBF; see Section 2.3.3) or the Laplacian:

$$f_{\text{learned}} = \arg\min_{f \in \mathcal{H}_\infty,\, f(x_i) = y_i} \|f\|_{\mathcal{H}_\infty}. \tag{5.5}$$
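A small pure-Python sketch of (5.5) for a Gaussian kernel on made-up 1-D points: the interpolation constraints $f(x_i) = y_i$ reduce to the linear system $Ka = y$ on the kernel Gram matrix, and the resulting expansion interpolates the data exactly.

```python
import math

def solve(A, b):
    # tiny dense linear solver (Gaussian elimination with partial pivoting)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gauss_kernel(u, v):
    return math.exp(-(u - v) ** 2)

# Hypothetical 1-D training points, for illustration only
xs, ys = [0.0, 1.0, 2.0], [1.0, -1.0, 1.0]
K = [[gauss_kernel(a, b) for b in xs] for a in xs]   # kernel Gram matrix
a = solve(K, ys)                                     # coefficients a = K^{-1} y

def f(x):
    # kernel expansion supported on the data points
    return sum(ai * gauss_kernel(xi, x) for ai, xi in zip(a, xs))
```

Since the Gaussian kernel is positive definite on distinct points, `K` is invertible and `f` matches every training pair exactly, which is precisely the interpolating behavior discussed in this section.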

Note that the SVM can be derived as a minimum-norm RKHS classifier (instead of maximizing the geometric soft margin of a hyperplane), then applying the kernel trick. A remarkable result, known as the representer theorem (RT), states that it is effectively possible to reduce computationally cumbersome (or even infeasible) problems in high or infinite dimensions to optimization problems on coefficients [55–57], typically in a much lower dimensional space.11 This property makes kernel classifiers more amenable to theoretical analysis than arbitrary DNNs. As Mikhail Belkin well rendered it, “A neural network is a little bit like a Rube Goldberg machine. You don’t know which part of it is really important.”12

11. Specifically, the minimizer of a regularized empirical risk functional defined over a RKHS can be represented as a finite linear combination of kernel products evaluated on the input points in the training set; that is,

$$f_{\text{learned}} = \sum_{i=1}^{n} a_i K(x_i, \cdot) \tag{5.6}$$

with $a = K^{-1} y$ obtained from the data by minimizing the norm $\|f\|^2 = a^{\top} K a$.
12. https://www.quantamagazine.org/a-new-link-to-an-old-model-could-crack-the-mystery-of-deep-learning-20211011.

Therefore, using restricted (reduced) structures, such as kernel classifiers, may isolate the fundamental engine of a NN. There are numerous observations that very small (or even zero) regularization results in optimal performance or, otherwise put, that classifiers trained to have zero classification (or regression) error are near-optimal. This holds true for kernel classifiers similarly to NNs, and is contrary to what is expected from classical learning theory. Interpolation in particular fits data exactly, and thus has no regularization (hence no bias), but it cannot be expected that arbitrary interpolants then exhibit strong generalization. So the conclusion is that such a capability might be strictly related to properties of kernels and RKHSs, as minimum-RKHS-norm interpolants (which, by the RT, are linear combinations of kernels supported on data points) can inject inductive bias by choosing f with special generalization properties. Although the translation of such results to commonly used DNNs is still an open problem, it is known (as discussed above) that the structure of NNs together with the training algorithm likely introduces inductive bias, similarly to kernels, which thus represent a promising tool for solving the mystery of DNNs. In this respect, a major achievement has been the identification of the so-called double descent phenomenon [58]. The double descent risk curve, depicted in Figure 5.4, incorporates the U-shaped risk curve of the classical regime (compare with Figure 5.2) together with the observed behavior when using high-capacity function classes. This represents the interpolating regime, starting from a threshold that marks the onset of the region where the learned functions have zero training risk. Such a region of overparametrized models, which requires a function space with high capacity, may be exactly where the special properties of DNNs arise.

Translating all this research into the radar context will require overcoming further challenges. A major question is whether an overfitted learning machine can outperform optimal detection algorithms designed on physical models, at least when the latter fit the data well. The investigation of all these types of problems is still in its infancy. The adoption of restricted architectures can encode important domain knowledge into the learning process, thereby incorporating the required


Figure 5.4 The double descent risk curve showing the transition from the classical regime of underparametrized models to the interpolation regime of overparametrized models.

bias, and is thus regarded as a very promising approach to combine model-based with data-driven tools for radar detection. A similar trend can be observed more generally in scientific contexts, with the recently introduced physics-informed neural networks (PINNs) [59]. By exploiting automatic differentiation to differentiate NNs with respect to their input coordinates and model parameters, PINNs are obtained in which constraints reflecting any symmetry, invariance, or conservation principle originating from the physical laws that govern the observed data can be incorporated.13 Indeed, when data is backed by physical phenomena, a vast amount of prior knowledge is simply ignored in adopting a general-purpose, black-box DNN. Principled physical laws and other domain expertise can provide the prior information needed for regularization (i.e., constraining the space of admissible functions to a manageable size by excluding physically implausible or even impossible solutions). In return, encoding such structured information enhances the information content of the data, enabling the learning algorithm to quickly steer itself toward a proper solution and generalize well even with a low amount of training samples [59]. At the same time, it should be clear that incorporating physical constraints does not necessarily make the results of a PINN more interpretable than those of a generic DNN, since the basic overparametrized architecture remains non-human-readable. Explainability of data-driven algorithms will be covered in Section 5.3.2.
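A deliberately simplified, hypothetical PINN-style sketch of this idea follows: a cubic polynomial stands in for the NN, numerical differences stand in for automatic differentiation, and the "physics" loss penalizes violations of the ODE u'(t) = -u(t) with u(0) = 1 at a few collocation points, with no observed data at all.

```python
# Hypothetical PINN-style sketch: fit u(t) = c0 + c1*t + c2*t^2 + c3*t^3 so
# that it satisfies the ODE u'(t) = -u(t) with u(0) = 1 on [0, 1]. Real PINNs
# use a DNN and automatic differentiation; here a polynomial and numerical
# gradients keep the example self-contained. All constants are illustrative.

T = [i / 10 for i in range(11)]            # collocation points in [0, 1]

def u(c, t):
    return c[0] + c[1]*t + c[2]*t**2 + c[3]*t**3

def du(c, t):
    return c[1] + 2*c[2]*t + 3*c[3]*t**2   # exact derivative of the polynomial

def loss(c):
    residual = sum((du(c, t) + u(c, t)) ** 2 for t in T)   # ODE residual term
    return residual + 10.0 * (u(c, 0.0) - 1.0) ** 2        # initial condition

def grad(c, eps=1e-6):
    # central-difference numerical gradient of the loss
    g = []
    for i in range(4):
        cp = c[:]; cp[i] += eps
        cm = c[:]; cm[i] -= eps
        g.append((loss(cp) - loss(cm)) / (2 * eps))
    return g

c = [0.0, 0.0, 0.0, 0.0]
for _ in range(5000):                      # plain gradient descent
    g = grad(c)
    c = [ci - 0.002 * gi for ci, gi in zip(c, g)]
```

After training, the polynomial approximately tracks the physical solution e^{-t} even though no data samples were ever provided: the physics residual alone acts as the regularizing prior, which is the mechanism PINNs exploit.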

13. This is done via general time-dependent and nonlinear partial differential equations.


5.2.5 Mathematics of Deep Learning, Statistical Mechanics, and Signal Processing

So far we have discussed some of the main theoretical efforts to try to understand machine learning and DNNs. They are part of an emerging field of mathematical analysis of learning, which adopts tools from different branches of mathematics to address questions not fully answered within classical learning theory [60]. In addition to the already highlighted aspects (generalization power of overparametrized NNs, role of deepness, apparent absence of the curse of dimensionality), important points still to be fully understood are the successful optimization performance despite the non-convexity (and often non-smoothness) of the problem, as well as a thorough understanding of what features are learned, and of which aspects of an architecture affect the performance, and how [60]. Besides efforts from mathematics and statistics, other analytical tools from statistical mechanics and signal processing enter the game. We already mentioned that idealized models of NNs can be useful to isolate some of the key behavior while allowing for tractability. One of them is the infinite-width limit of a shallow (even single-layer) network, which is useful to assess the role of (implicit) bias (Section 5.2.3) and draws a connection with Gaussian processes [61].14 The infinite limit is also useful to come up with an idealized version of gradient descent, the gradient flow, with the aim of analyzing the dynamics of the training process. In fact, in the limit of infinite width of the hidden layer, the corresponding gradient flow can be proved to converge to the global optimum of the cost function. These types of limit regimes are often called mean-field approximations.
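The relation between gradient descent and its gradient-flow idealization can be illustrated on a toy problem (the cost and step size below are arbitrary choices): for the quadratic cost f(w) = w^2/2 the flow has the exact solution w(t) = w0 e^{-t}, and discrete gradient descent with a small step size tracks it closely.

```python
import math

# Toy comparison of gradient descent with its continuous-time idealization,
# the gradient flow, on f(w) = w^2 / 2 (hypothetical cost, for illustration).
# GD: w_{k+1} = w_k - eta * f'(w_k) = (1 - eta) * w_k
# Flow: w'(t) = -f'(w(t))  =>  w(t) = w0 * exp(-t)

def gd(w0, eta, steps):
    w = w0
    for _ in range(steps):
        w -= eta * w          # gradient of w^2/2 is w
    return w

w0, eta, steps = 1.0, 0.001, 3000
w_gd = gd(w0, eta, steps)                 # discrete iterates
w_flow = w0 * math.exp(-eta * steps)      # flow evaluated at time t = steps*eta
```

With `eta = 0.001` the two trajectories agree to within a small fraction of a percent at t = 3, which is why the (more tractable) flow is a useful proxy for analyzing training dynamics.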
Their main weakness is that they do not provide a way to assess how large is large enough for the limit regime to be an accurate representation of reality (otherwise put, how fast the convergence is, and to what extent finite-size effects can be neglected). Nonetheless, mean-field theory is a standard tool in statistical mechanics and can indeed provide important theoretical understanding thanks to a more tractable mathematics. In such a field of theoretical physics applied to NNs and learning theory, stochastic machines (in particular, Hopfield associative networks [62] and restricted Boltzmann machines, pioneered by Geoffrey Hinton [63]) are among the most basic elements [64]. More generally, the statistical mechanics of spin glasses (also named complex statistical mechanics [65, 66]) contributed to the mathematical foundation of certain classes of NNs [67]. Hinton’s work then met Bengio’s and LeCun’s research on DNNs, with well-known groundbreaking results [3]. Stochastic machines can be thought of as the classical limit (storing pairwise correlation functions, as in the central limit theorem common to probability and
14. It has been shown that a single-layer fully connected NN with an iid prior over its parameters is equivalent to a Gaussian process in the limit of infinite width. This can be also related to the role of kernel functions discussed in Section 5.2.4.


statistical mechanics), and a broader theory where all higher-order correlations are accounted for (the so-called relativistic extension) might provide a contribution to the understanding of DNNs in the future, since it shows some deep-learning-like characteristics. One of the drawbacks of such approaches is that they treat neurons as stochastic units and do not explicitly consider the role of supervised learning. However, certain multilayer feedforward NNs, namely Deep Belief Networks (DBNs), can be effectively pretrained by treating each layer as an unsupervised RBM [68, 69], then fine-tuned by supervised backpropagation [70];15 this is consistent with the interpretation that the several layers of a DNN act as feature extractors for efficient (sparse) representation/coding (Section 5.2.2) before classification.16 Bengio [76] generalized this approach, which works especially well for small datasets, to other unsupervised representation-learning techniques such as autoencoders [3]. This represents a further possible motivation to keep studying stochastic machines.
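The associative-memory behavior of a Hopfield network can be sketched in a few lines (toy sizes and pattern, for illustration only): a single ±1 pattern is stored via the Hebbian rule W[i][j] = p[i]·p[j] (zero diagonal), and a corrupted probe is recalled by sign updates.

```python
# Minimal Hopfield associative network: store one +/-1 pattern with the
# Hebbian rule and recall it from a corrupted probe. Hypothetical toy
# pattern, for illustration only.

def store(p):
    n = len(p)
    # Hebbian outer-product weights with zero self-connections
    return [[p[i] * p[j] if i != j else 0 for j in range(n)] for i in range(n)]

def recall(W, x, iters=5):
    for _ in range(iters):
        # synchronous sign update of all units
        x = [1 if sum(W[i][j] * x[j] for j in range(len(x))) >= 0 else -1
             for i in range(len(x))]
    return x

pattern = [1, -1, 1, 1, -1, 1]
W = store(pattern)
probe = [-1, -1, 1, 1, -1, 1]       # first bit flipped
recalled = recall(W, probe)
```

With a single stored pattern and one flipped bit, the update drives the state back to the stored memory in one step; this pattern-completion dynamic is the basic "stochastic machine" behavior (here in its deterministic, zero-temperature form) discussed above.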

15. It has also been shown [71] that RBMs (trained by contrastive divergence) are equivalent to Hopfield networks (trained by Hebbian learning), since they actually represent the same model.
16. Information theory and signal processing do indeed have connections with statistical physics [72, 73]. Such tools have been shown to be able to explain the properties of certain DNNs (e.g., GANs [74, 75]). Additional connections have been highlighted in Section 5.2.2.

5.3 Open Issues

5.3.1 Adversarial Attacks

Besides all the theoretical mysteries that still hamper a full understanding of DNNs’ functioning, a major issue with machine learning algorithms is that they are often vulnerable to adversarial manipulation of their input intended to cause incorrect classification [77–79]. This is particularly true for DNNs, which are highly vulnerable to attacks, even when the input modification is so subtle that a human observer does not even notice it [16, 80]. Moreover, black-box attacks can be conceived that exploit adversarial sample transferability on broad classes of machine learning classification algorithms, namely not only DNNs but also SVM, decision trees, KNN, and ensemble methods [81]. Adversarial attacks certainly pose security concerns, but may more generally be the symptom of a more profound limit of mainstream supervised learning. In fact, adversarial examples may appear to be in contradiction with generalization capabilities (especially for DNNs), particularly considering that adversarial data are quite indistinguishable from normal data. A possible explanation is that the former constitute a particular (yet large) set that occurs with extremely low probability, hence is never (or rarely) observed in the training set because it is far from the natural data manifold. As a consequence, the classifier is completely unaware of it, being a completely new concept that is just mapped onto what was learned


during the training (thereby producing a wrong classification, since the classifier has never been trained on examples far from natural data). Furthermore, natural data manifolds are typically low-dimensional, while noisy (adversarial) examples, generated with even small perturbations, are very high-dimensional and hence much harder to learn [82]. This phenomenon is receiving a tremendous amount of attention, and many attempts have been made to defend DNNs against adversarial examples; still, defending against all possible methods of attack remains challenging [82]. Nonetheless, nonadversarial examples can also be problematic, since in the end the classification task is based on pure pattern recognition and does not truly represent intelligence. This makes it critical to really assess the reliability of machine learning. Many situations are not included in the training, but a human could easily tell what is the right action to take, and this applies to natural language processing as well as to autonomous driving17 and radar detection.

5.3.2 Stability, Efficiency, and Interpretability

In addition to adversarial attack vulnerability, another major limitation of DNNs is their gross inefficiency [83]. Supervised learning is indeed very different from how humans or animals learn, and represents more a kind of brute-force approach than intelligence.18 The human brain builds models based on multisensor observations acquired in dynamic conditions, and many different perspectives are averaged (majority vote) to make decisions, as happens with ensemble techniques in statistical learning. The multiple sensing modalities provide robustness (stability) to decisions, even under significant changes, which are heavily supported by a multitude of diverse information.
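The majority-vote mechanism mentioned above can be shown with a deterministic toy example (labels and "sensors" are made up): three weak predictors each err on a different example, so their vote is correct everywhere.

```python
# Toy deterministic illustration of majority voting: three weak "sensors"
# misclassify different examples; the vote over them outperforms each one.
# Hypothetical labels, for illustration only.

truth    = [1, 0, 1, 1, 0, 0]
sensor_a = [1, 0, 1, 0, 0, 0]   # wrong on example 3
sensor_b = [1, 1, 1, 1, 0, 0]   # wrong on example 1
sensor_c = [0, 0, 1, 1, 0, 0]   # wrong on example 0

def acc(pred):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

vote = [1 if a + b + c >= 2 else 0
        for a, b, c in zip(sensor_a, sensor_b, sensor_c)]
```

Each sensor alone is 5/6 accurate, while the vote recovers all labels: diversity among the errors is what makes the ensemble (and, by analogy, multisensor decision making) robust.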
It can thus be expected that a paradigm shift based on different perspectives (representations) and adaptiveness (continuous learning) is needed to gain stability and efficiency in machine learning in a fully unsupervised way [4]. It is not just a matter of size: in fact, although increasing the number of parameters often improves performance, it is not enough to get rid of the fragility and limitations of mainstream DNN approaches. Furthermore, once a NN has been trained, it is difficult to understand how and why decisions are made, which in many contexts is simply unacceptable given the important consequences that a decision can have, radar detection being one of those. An attempt to introduce interpretability/explainability (hence trust) in machine learning is in progress, under the names of explainable AI, interpretable AI, or transparent AI [84, 85].
17. As of 2022, it has been reported that the autopilot software of a leading manufacturer of self-driving cars still confuses a rising moon low above the horizon with a yellow traffic light. This is an example of a situation, totally obvious for a human, that is rarely included in the training set.
18. Reinforcement learning, although a more sophisticated approach, still requires a huge number of interactions to learn, which is clearly very different from biological learning.


While there is a long way to go, it is still wise to generally avoid overusing machine learning for tasks for which we know the solution. It has been shown, for instance, that machine learning can be used to learn well-known signal processing algorithms, such as the FFT or LMS. On the one hand, signal processing algorithms are computationally efficient, since they are specific to the particular problems they solve; on the other hand, machine learning algorithms are much more general and can perform other tasks using a different set of weights, but such flexibility comes at the price of inefficiency, fragility (e.g., vulnerability to adversarial attacks), and lack of interpretability. One should not listen to the siren song of AI (and its hype) unconditionally: depending on the context, accepting inferior performance but preserving the interpretability and efficiency of model-based approaches can instead be a preferable option.19 There is still a performance gap when machine learning systems go from the lab to the field, especially in high-entropy environments [87], and again, radar detection is one of those. For DNNs, a common approach to try to overcome such limitations is to use larger datasets, hoping they will cover a wider range of scenarios and reduce the chances of failure in the real world. In this respect, an advantage of using synthetic data for training DNNs (Section 3.2.2) is that very diverse, even unusual or dangerous, conditions can be represented and included in the training.20 In fact, even the most sophisticated DNNs are very far from the generalization capabilities of humans in complex contexts. Indeed, the human brain has the capability to correctly interpret impressively novel combinations of learned concepts, even very unlikely ones, as long as they obey some high-level patterns acquired through continuous learning (even on completely different data and contexts).
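The LMS algorithm cited above as a "known solution" is itself only a few lines long, which makes the efficiency argument concrete. The sketch below (tap values, step size, and input are illustrative) identifies an unknown 2-tap FIR system from its input/output samples.

```python
import random

# Classical LMS adaptive filter: identify an unknown 2-tap FIR system
# y[n] = h0*x[n] + h1*x[n-1] from input/output samples. The taps, step size,
# and +/-1 input sequence below are hypothetical, for illustration only.

random.seed(0)
h_true = [0.5, -0.3]                       # unknown system to identify
x = [random.choice([-1.0, 1.0]) for _ in range(500)]

w = [0.0, 0.0]                             # adaptive filter weights
mu = 0.05                                  # LMS step size
for n in range(1, len(x)):
    u = [x[n], x[n - 1]]                   # current input vector
    d = h_true[0] * u[0] + h_true[1] * u[1]  # desired (system) output
    e = d - (w[0] * u[0] + w[1] * u[1])    # a priori error
    w = [w[k] + mu * e * u[k] for k in range(2)]  # LMS weight update
```

In this noiseless setting the weights converge to the true taps after a few hundred updates, with a fixed, tiny computational cost per sample: the kind of efficiency a generic learned replacement would struggle to match.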
Concerns about the stability/robustness and interpretability/explainability of NNs extend more generally to the broad field of data science [88]. It has been observed, for instance, that after introducing a controlled amount of nonlinearity, standard tests of goodness-of-fit do not reject linearity until the nonlinearity is extreme [89]. This means that model-based approaches, hence model fitting, are also not immune from reliability issues. Similar considerations can be made for interpretability: when performing dimensionality reduction, for instance,
19. The trade-off between accuracy and interpretability is a more general one in statistical learning [86]; linear methods are generally less flexible than nonlinear ones (splines, SVM, boosting, NNs, etc.). However, with linear models it is much easier to understand which variables are more important to the decision compared to nonlinear methods. Moreover, inflexible models are typically less prone to overfitting, hence may generalize better, with the possible exception of DNNs (as discussed).
20. For instance, when training an autonomous vehicle, it would be good to have dangerous situations where people or animals are found along the streets, to teach the system to cope with such situations, but this is clearly rare in a real training set. Similar situations can be imagined for radar tasks.


few variables with the lowest residual sum-of-squares (RSS) or lowest test error are selected, but many few-variable subsets have an RSS within 1% of the lowest [90]. As a consequence, under a slight data perturbation, different few-variable models are learned, and each one tells a different story about which variables are important, thereby questioning their interpretative value. Interpretation difficulties, one of the main criticisms of deep learning, are thus to some extent not unique to data-driven techniques. Leo Breiman coined the term “two cultures” to refer to the fact that the contrast between the model-based and the data-driven paradigms should be framed within a more general debate about statistics and its tools, also on the epistemological side (see, e.g., the Rashomon effect [89]). For instance, Fisher’s “unbiasedness culture” should be put in context with the data-driven world. In this respect, a parallelism can also be attempted with the role of CFARness for the problem of radar detection: as observed in several cases, a much better probability of detection can be obtained by accepting a small deviation from exact theoretical CFARness, which in a sense bears some resemblance to the bias-variance trade-off. This and other crucial aspects, such as the oft-observed accuracy vs model complexity trade-off, definitely need to be backed by a solid and comprehensive theory.

5.3.3 Visualization

The well-known adage “a picture is worth a thousand words” may well apply to complex machine learning algorithms, because visualizing the information flow within the black box can significantly help in the quest for understanding data-driven methods. Actually, as Edsger Dijkstra once remarked, “a picture may be worth a thousand words, a formula is worth a thousand pictures”21; therefore, a formal theory of deep learning would be even more important, but the latter is still missing while visualization is already here to help.
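The projection-based visualization idea can be sketched with plain PCA (used here instead of t-SNE purely for brevity; the cluster data and sizes are made up): 3-D points from two clusters are projected onto the leading principal direction, where they remain separated and can be plotted in 1-D.

```python
# Minimal sketch of projection-based visualization: PCA via power iteration
# on a tiny two-cluster 3-D dataset. Hypothetical data, for illustration only.

def pca_direction(data, iters=100):
    n = len(data)
    mean = [sum(p[k] for p in data) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in data]
    # sample covariance matrix
    C = [[sum(p[i] * p[j] for p in centered) / n for j in range(3)]
         for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):                 # power iteration -> top eigenvector
        w = [sum(C[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    return mean, v

cluster1 = [[0.0, 0.1, 0.0], [0.2, 0.0, 0.1], [0.1, 0.2, 0.0]]
cluster2 = [[5.0, 5.1, 5.0], [5.2, 5.0, 4.9], [4.9, 5.2, 5.1]]
data = cluster1 + cluster2

mean, v = pca_direction(data)
# 1-D coordinates suitable for a scatterplot
proj = [sum((p[k] - mean[k]) * v[k] for k in range(3)) for p in data]
```

The two clusters land on opposite sides of the projected axis, so their structure survives the reduction; t-SNE and similar tools apply the same principle with nonlinear, neighborhood-preserving embeddings.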
And indeed visual inspection is highly valuable both for exploratory analysis and for feedback on the learning dynamics and posttraining behavior. One of the simplest tools to this end is dimensionality reduction (see Sections 2.3.2 and 2.3.5), which provides visualization of DNNs that is scalable in the number of dimensions and observations. By computing suitable projections, high-dimensional data can be represented in lower-dimensional (usually 2-D) spaces while trying to preserve the main characteristics of the data distribution, the relationships (distances) between points, and the presence of clusters [91]. When depicted by scatterplots, projections not only can shed light on the data at hand, but may also aid existing approaches for understanding and improving NNs, for instance by unveiling relationships between learned representations and similarities
21. “A first exploration of effective reasoning,” E.W. Dijkstra Archive, Center for American History, University of Texas at Austin, July 1996, https://www.cs.utexas.edu/users/EWD/transcriptions/EWD12xx/EWD1239.html.

Coluccia: “chapter5_v2” — 2022/10/7 — 13:07 — page 184 — #20

Theories, Interpretability, and Other Open Issues


among neurons. Projections can be easily created using a fast (approximate) implementation of t-SNE ([92]; see also Section 2.3.5), which can show the activation of intermediate layers, explain misclassifications, and help in understanding the learning process by showing how activations and clusters change during training. By using hue and brightness gradients in matrix or graph representations, it is possible to show in sequential frames how activation data flows through the network layers and how it evolves during training. Several software tools have been developed for data and algorithm visualization that can be very useful for understanding machine learning and DNNs in particular. A noticeable online resource is Distill (https://distill.pub), which provides free articles full of visualization tools to (paraphrasing) "nurture the development of new ways of thinking through new notations, visualizations, and mental models that can deepen our understanding." Examples from [93] are shown in Figure 5.5.

5.3.4 Sustainability, Marginal Return, and Patentability

The success of DNNs in several applications has made this machine learning approach an influential one in many engineering and data science contexts, over a time span of only a couple of decades. As discussed in Section 2.4.1, the concept behind DNNs is not new and dates back to Rosenblatt's multilayer perceptron (1958), which however could not rely on the impressive computational power of today's processors and GPUs. Decades of hardware improvements yielded an impressive increase in computational performance, so the old idea became reality thanks to a favorable historical moment. Even Rosenblatt, in his pioneering paper, warned about the complexity of the NN computational model, and today's deep-learning researchers are similarly nearing the frontier of what can be achieved [94], as discussed in more detail in the following.
Deep learning can be considered the current arrival point of a long-running trend in AI that has been moving from low-parametric systems based on rules, logic, and expert knowledge to overparameterized tools such as DNNs. The latter can fit any type of data, and therefore have virtually unlimited flexibility (which explains their success in so many different domains), even though this typically means that there are more parameters to learn than data available for training; but, as discussed, DNNs break the standard rule of overfitting and hence generalize well. The bad news is that this flexibility comes at an enormous computational price and, more importantly, the marginal cost of improvement is much larger than its return [94]. In fact, by extrapolating the gains of recent years, it can be predicted that the error level of the best DNNs might improve by just a few percentage points in the next few years, while the computing resources and energy required for training would be enormous, with a growth rate that



Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches

Figure 5.5  Examples of visualization tools aimed at improving the understanding of neural networks, from [93], in particular for the task of handwritten digit recognition. From top to bottom: t-SNE applied to the MNIST dataset; breakdown across layers with grayscale-gradient visualization of the activations in a CNN; and clustering of instances from the different classes and their relationship with adversarial examples.

is polynomial. This, in turn, leads to the emission of a huge amount of carbon dioxide to produce the needed energy. It has been estimated that training a single DNN has the carbon footprint of five cars over their entire lifetimes [95]. We could thus say that spending brain power on the design of model-based algorithms encoding expert knowledge not only can lead to more interpretable solutions, but is also a much more environmentally friendly option. There is actually some truth behind this idea. Models for which domain experts have established the relevant variables are able to learn, with much less



computation and on limited data, the best values for those variables. Flexible models can provide better accuracy, since many possible variables are tested to discover the most important ones, possibly ending up selecting some not identified by the experts in a model-based approach; but this requires a tremendous computational effort and a huge amount of data. As a matter of fact, the environmental and economic sustainability of deep learning is highly questionable. Milestone achievements in this field (Google DeepMind's AlphaGo, OpenAI's GPT-3) ended up costing several tens of millions of dollars, and some alternative solutions were never investigated (nor was a bug fix that required retraining implemented) because the training cost would have been too high [94]. While mass-market scaling of a widely used model may well spread the training cost, for models that frequently need to be retrained the training cost will still unacceptably affect the final product's price. A possible solution to overcome this showstopper is the use of dedicated processors (specialized GPU, FPGA, or ASIC hardware) that sacrifice the generality of the computing platform for efficiency, but again diminishing returns are observed with current architectures. It can be predicted that a completely different paradigm is needed, for example based on analog, neuromorphic (as mentioned at the beginning of Section 5.2), optical, or quantum systems. Other approaches instead try to combine the power of expert knowledge and reasoning with the flexibility of DNNs. In any case, green AI is likely to gain momentum in the near future [96]. A final issue is worth a mention: patentability. According to many intellectual property laws, in fact, the inventor must be a human; a DNN, however, is the result of algorithmic training, not of human intellectual effort, unless special innovations are introduced in its architecture.
Moreover, sufficiency of disclosure (enablement) is prescribed by patent law: a patent application must disclose the claimed invention in sufficient detail that the "person skilled in the art" could carry it out. In this respect, is a table listing all the weights of a NN comparable to a patentable design? The European Patent Office (EPO)'s guidelines state that, like any other computer-implemented invention, an invention using AI or machine learning must solve a technical problem to be patentable; AI and machine learning methods per se are considered mathematical methods, which are intrinsically "devoid of technical character" and thus not patentable [97]. How this will shape the future of AI-based technology, and likewise require the development of new juridical concepts for intellectual property protection and rights, is certainly a further important piece in the giant puzzle of the AI revolution.

5.4 Summary

This final chapter discussed several of the most important open issues in, and criticisms of, the data-driven paradigm, especially when based on DNNs as opposed



to more traditional approaches. Promising ideas toward the development of a foundational theory of deep learning were illustrated, aimed in particular at understanding the excellent generalization capabilities of DNNs. At the same time, open problems regarding explainability and interpretation, adversarial attacks, stability, sustainability, marginal return, and patentability were discussed. All such issues need to be carefully considered when making design choices for mission-critical tasks such as radar detection. At the same time, the potential unleashed by data-driven methods, especially in combination with more principled (model-based) approaches, is likely to provide significant advances in the radar field at large in the near future.

References

[1] Yu, D., and L. Deng, "Deep Learning and Its Applications to Signal and Information Processing," IEEE Signal Processing Magazine, No. 1, 2010, pp. 145–154.
[2] Hinton, G., "Deep Neural Networks for Acoustic Modeling in Speech Recognition," IEEE Signal Processing Magazine, Vol. 29, No. 6, November 2012.
[3] Goodfellow, I., Y. Bengio, and A. Courville, Deep Learning, Cambridge, MA: MIT Press, 2016.
[4] Eldar, Y. C., A. O. Hero III, L. Deng, et al., "Challenges and Open Problems in Signal Processing: Panel Discussion Summary from ICASSP 2017 [Panel and Forum]," IEEE Signal Processing Magazine, Vol. 34, No. 6, 2017, pp. 8–23.
[5] Greco, M. S., J. Li, T. Long, and A. Zoubir, "Advances in Radar Systems for Modern Civilian and Commercial Applications: Part 2 [From the Guest Editors]," IEEE Signal Processing Magazine, Vol. 36, No. 5, 2019, pp. 16–18.
[6] Majumder, U., E. Blasch, and D. Garren, Deep Learning for Radar and Communications Automatic Target Recognition, Norwood, MA: Artech House, 2020.
[7] Knight, W., "DARPA Is Funding Projects That Will Try to Open Up AI's Black Boxes," MIT Technology Review, https://www.technologyreview.com/2017/04/13/152590/the-financial-world-wants-to-open-ais-black-boxes/.
[8] Papyan, V., Y. Romano, J. Sulam, and M. Elad, "Theoretical Foundations of Deep Learning via Sparse Representations: A Multilayer Sparse Model and Its Connection to Convolutional Neural Networks," IEEE Signal Processing Magazine, Vol. 35, No. 4, 2018, pp. 72–89.
[9] Bruna, J., and S. Mallat, "Invariant Scattering Convolution Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 8, 2013.
[10] Gripon, V., A. Ortega, and B. Girault, "An Inside Look at Deep Neural Networks Using Graph Signal Processing," in ITA Workshop, 2018.
[11] Goodfellow, I. J., J. Pouget-Abadie, M. Mirza, et al., "Generative Adversarial Networks," 2014, https://doi.org/10.48550/arXiv.1406.2661.
[12] Merolla, P. A., et al., "A Million Spiking-Neuron Integrated Circuit with a Scalable Communication Network and Interface," Science, Vol. 345, 2014.


[13] Maass, W., "To Spike or Not to Spike: That Is the Question," Proceedings of the IEEE, Vol. 103, No. 12, December 2015.
[14] Nguyen, A., J. Yosinski, and J. Clune, "Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[15] Kurakin, A., I. J. Goodfellow, and S. Bengio, "Adversarial Examples in the Physical World," in 5th International Conference on Learning Representations (ICLR), 2017.
[16] Szegedy, C., W. Zaremba, I. Sutskever, et al., "Intriguing Properties of Neural Networks," ICLR, http://arxiv.org/abs/1312.6199, 2014.
[17] Jain, A. K., and J. Mao, "A K-Nearest Neighbor Artificial Neural Network Classifier," in International Joint Conference on Neural Networks, 1991.
[18] Chen, Y. Q., R. I. Damper, and M. S. Nixon, "On Neural-Network Implementations of K-Nearest Neighbor Pattern Classifiers," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 44, No. 7, 1997, pp. 622–629.
[19] Bhattacharya, T., and S. Haykin, "Neural Network-Based Radar Detection for an Ocean Environment," IEEE Transactions on Aerospace and Electronic Systems, Vol. 33, No. 2, 1997, pp. 408–420.
[20] Haykin, S. S., Neural Networks and Learning Machines, Third Edition, Upper Saddle River, NJ: Pearson Education, 2009.
[21] Sum, J., W.-K. Kan, and G. Young, "A Note on the Equivalence of NARX and RNN," Neural Computing & Applications, Vol. 8, 1999, pp. 33–39.
[22] Gregor, K., and Y. LeCun, "Learning Fast Approximations of Sparse Coding," in Proceedings of the 27th International Conference on Machine Learning (ICML'10), Madison, WI: Omnipress, 2010, pp. 399–406.
[23] Monga, V., Y. Li, and Y. C. Eldar, "Algorithm Unrolling: Interpretable, Efficient Deep Learning for Signal and Image Processing," IEEE Signal Processing Magazine, Vol. 38, No. 2, 2021, pp. 18–44.
[24] He, H., C.-K. Wen, S. Jin, and G. Y. Li, "Model-Driven Deep Learning for MIMO Detection," IEEE Transactions on Signal Processing, Vol. 68, 2020, pp. 1702–1715.
[25] Haber, E., and L. Ruthotto, "Stable Architectures for Deep Neural Networks," arXiv:1705.03341, 2017.
[26] Anden, J., V. Lostanlen, and S. Mallat, "Joint Time-Frequency Scattering," IEEE Transactions on Signal Processing, Vol. 67, No. 14, 2019, pp. 3704–3718.
[27] LeCun, Y., Y. Bengio, and G. Hinton, "Deep Learning," Nature, Vol. 521, No. 7553, 2015, p. 436.
[28] Hinton, G., L. Deng, D. Yu, et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," IEEE Signal Processing Magazine, Vol. 29, No. 6, 2012, pp. 82–97.
[29] Anselmi, F., J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio, "Unsupervised Learning of Invariant Representations in Hierarchical Architectures," arXiv:1311.4158, 2013.


[30] Mallat, S., "Understanding Deep Convolutional Networks," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2016, 20150203.
[31] Papyan, V., Y. Romano, and M. Elad, "Convolutional Neural Networks Analyzed via Convolutional Sparse Coding," The Journal of Machine Learning Research, 2017.
[32] Sulam, J., V. Papyan, Y. Romano, and M. Elad, "Multilayer Convolutional Sparse Modeling: Pursuit and Dictionary Learning," IEEE Transactions on Signal Processing, Vol. 66, No. 15, 2018, pp. 4090–4104.
[33] Tishby, N., and N. Zaslavsky, "Deep Learning and the Information Bottleneck Principle," in IEEE Information Theory Workshop, 2015.
[34] Shwartz-Ziv, R., and N. Tishby, "Opening the Black Box of Deep Neural Networks via Information," arXiv:1703.00810, 2017.
[35] Tishby, N., F. C. Pereira, and W. Bialek, "The Information Bottleneck Method," in 37th Annual Allerton Conference on Communication, Control, and Computing, September 1999, pp. 368–377.
[36] Saxe, A. M., Y. Bansal, J. Dapello, et al., "On the Information Bottleneck Theory of Deep Learning," in International Conference on Learning Representations, 2018.
[37] Noshad, M., Y. Zeng, and A. O. Hero, "Scalable Mutual Information Estimation Using Dependence Graphs," arXiv:1801.09125, 2018.
[38] Goldfeld, Z., E. van den Berg, K. Greenewald, et al., "Estimating Information Flow in Deep Neural Networks," 2018, https://arxiv.org/abs/1810.05728.
[39] Geiger, B. C., "On Information Plane Analyses of Neural Network Classifiers–A Review," 2020, https://arxiv.org/abs/2003.09671.
[40] Cybenko, G., "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals and Systems, Vol. 2, No. 4, 1989, pp. 303–314.
[41] Hornik, K., M. Stinchcombe, and H. White, "Multilayer Feedforward Networks Are Universal Approximators," Neural Networks, 1989.
[42] Sandberg, I., and L. Xu, "Uniform Approximation of Multidimensional Myopic Maps," IEEE Transactions on Circuits and Systems, 1997.
[43] Murphy, K. P., Machine Learning: A Probabilistic Perspective, MIT Press, 2013.
[44] Domingos, P., The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, New York: Basic Books, 2015.
[45] Cohen, N., O. Sharir, and A. Shashua, "On the Expressive Power of Deep Learning: A Tensor Analysis," in 29th Annual Conference on Learning Theory (COLT), 2016.
[46] Rolnick, D., and M. Tegmark, "The Power of Deeper Networks for Expressing Natural Functions," in International Conference on Learning Representations, 2018.
[47] Vapnik, V. N., The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995.
[48] Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, New York: Springer, 2001.


[49] Neyshabur, B., R. Tomioka, and N. Srebro, "In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning," arXiv:1412.6614, 2014.
[50] Chizat, L., and F. Bach, "Implicit Bias of Gradient Descent for Wide Two-Layer Neural Networks Trained with the Logistic Loss," in Proceedings of Machine Learning Research, 33rd Annual Conference on Learning Theory, Vol. 125, 2020, pp. 1–34.
[51] Johnson, J., "Deep, Skinny Neural Networks Are Not Universal Approximators," 2018, https://arxiv.org/abs/1810.00393.
[52] Balestriero, R., and R. G. Baraniuk, "A Spline Theory of Deep Networks," in 35th International Conference on Machine Learning, 2018.
[53] Giryes, R., "A Function Space Analysis of Finite Neural Networks with Insights from Sampling Theory," arXiv:2004.06989, 2020.
[54] Belkin, M., S. Ma, and S. Mandal, "To Understand Deep Learning We Need to Understand Kernel Learning," in 35th International Conference on Machine Learning, 2018.
[55] Aronszajn, N., "Theory of Reproducing Kernels," Transactions of the American Mathematical Society, 1950.
[56] Kimeldorf, G., and G. Wahba, "A Correspondence between Bayesian Estimation of Stochastic Processes and Smoothing by Splines," The Annals of Mathematical Statistics, 1970.
[57] Scholkopf, B., R. Herbrich, and A. J. Smola, "A Generalized Representer Theorem," in Proceedings of the Annual Conference on Learning Theory, 2001.
[58] Belkin, M., D. Hsu, S. Ma, and S. Mandal, "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off," Proceedings of the National Academy of Sciences, 2019.
[59] Raissi, M., P. Perdikaris, and G. E. Karniadakis, "Physics Informed Deep Learning (Part I): Data-Driven Solutions of Nonlinear Partial Differential Equations," arXiv:1711.10561, 2017.
[60] Berner, J., P. Grohs, G. Kutyniok, and P. Petersen, "The Modern Mathematics of Deep Learning," arXiv:2105.04026, 2021.
[61] Lee, J., Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, "Deep Neural Networks as Gaussian Processes," https://arxiv.org/abs/1711.00165, 2017.
[62] Hopfield, J., "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proceedings of the National Academy of Sciences, Vol. 79, No. 8, 1982.
[63] Ackley, D., G. Hinton, and T. Sejnowski, "A Learning Algorithm for Boltzmann Machines," Cognitive Science, Vol. 9, 1985.
[64] Amit, D., H. Gutfreund, and H. Sompolinsky, "Storing Infinite Numbers of Patterns in a Spin-Glass Model of Neural Networks," Physical Review Letters, Vol. 55, No. 14, 1985.
[65] Coolen, A., R. Kuehn, and P. Sollich, Theory of Neural Information Processing Systems, Oxford University Press, July 2005.
[66] Nishimori, H., Statistical Physics of Spin Glasses and Information Processing: An Introduction, Oxford University Press, 2001.


[67] Mezard, M., G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, Volume 9, Singapore: World Scientific, 1987.
[68] Hinton, G. E., S. Osindero, and Y. W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, Vol. 18, No. 7, 2006.
[69] Hinton, G. E., "Learning Multiple Layers of Representation," Trends in Cognitive Sciences, Vol. 11, 2007.
[70] Bengio, Y., "Practical Recommendations for Gradient-Based Training of Deep Architectures," arXiv:1206.5533, 2012.
[71] Barra, A., A. Bernacchia, E. Santucci, and P. Contucci, "On the Equivalence of Hopfield Networks and Boltzmann Machines," Neural Networks, Vol. 34, 2012, pp. 1–9.
[72] Merhav, N., D. Guo, and S. Shamai, "Statistical Physics of Signal Estimation in Gaussian Noise: Theory and Examples of Phase Transitions," IEEE Transactions on Information Theory, Vol. 56, No. 3, March 2010.
[73] Krzakala, F., M. Mezard, F. Sausset, Y. F. Sun, and L. Zdeborova, "Statistical-Physics-Based Reconstruction in Compressed Sensing," Physical Review X, 2012.
[74] Arora, S., et al., "Generalization and Equilibrium in Generative Adversarial Nets (GANs)," arXiv:1703.00573v5, 2017.
[75] Biau, G., et al., "Some Theoretical Properties of GANs," arXiv:1803.07819v1, 2018.
[76] Bengio, Y., et al., "Greedy Layer-Wise Training of Deep Networks," Proceedings of Advances in Neural Information Processing Systems, Vol. 19, 2006.
[77] Yampolskiy, R., Artificial Intelligence Safety and Security, CRC Press, 2018, https://books.google.ca/books?id=ekkPEAAAQBAJ.
[78] Dalvi, N., P. Domingos, S. Sanghai, et al., "Adversarial Classification," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 99–108.
[79] Brown, T. B., D. Mane, A. Roy, M. Abadi, and J. Gilmer, "Adversarial Patch," arXiv:1712.09665, 2017.
[80] Goodfellow, I. J., J. Shlens, and C. Szegedy, "Explaining and Harnessing Adversarial Examples," http://arxiv.org/abs/1412.6572, 2014.
[81] Papernot, N., P. McDaniel, and I. Goodfellow, "Transferability in Machine Learning: From Phenomena to Black-Box Attacks Using Adversarial Samples," http://arxiv.org/abs/1605.07277, May 2016.
[82] Jalal, A., A. Ilyas, C. Daskalakis, and A. G. Dimakis, "The Robust Manifold Defense: Adversarial Training Using Generative Models," arXiv:1712.09196, July 10, 2019.
[83] Waldrop, M. M., "What Are the Limits of Deep Learning?" Proceedings of the National Academy of Sciences, Vol. 116, No. 4, January 22, 2019.
[84] Gunning, D., "Explainable Artificial Intelligence (XAI)," https://www.darpa.mil/program/explainable-artificial-intelligence, 2017.


[85] Arrieta, A. B., N. Diaz-Rodriguez, J. Del Ser, et al., "Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges Toward Responsible AI," Information Fusion, 2020, pp. 82–115, http://arxiv.org/abs/1910.10045.
[86] James, G., D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R, Springer, 2013, https://faculty.marshall.usc.edu/gareth-james/ISL/.
[87] Bengio, Y., Y. Lecun, and G. Hinton, "Deep Learning for AI," Communications of the ACM, Vol. 64, No. 7, July 2021, pp. 58–65.
[88] Donoho, D., "50 Years of Data Science," Journal of Computational and Graphical Statistics, Vol. 26, No. 4, 2017.
[89] Breiman, L., "Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author)," Statistical Science, Vol. 16, No. 3, 2001.
[90] Breiman, L., "The Heuristics of Instability in Model Selection," The Annals of Statistics, Vol. 24, No. 6, 1996, pp. 2350–2383.
[91] Rauber, P. E., S. G. Fadel, A. X. Falcao, and A. C. Telea, "Visualizing the Hidden Activity of Artificial Neural Networks," IEEE Transactions on Visualization and Computer Graphics, Vol. 23, No. 1, 2017, pp. 101–110.
[92] van der Maaten, L., and G. Hinton, "Visualizing High-Dimensional Data Using t-SNE," Journal of Machine Learning Research, Vol. 9, No. 11, 2008, pp. 2579–2605.
[93] Li, M., Z. Zhao, and C. Scheidegger, "Visualizing Neural Networks with the Grand Tour," Distill, 2020, https://distill.pub/2020/grand-tour.
[94] Thompson, N. C., K. Greenewald, K. Lee, and G. F. Manso, "Deep Learning's Diminishing Returns: The Cost of Improvement Is Becoming Unsustainable," IEEE Spectrum, September 24, 2021, https://spectrum.ieee.org/amp/deeplearning-computational-cost-2655082754.
[95] Strubell, E., A. Ganesh, and A. McCallum, "Energy and Policy Considerations for Deep Learning in NLP," in 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[96] Schwartz, R., J. Dodge, N. A. Smith, and O. Etzioni, "Green AI," arXiv:1907.10597, 2019.
[97] EPO guidelines, https://www.epo.org/law-practice/legal-texts/html/guidelines2018/e/g_ii_3_3_1.htm.


List of Acronyms

1-D	One-dimensional
2-D	Two-dimensional
3-D	Three-dimensional
4-D	Four-dimensional
1S	One-step
2S	Two-step
ABORT	Adaptive beamformer orthogonal rejection test
ACCA-CFAR	Automatic censored cell averaging CFAR
ACE	Adaptive coherence estimator
AED	Adaptive energy detector
AESA	Active electronically scanned array
ADAS	Advanced driver assistance system
AI	Artificial intelligence
ALR	Average likelihood ratio
AMF	Adaptive matched filter
AMFD	Adaptive matched filter with de-emphasis
ASB	Adaptive sidelobe blanker
ASIC	Application specific integrated circuit
ASIL	Automotive safety integrity level
AWGN	Additive white Gaussian noise
CA-CFAR	Cell-averaging CFAR
CAD	Conic acceptance detector
CARD	Conic acceptance-rejection detector
CFAR	Constant false alarm rate
CFAR-FP	CFAR feature plane

Coluccia: “bm_v2” — 2022/10/7 — 13:07 — page 195 — #1


CNN	Convolutional neural network
CPI	Coherent processing interval
DL	Deep learning
DNN	Deep neural network
DOA	Direction of arrival
ED	Energy detector
EV	Eigenvalue
FFT	Fast Fourier transform
FPGA	Field programmable gate array
FMCW	Frequency-modulated continuous-wave
GAN	Generative adversarial network
GLRT	Generalized likelihood ratio test
GO-CFAR	Greatest-of CFAR
GPU	Graphics processing unit
I/Q	In-phase and quadrature
KNN	k-nearest neighbors
LDA	Linear discriminant analysis
LFM	Linear frequency modulation
LMS	Least mean square
LR	Likelihood ratio
LRT	Likelihood ratio test
LSTM	Long short-term memory
MAP	Maximum a-posteriori probability
MED	Minimum eigenvalue detector
MF	Matched filter
MIMO	Multiple-input multiple-output
ML	Maximum likelihood
MLP	Multilayer perceptron
MMED	Maximum-minimum eigenvalue detector
MSE	Mean squared error
MTI	Moving target indicator
NARX	Nonlinear autoregressive exogenous
NN	Neural network
NP	Neyman-Pearson
OS-CFAR	Ordered statistics CFAR
PAM	Pulse amplitude modulation
PCA	Principal component analysis
PD	Probability of detection
PDF	Probability density function
PFA	Probability of false alarm
PINN	Physics-informed neural network
PRF	Pulse repetition frequency


PRT	Pulse repetition time
PSK	Phase shift keying
QAM	Quadrature amplitude modulation
QDA	Quadratic discriminant analysis
RBF	Radial basis function (Gaussian kernel)
RBM	Restricted Boltzmann machine
ReLU	Rectified linear unit
ResNet	Residual neural network
RF	Radiofrequency
RKHS	Reproducing-kernel Hilbert space
RMB	Reed-Mallett-Brennan rule
RNN	Recurrent neural network
ROB	Random-signal robustified detector
RSS	Residual sum-of-squares
SAR	Synthetic aperture radar
SCM	Sample covariance matrix
SCR	Signal-to-clutter ratio
SD	Subspace detector
SDR	Software-defined radio
SEI	Specific emitter identification
SINR	Signal-to-interference plus noise ratio
SNR	Signal-to-noise ratio
SO-CFAR	Smallest-of CFAR
SOTIF	Safety of the intended functionality
STAP	Space-time adaptive processing
STFT	Short-time Fourier transform
SVM	Support vector machine
t-SNE	t-distributed stochastic neighbor embedding
UMP	Uniformly most powerful
UMPI	UMP invariant
VC	Vapnik-Chervonenkis
W-ABORT	Whitened ABORT
WVD	Wigner-Ville distribution


List of Symbols

Vectors/matrices are denoted by boldface lower-case/upper-case letters. A hat over a variable, x̂, denotes an estimate of x.

R	set of real numbers
C	set of complex numbers
C^(n×m)	Euclidean space of (n × m)-dimensional complex matrices
C^n	shorthand for C^(n×1) (n-dimensional complex vectors)
j = √−1	imaginary unit
e^x or exp{x}	exponential function
sign(x) = +1 if x ≥ 0, −1 if x < 0	signum function
det(·)	determinant
Tr{·}	trace
(·)*	complex conjugate
(·)^T	transpose
(·)†	conjugate transpose (Hermitian)
Re{·}	real part
‖x‖	Euclidean norm of the complex vector x ∈ C^n
|x|	modulus of the complex number x ∈ C
⊗	Kronecker product
1(·)	indicator function (1 if the argument is true, 0 otherwise)


I	identity matrix of proper dimensions
0	null vector/matrix of proper dimensions
1	vector/matrix of ones of proper dimensions
P_M = M(M† M)^(−1) M†	projector onto the subspace spanned by the columns of M
P⊥_M = I − P_M	projector onto the orthogonal complement
P(·)	probability
E[·]	statistical expectation (mean)
VAR[·]	statistical variance
x ∼ CN(m, M)	n-dimensional complex normal distribution of x with mean m ∈ C^n, positive definite covariance matrix M ∈ C^(n×n), and PDF given by (1/(π^n det(M))) exp{−(x − m)† M^(−1) (x − m)}
x ∼ χ²_n	(central) chi-square distribution with n degrees of freedom, and CDF denoted by F_{χ²_n}(x)
b	bias term
c_1, ..., c_M	output decision classes
E	energy of the signal
e_i	ith canonical vector
f	decision function (input-output mapping)
f_D	Doppler frequency shift (in hertz)
H	space of functions for learning
H_0	null hypothesis (noise-only, absence of target)
H_1	alternative hypothesis (signal-plus-noise, presence of target)
K(·, ·)	kernel function
K	number of secondary data
L(·)	loss function
ℓ_i	label of ith training data
M	number of classes in decision space (M = 2 for binary detection)
m	dimension of the observation space
N	length of the generic primary data z
N_T	number of training samples
N_S	number of sensors (antennas)
N_T	number of transmit pulses
n	complex random vector of the overall noise
n	dimension of the feature space
o	m-dimensional vector of observed data
P_FA	probability of false alarm
P_D	probability of detection


R	covariance matrix of the data (specific subscript may be present)
S = ZZ†	scatter matrix
s	useful signal vector
T	training set (possibly with superscript 0 or 1 to denote the class)
T_1/T_2	eigenvalue-based detector
t(z), t(z, S)	decision statistic (specific subscript may be present)
v	generic (space, time, or space-time) steering vector
w	weight vector
x	n-dimensional feature vector
y, y_i	output variable (decision)
Z = [z_1 ··· z_K]	matrix of secondary data
z	vector signal from the cell under test (primary data)
z_1, ..., z_K	secondary data
α	complex amplitude of the primary data z
(β, t̃)	maximal invariant statistics for CFAR detectors in Gaussian environment
γ	SNR value for the actual steering vector
ε	vector of parameters for tunable detectors (specific subscript may be present)
η	detection threshold
θ	angle between nominal and actual steering vectors (in mismatch level cos²θ)
λ_max(·)	maximum eigenvalue of the matrix argument
λ_min(·)	minimum eigenvalue of the matrix argument
ν_S	normalized spatial frequency of the target (depending on DOA)
ν_T	normalized Doppler frequency of the target
χ(·, ·)	ambiguity function
σ²	noise power
σ_s²	power of the useful signal
τ	round-trip time between radar and target (twice the time of flight)


About the Author

Angelo Coluccia received an MSc degree summa cum laude in telecommunication engineering in 2007 and a PhD degree in information engineering in 2011, both from the University of Salento, Lecce, Italy. He is currently an associate professor of telecommunications in the Department of Engineering, University of Salento, Lecce, Italy. He has been a research fellow at Forschungszentrum Telekommunikation Wien (Vienna, Austria), and has held a visiting position at the Department of Electronics, Optronics, and Signals of the Institut Supérieur de l'Aéronautique et de l'Espace (ISAE-Supaero, Toulouse, France). His research interests are in the area of multichannel, multisensor, and multiagent statistical signal processing for detection, estimation, localization, and learning problems. Relevant application fields are radar, wireless networks (including 5G and beyond), and emerging network contexts (including intelligent cyber-physical systems, smart devices, and social networks). He is a Senior Member of IEEE, a member of the Sensor Array and Multichannel Technical Committee of the IEEE Signal Processing Society, a member of the Data Science Initiative of the IEEE Signal Processing Society, and a member of the Technical Area Committee in Signal Processing for Multisensor Systems of EURASIP (European Association for Signal Processing).


Index

Acronyms list, this book, 195–97
Activation function, 76
Adaptive beamformer orthogonal rejection test (ABORT), 33, 36–37
Adaptive coherence estimator (ACE) detectors, 32–33
Adaptive detectors
  CARD, 36–37
  model-based design of, 33–42
  random-signal approach, 37–42
  subspace approach, 35–36
  tunable receivers, 34–35
  W-ABORT, 36–37
Adaptive matched filter (AMF)
  about, 18
  with de-emphasis (AMFD), 34
  detectors, 32–33, 147–49
  mismatch, 31–32
Adaptiveness, 37
Adaptive radar detection
  about, 2
  CFAR, 13–15, 62
  in colored noise, 25–42
  model-based, 1–43
  radar processing and, 1–9
  stages of processing, 2
  structured signal in white noise and, 18–25
  summary, 42–43
  unstructured signal in white noise and, 9–18

Adaptive sidelobe blanker (ASB), 39
Additive white Gaussian noise (AWGN), 102
Advanced driver assistance systems (ADAS), 166
Adversarial attacks, 181–82
Algorithm unrolling, 169–70
Ambiguity function, 4
Anomaly detection, 84, 123–24
Array processing, 5
Autoencoders, 91–92
Automatic modulation recognition, 100–102
Automatic target recognition (ATR), 166–67
Autoregressive (AR) models, 125
Autoregressive moving average (ARMA) models, 125
Average likelihood ratio (ALR), 19, 115–16
Backpropagation, 88
Bagging, 82
Bias-shifting procedure, 110
Bias-variance trade-off, 71–72
Binary (antipodal) modulation, 53–54
Binary classification
  confusion matrix and, 62
  decision trees and, 111
  example, 63
  NN architecture for, 87
  output layer and, 86
  radar detection versus, 61–64


  two-dimensional feature vectors for, 59
  See also Classification
Blind detection, 17
Boosting, 82
CART algorithm, 112
Cell-averaging CFAR (CA-CFAR)
  about, 14
  performance degradation, 115
  threshold behavior illustration, 15
CFAR detectors
  about, xii
  generalized CFAR property and, 30, 42
  obtaining, 154
  zero loss and, 41
CFAR feature plane (CFAR-FP)
  about, 151
  characterizing model-based detectors in, 153–58
  classification region boundary in, 154
  data in, 155
  decision boundaries in, 156, 157, 158
  decision region in, 154
  design strategies in, 158–59
  framework, 158
  invariant feature extractor and, 152
  maximal invariant, 151–53
  model-based detection in, 151–59
  optimal design problem and, 159
  point clusters in, 156
  position and shape of clusters in, 157
  reference directions, 159
  as unified network, 152
CFAR processing
  NN for, 114–17
  NP, ALR, GLRT and, 116
  SVM for, 114–17
Chirps, 50
Classification
  binary, 61–64, 86, 87, 111
  for complex data, 98
  DNN-based, 166
  image-like, 121–22
  region boundary in CFAR-FP, 153–58
  signal, 60, 100–109
  SVM, 78–80
Classification algorithms, 66–72
Classifiers
  about, 50


  data-driven, 72–85
  decision regions and, 55–61
  kernel, 177
  one-class, 124
  RKHS, 177–78
Clustering methods, 84
Clusters, 63
Clutter, 3
CNN-based detectors, 121, 122
Coherent processing intervals (CPIs)
  about, 3
  fast-time/slow-time matrix, 4
  temporal domain of, 7
Coherent radars
  about, 3
  generalities and terminology for, 2–5
  target detection and performance metrics, 8–9
Colored noise, 25–42
Complex statistical mechanics, 180
Concept drift, 137
Confusion matrix, 57, 62
Conic/acceptance/rejection (CARD) detectors, 37
Constant false alarm rate (CFAR)
  about, xi, 9
  adaptive detection, 13–15
  cell-averaging (CA-CFAR), 14–15, 115
  generalized property, 28, 42
  See also CFAR detectors
Controlled PFA
  decision-tree based detection with, 111–12
  guaranteed, 123
  SVM-based detection with, 110–11
Convolutional neural networks (CNNs)
  about, 89–90
  antenna selection problem and, 100
  complex-valued temporal radio signals and, 101
  interpretation of, 172
  intuition behind, 101
  as multilayer perceptron, 90
  1-D, 108
  processing roots, 167
  for sea clutter, 121
  time frequency maps and, 121
  2-D, 108–9
  wavelet scattering networks, 171–72
Correct detection, 8


Correlated signal model, in white noise, 15–18
Covariance matrix
  estimation, 27
  power level from, 26
  sample (SCM), 29–30
Cross-entropy, 56
Cross-validation, 70
Data augmentation, 91, 105
Data-driven, defined, 55
Data-driven classifiers
  about, 72–73
  k-nearest neighbors, 73–74
  methods for dimensionality reduction/classification and, 75–77
  SVM and kernel methods and, 77–81
  See also Classifiers
Data-driven radar applications, 97–100
Datasets
  acquisition/generation, 159–60
  annotation, 104–5
  biasedness, 103–4
  data collection and, 102
  testing performance and properties, 105–7
  unbalanced, 103–4
Decision boundaries
  in CFAR-FP, 156, 157, 158
  k-nearest neighbors (KNN), 73–74
  well-known CFAR detectors, 156
Decision problems
  binary, 49–50
  general formulation of, 49–50
  M-ary, 50–55
  plain modulation decoding versus, 100–101
Decision regions, 57
Decision rules, 55
Decision statistics, 8
Decision tree-based detection, 111–12
Decision trees, 81–82, 111
Decoupled processing, 27–28
Deep Belief Networks (DBNs), 181
Deep learning
  about, 89
  applications, 99
  as arrival point, 185
  convolutional neural networks (CNNs) and, 89–90, 99–100


  deep neural networks (DNNs) and, 89, 90–91
  mathematics of, 180–81
Deep learning-based detection
  about, 120–21
  as image-like classification task, 121
  on raw data, 122–23
Deep neural networks (DNNs)
  about, 89, 90–91
  alternatives to, 92
  building blocks of, 168
  expressiveness and, 173–74
  generalization and, 174–75
  hyperparameters, 91
  interpretability and, 166
  overfitting and, 90
  theories for, 167–81
  understanding of, 171
  universal mapping and, 173
  vanishing gradient problem and, 91
  visualization of, 184–85
  widespread use of, 165
Deep unfolding, 142
Degrees of freedom (DOF), 68–70
Dependency between input (CUT), 160
Detection performance, 12
Detectors
  ACE, 32–33
  adaptive, model-based design of, 33–42
  AMF, 32–33, 147–49
  CARD, 37
  CFAR, xii, 30, 41, 120, 121, 137, 142, 147–51, 154, 169
  CNN-based, 121–22
  energy (ED), 9–16, 29
  GLRT-based, 26
  Kelly's, 31–32, 147–49
  KNN-based, 145–47
  most powerful invariant (MPI), 158
  PCA-based, 124
  square-law, 11
  subspace, 35–36
  uniformly most powerful (UMP), 11, 17
  W-ABORT, 36–37
  Wald-type, 126
  well-known CFAR, 147–51
Digital modulation constellations, 55
Dimensionality reduction, 58, 84
Direction of arrival (DOA), 3, 5


Discriminant analysis, 75
Distill, 185
Doppler effect, 4
Doppler quantization, 32
Double descent phenomenon, 178, 179
Empirical risk minimization, 56
Energy detector (ED)
  about, 9
  adaptive, 13
  colored noise and, 29
  nonadaptive, 14
  as spectrum sensing tool, 16
  static energy and, 10
Exclusive OR (XOR), 63
Expected risk minimization, 56
Expressiveness, 173–74
False alarms, 8, 9
Fast Fourier transform (FFT), 5
Feature engineering, 59, 89
Feature learning, 88–89
Feature pyramid network (FPN), 152
Feature selection, 59
Feature space
  about, 58
  based on well-known statistics or raw data, 142–51
  CFAR, 151–59
  data clusters in, 148, 150
  maximal invariant, 151
  signal representation in, 61
  three-dimensional, 61, 148
  two-dimensional, 150
Feature vector
  binary classification and, 59
  quasi-whitened raw data as, 144–47
  well-known CFAR statistics, 147–51
Feedforward networks, 85, 181
Frequency-modulated continuous wave (FMCW), 99
Gaussian function, 159
Gaussian kernel, 80
Generalization
  (D)NNs and, 174–75, 183
  about, 56
  capability, 67, 174–75
Generalization gap, 174
Generalized CFAR property


  about, 28
  CFAR detectors, 30, 42
  feature spaces with, 117–20
  obtaining detectors with, 119
  performance and, 145–46
  statistics with, 120
Generalized correlated non-Gaussian models, 125
Generalized likelihood ratio test (GLRT)
  about, xi, 13, 20–21
  applying, 22
  capability of, 24
  CFAR processing and, 115–16
  general hypothesis testing problem via, 28–31
  procedure, 22
Generative adversarial networks (GANs), 167
Gini criterion, 82
Green AI, 187
Hidden context, 138
Hidden layers, 86
Hybridization
  about, 123
  CFAR-FP and, 151–59
  dimensions of, 139–40
  feature spaces and, 142–51
  general purpose of, 140–41
  in signal processing and communications, 140–42
  summary, 159–60
  trend, 140
Hyperparameters, 70, 91
Hypothesis testing model, 38
Information bottleneck theory, 172
Interpolating regime, 178
Interpretability, 182–84
Invariant feature extractor, 152
Isolation forest, 124–25
Kelly's detector, 147–49
Kernel classifiers, 177
Kernel discriminant analysis, 80
Kernel perceptron, 80
Kernel trick, 64, 79
K-nearest neighbors (KNN)
  about, 73
  classifier, deriving, 142


  decision boundaries, 73–74
  disadvantages, 74
  as local method, 73
  as nonparametric approach, 142–44
  in radar detector design, 144
KNN-based detectors
  Gaussian case, 147
  general scheme for, 145
  mismatched conditions and, 145
  PFA and, 146
Learned input-output mapping, 174
Learning
  approaches, 66–72
  deep, 89–93
  feature, 88–89
  nonparametric, 142–44
  radar application for, 97–127
  reinforcement, 85, 125–26
  semisupervised, 84–85
  statistical, 66–71
  summary, 126–27
  supervised, 66–67, 109–23
  unsupervised, 84, 123–25
Learning rate, 88
Learning rule, 67
Likelihood ratio test (LRT), 11, 20
Linear discriminant analysis (LDA), 59, 75–76
Long short-term memory (LSTM) networks, 92, 101
Machine learning
  challenges in, 165–67
  performance gap, 183
  radar applications, 97–100
  See also Learning
Mapping
  f, obtaining, 53
  learned input-output, 174
  m-dimensional observation space, 54
  optimal, derivation of, 55
  universal, 173
Margin, 77, 78
Marginal return, 187
Massive MIMO radars, 126
MATLAB toolbox, 106, 109
Maximal invariant feature space, 151
Maximum a posteriori probability (MAP) rule, 141


Maximum-minimum eigenvalue detector (MMED), 16–17, 24
Mean field, 180
Minimum eigenvalue detector (MED), 16–17
Model-based detection
  in CFAR feature space, 151–59
  decision theory and, 112
Most powerful invariant (MPI) detector, 158
Moving target indicator (MTI), 28
Multihypothesis decision problems, 50–55
Multilayer perceptrons (MLPs), 66, 90, 170–72
Network structures, 168–71
Neural networks (NNs)
  about, 85
  alternatives to, 92
  autoencoders, 91
  building blocks of, 168
  for CFAR processing, 114–17
  deep learning and, 89–93
  expressiveness and, 173–74
  feature learning, 88–89
  feedforward, 85, 181
  general architecture, 87
  generalization and, 174–75
  multilayer perceptron and, 86
  as network structures, 168–71
  performance improvement of, 171
  physics-informed (PINNs), 179
  spiking (SNN), 168
  theories for, 167–81
  as unrolling, 169–70
  See also Deep neural networks (DNNs); Recurrent NNs (RNNs)
NeuralProphet timeseries forecasting method, 140–41
Neyman-Pearson (NP) approach
  about, xi, 11–12
  revision of, 112–14
  support vector machine (SVM) and, 113
No-free-lunch theorem, 173
Noise subspace, geometrical representation, 23
Nonlinear autoregressive exogenous (NARX) model, 169
Observations, 8
Open issues
  adversarial attacks, 181–82


  stability, efficiency, interpretability, 182–84
  sustainability, marginal return, and patentability, 185–87
  visualization, 184–85
Organization, this book, xi–xii
Overcomplete representation, 65
Overfitting, 67, 68, 69, 90, 174–75
Overparameterized interpolation, 176–77
Parallel sheaf of (hyper)planes, 110
Parameter estimation, 71
Patentability, 187
Pattern recognition, 50
PCA-based detector, 124
Peak-to-average-power ratio (PAR), 125
Perceptron criterion, 77
Performance metrics, 1, 8–9, 69, 70
Physics-informed neural networks (PINNs), 179
Principal component analysis (PCA), 59, 76
Probability of detection, 9
Probability of false alarm, 9
Pulse-Doppler matched filterbank, 25, 26
Pulse repetition frequency (PRF), 3
Pulse repetition time (PRT), 3
Quadratic discriminant analysis (QDA), 75–76
Quasi-whitened raw data, 144–47
Radar applications (machine learning)
  about, 97
  data-driven, 97–100
  general scheme for processing steps in, 98
Radar cross section (RCS), 3
Radar datacube, 7
Radial basis function (RBF), 80
Random forest algorithm, 81
Random-signal approach, 37–42
Rao's test, 157
Raw data
  derived values as functions of, 58
  detection on, 122–23
  direct use of, 59
  quasi-whitened, as feature vector, 144–47
  RF signal collection and, 102
Recognition problems, 50
Recurrent NNs (RNNs), 85, 92–93, 108, 170
Regression problems, 56


Regularization methods, 90–91
Reinforcement learning, 85, 125–26
Representer theorem (RT), 177
Residual sum-of-squares (RSS), 184
Restricted Boltzmann machines (RBMs), 93
Retraining, 115–16, 137–39
Return, 3
Safety of the intended functionality (SOTIF), 166
Sample covariance matrix (SCM), 29–30
Scatter matrix, 29–30, 144
Second-order model, 37
Semisupervised learning, 84–85
Signal classification
  about, 60, 100
  automatic modulation recognition and, 100–102
  datasets and, 102–7
  radar signals and radiation sources, 107–9
Signal representation, overcomplete, 65
Signal-to-interference plus noise ratio (SINR), 109
Signal-to-noise ratio (SNR), 3, 33, 144–45, 150
Slow time, 3
Soft-margin, 78
Software-defined radio (SDR), 60, 103
Space-time adaptive processing (STAP)
  about, 2, 3, 6–7
  as linear filtering technique, 7
  vectorized 2-D filter and, 19
Spatial steering vector, 6
Specific emitter identification (SEI), 107–8
Spectrum sensing
  energy detector (ED) and, 16
  features, 119
  SVM, 111
Spiking neural networks (SNNs), 168
Square-law detector, 11
Stability/robustness, 183–84
Statistical learning, 66–71
Stochastic gradient descent, 88
Stochastic machines, 180–81
Structured signal in white noise
  about, 18
  generalized likelihood ratio test (GLRT) and, 20–24


  matched filter and, 18–20
  steering vector known and, 25
  unknown rank-one, 24–25
Subspace approach, 35–36
Sufficient statistics, 53
Supervised learning, 66–67, 109–23
Support vector machine (SVM)
  about, 78
  for CFAR processing, 114–17
  classification, 78–80
  controllable classifier, 123
  importance of, 78–79
  for minimax and NP classification, 113
  as minimum-norm RKHS classifier, 177
  spectrum sensing, 111
Sustainability, 185–87
SVM-based detection, 110–11, 114
Swerling I-II targets, 113
Symbols list, this book, 199–201
Synaptic weights, 77, 86
Synthetic aperture radar (SAR), 8
Target detection
  as image-like classification task, 121
  as pattern recognition problem, 50
  performance metrics and, 8–9
  point, 2
  in range domain, 4–5
  reinforcement learning and, 100
  steering vectors and, 25
Temporal steering vector, 5
Time frequency maps, 121
Training
  algorithms, 66–67
  balanced, 67
  DNNs, 90–91
  error, 68–69
  process, 55
  statistical curve fitting and, 70
Training error, 55


Tunable receivers, 34–35
UMP invariant (UMPI) test, 25, 64
"Unbiasedness culture," 184
Underfitting, 174–75
Uniformly most powerful (UMP) detectors, 11
Universal approximation, 64, 65–66
Universal mapping, 173
Unrolling, 169–70
Unsupervised learning
  about, 84, 123–24
  isolation forest, 124–25
  one-class classifiers, 124
  PCA-based detector, 124
  See also Learning
Vanishing gradient problem, 91
Vapnik-Chervonenkis (VC) dimension, 174
Variability index CFAR (VI-CFAR), 115
Vectors
  feature, 59, 144–51
  steering, 5, 6, 25
  support, 78
Velocity determination, 3
Visualization, 184–85, 186
W-ABORT, 36–37, 157–58
Waveform matched filter, 52
Well-known CFAR detectors
  decision boundaries, 156
  as feature vector, 147
  general property of, 149
  See also CFAR detectors
White noise
  correlated signal model in, 15–18
  structured signal in, 18–25
  unknown rank-one signal in, 24–25
  unstructured signal in, 9–18
Wigner-Ville distribution (WVD), 121


Artech House Radar Library
Joseph R. Guerci, Series Editor

Adaptive Antennas and Phased Arrays for Radar and Communications, Alan J. Fenn
Adaptive Radar Detection: Model-Based, Data-Driven, and Hybrid Approaches, Angelo Coluccia
Adaptive Signal Processing for Radar, Ramon Nitzberg
Advanced Techniques for Digital Receivers, Phillip E. Pace
Advances in Direction-of-Arrival Estimation, Sathish Chandran, editor
Airborne Early Warning System Concepts, Maurice W. Long
Airborne Pulsed Doppler Radar, Second Edition, Guy V. Morris and Linda Harkness, editors
Aspects of Modern Radar, Eli Brookner
Aspects of Radar Signal Processing, Frank F. Kretschmer, Bernard Lewis and Wesley W. Shelton
Basic Radar Analysis, Mervin C. Budge, Jr. and Shawn R. German
Basic Radar Tracking, Mervin C. Budge, Jr. and Shawn R. German
Bayesian Multiple Target Tracking, Second Edition, Lawrence D. Stone, Roy L. Streit, Thomas L. Corwin, and Kristine L. Bell
Beyond the Kalman Filter: Particle Filters for Tracking Applications, Branko Ristic, Sanjeev Arulampalam, and Neil Gordon
Cognitive Radar: The Knowledge-Aided Fully Adaptive Approach, Second Edition, Joseph R. Guerci
Coherent Radar Performance Estimation, James A. Scheer
Computer Simulation of Aerial Target Radar Scattering, Recognition, Detection, and Tracking, Yakov D. Shirman, editor
Control Engineering in Development Projects, Olis Rubin
Countermeasures for Aerial Drones, Garik Markarian and Andrew Staniforth

Deep Learning Applications of Short-Range Radars, Avik Santra and Souvik Hazra
Deep Learning for Radar and Communications Automatic Target Recognition, Uttam K. Majumder, Erik P. Blasch, and David A. Garren
Design and Analysis of Modern Tracking Systems, Samuel Blackman and Robert Popoli
Detectability of Spread-Spectrum Signals, George Dillard
Detecting and Classifying Low Probability of Intercept Radar, Second Edition, Phillip E. Pace
FMCW Radar Design, M. Jankiraman
Fourier Transforms in Radar and Signal Processing, Second Edition, David Brandwood
Fundamentals of Short-Range FM Radar, Igor V. Komarov and Sergey M. Smolskiy
Handbook of Computer Simulation in Radio Engineering, Communications, and Radar, Sergey A. Leonov and Alexander I. Leonov
Handbook of Radar Measurement, David K. Barton
High-Resolution Radar, Second Edition, Donald R. Wehner
High Resolution Radar Cross-Section Imaging, Dean L. Mensa
Highly Integrated Low-Power Radars, Sergio Saponara, Maria Greco, Egidio Ragonese, Giuseppe Palmisano, and Bruno Neri
Introduction to LabVIEW™ FPGA for RF, Radar, and Electronic Warfare Applications, Terry Stratoudakis
An Introduction to Passive Radar, Second Edition, Hugh D. Griffiths and Christopher J. Baker
Introduction to Radar using Python and MATLAB®, Lee Andrew Harrison
Introduction to RF Equipment and System Design, Pekka Eskelinen
Linear Systems and Signals: A Primer, JC Olivier
Meter-Wave Synthetic Aperture Radar for Concealed Object Detection, Hans Hellsten

Methods and Techniques of Radar Recognition, Victor Nebabin
The Micro-Doppler Effect in Radar, Second Edition, Victor C. Chen
Microwave Radar: Imaging and Advanced Concepts, Roger J. Sullivan
Millimeter-Wave Radar Clutter, Nicholas C. Currie, Robert D. Hayes, and Robert Trebits
Millimeter-Wave Radar Targets and Clutter, Gennadiy P. Kulemin
MIMO Radar: Theory and Application, Jamie Bergin and Joseph R. Guerci
Modern Radar Systems, Second Edition, Hamish Meikle
Modern Radar System Analysis, David K. Barton
Modern Radar System Analysis Software and User's Manual, Version 3.0, David K. Barton
Monopulse Principles and Techniques, Second Edition, Samuel M. Sherman and David K. Barton
Motion and Gesture Sensing with Radar, Jian Wang and Jaime Lien
MTI and Pulsed Doppler Radar with MATLAB®, Second Edition, D. Curtis Schleher
Multifunction Array Radar, Dale Billetter
Multiple-Target Tracking with Radar Applications, Samuel S. Blackman
Multitarget-Multisensor Tracking: Applications and Advances, Volume II, Yaakov Bar-Shalom and Robert F. Popoli, editors
Multitarget-Multisensor Tracking: Applications and Advances, Volume III, Yaakov Bar-Shalom and William Dale Blair, editors
Non-Line-of-Sight Radar, Brian C. Watson and Joseph R. Guerci
Photonic Aspects of Modern Radar, Edward N. Toughlian and Henry Zmuda
Practical Phased Array Antenna Systems, Eli Brookner
Practical Simulation of Radar Antennas and Radomes, Herbert L. Hirsch

Precision FMCW Short-Range Radar for Industrial Applications, Boris A. Atayants, Viacheslav M. Davydochkin, Victor V. Ezerskiy, Valery S. Parshin, and Sergey M. Smolskiy
Principles of High-Resolution Radar, August W. Rihaczek
Principles of Radar and Sonar Signal Processing, François Le Chevalier
Radar Cross-Section Analysis & Control, Asoke K. Bhattacharyya and D. L. Sengupta
Radar Cross Section, Second Edition, Eugene F. Knott, et al.
Radar Detection and Tracking Systems, Shaheen A. Hovanessian
Radar Engineer's Sourcebook, William C. Morchin
Radar Equations for Modern Radar, David K. Barton
Radar Evaluation Handbook, David K. Barton, et al.
Radar for Fully Autonomous Driving, Matt Markel, editor
Radar Meteorology, Henri Sauvageot
Radar Range Performance Analysis, Lamont V. Blake
Radar Reflectivity of Land and Sea, Third Edition, Maurice W. Long
Radar Resolution and Complex-Image Analysis, August W. Rihaczek and Stephen J. Hershkowitz
Radar RF Circuit Design, Second Edition, Nickolas Kingsley and Joseph R. Guerci
Radar Signal Processing and Adaptive Systems, Ramon Nitzberg
Radar Signals: An Introduction to Theory and Application, Charles E. Cook
Radar System Analysis, Design, and Simulation, Eyung W. Kang
Radar System Analysis and Modeling, David K. Barton
Radar System Design and Analysis, Shaheen A. Hovanessian
Radar System Performance Modeling, Second Edition, G. Richard Curry
Radar Technology, Eli Brookner
Radar Technology Encyclopedia, David K. Barton and Sergey A. Leonov, editors

Radar Vulnerability to Jamming, Stephen Lothes
Radars, Volume 1: Monopulse Radar, David K. Barton
Radars, Volume 2: The Radar Equation, David K. Barton
Radars, Volume 3: Pulse Compression, David K. Barton
Radars, Volume 4: Radar Resolution & Multipath Effects, David K. Barton
Radars, Volume 5: Radar Clutter, David K. Barton
Radars, Volume 6: Frequency Agility and Diversity, David K. Barton
Radars, Volume 7: CW and Doppler Radar, David K. Barton
Radio Wave Propagation Fundamentals, Second Edition, Artem Saakian
Range-Doppler Radar Imaging and Motion Compensation, Jae Sok Son, et al.
Robotic Navigation and Mapping with Radar, Martin Adams, John Mullane, Ebi Jose, and Ba-Ngu Vo
Russian-English and English-Russian Dictionary of Radar and Electronics, William F. Barton and Sergey A. Leonov
Secondary Surveillance Radar, Michael C. Stevens
Signal Detection and Estimation, Second Edition, Mourad Barkat
Signal Processing in Noise Waveform Radar, Krzysztof Kulpa
Signal Processing for Passive Bistatic Radar, Mateusz Malanowski
Solid-State Radar Transmitters, Edward D. Ostroff
Space-Based Radar Handbook, Leopold J. Cantafio, Richard K. Moore, and Earl E. Swartzlander, Jr.
Space-Time Adaptive Processing for Radar, Second Edition, Joseph R. Guerci
Spaceborne Weather Radar, Toshiaki Kozu and Robert Meneghini
Special Design Topics in Digital Wideband Receivers, James Tsui
Statistical Signal Characterization, Herbert L. Hirsch
Surface-Based Air Defense System Analysis, Robert H. MacFadzean
Synthetic Array and Imaging Radars, Shaheen A. Hovanessian

Systems Engineering of Phased Arrays, Rick Sturdivant, Clifton Quan, and Enson Chang
Techniques of Radar Reflectivity Measurement, Nicholas C. Currie
Theory and Practice of Radar Target Identification, August W. Rihaczek and Stephen J. Hershkowitz
Time-Frequency Signal Analysis with Applications, Ljubiša Stanković, Miloš Daković, and Thayananthan Thayaparan
Time-Frequency Transforms for Radar Imaging and Signal Analysis, Victor C. Chen and Hao Ling
Transmit Receive Modules for Radar and Communication Systems, Rick Sturdivant and Mike Harris

For further information on these and other Artech House titles, including previously considered out-of-print books now available through our In-Print-Forever® (IPF®) program, contact:

Artech House
685 Canton Street
Norwood, MA 02062
Phone: 781-769-9750
Fax: 781-769-6334
e-mail: [email protected]

Artech House
16 Sussex Street
London SW1V 4RW UK
Phone: +44 (0)20 7596-8750
Fax: +44 (0)20 7630-0166
e-mail: [email protected]

Find us on the World Wide Web at: www.artechhouse.com