Deep Biometrics 3030325822, 9783030325824, 9783030325831

This book highlights new advances in biometrics using deep learning toward a deeper and wider background, deeming it “Deep Biometrics”.


English | Pages: 322 | Year: 2020


Table of contents :
Preface......Page 6
Contents......Page 8
1 Introduction......Page 10
2.1 Datasets for the Age Estimation......Page 12
2.2 Evaluation Metrics for Age Estimation Models......Page 13
2.3.1 Customised Loss Functions for Age Estimation......Page 14
2.3.3 Age Estimation with Multi-Task Learning......Page 17
3 Age Synthesis......Page 19
3.2 Deep Learning Based Age Synthesis Methods......Page 21
3.3 Future Research on Age Synthesis......Page 22
4 Age-Invariant Face Recognition......Page 23
4.1 Deep Learning Based Age-Invariant Face Recognition Methods......Page 24
4.2 Future Research Trends on Age-Invariant Face Recognition......Page 25
References......Page 26
1 Introduction......Page 30
2.1 Datasets with Soft Biometrics Annotations......Page 31
2.3.1 Supervised Attribute Assisted Verification Based Person Re-ID......Page 33
2.3.2 Supervised Attribute Assisted Identification Based Person Re-ID......Page 35
3.1 Semi-supervised Cross-Dataset Attribute Learning via Co-training......Page 36
3.2.1 Attribute Consistency Adaptation Method......Page 38
3.2.2 MMD Based Feature Alignment Adaptation Method......Page 40
4 Performance Comparison......Page 42
5 Existing Issues and Future Trends......Page 43
References......Page 44
1 Introduction......Page 46
2.1 Traditional Methods......Page 48
2.2 Deep Learning Methods......Page 50
3.2 Stacked Hourglass with Intermediate Supervision......Page 51
3.4 Depth Network for 3D Landmarks......Page 53
4 Evaluation......Page 54
References......Page 57
1 Introduction......Page 59
2 Related Work......Page 60
3.1 Frame Extraction......Page 62
3.2 IQM Calculation......Page 64
3.3 Selected Image Quality Measures (IQMs)......Page 65
3.4.1 Datasets......Page 66
3.4.2 Evaluation Protocol......Page 67
3.4.3 Results on Replay Attack Database......Page 68
3.4.4 Results on Replay Mobile Database......Page 69
4.1 Proposed Scheme......Page 70
4.3 Training......Page 71
4.4.2 Results......Page 72
5 Discussion......Page 74
References......Page 75
1 Introduction......Page 78
2 Related Work......Page 79
2.1 Bag of Features......Page 80
2.2 Speeded-Up Robust Features......Page 81
2.3 Histogram of Oriented Gradients......Page 82
3.1 Bag of Features......Page 83
3.2 Histogram of Oriented Gradients Feature Descriptor......Page 85
4.1 Artificial Neural Networks......Page 86
4.2 Convolutional Neural Networks......Page 87
4.3 Pairwise Neural Network Structures......Page 88
5.1 Hardware and Software......Page 89
5.2 Benchmark Data Sets......Page 90
5.3 Computational Experiments......Page 91
6 Discussion......Page 97
7 Conclusions and Further Work......Page 100
References......Page 101
1 Introduction......Page 105
1.1 Motivation Factor......Page 108
2.1 Current Literature......Page 109
3.2 Face Recognition Components......Page 112
3.2.2 Deep Feature Extraction......Page 113
3.2.3 Face Matching......Page 114
4 Face Recognition Datasets......Page 115
5.1 Deep Learning Architectures......Page 117
5.2 Discriminative Loss Functions......Page 131
5.2.1 Euclidean Distance Loss......Page 132
5.2.2 Cosine Margin Loss......Page 133
5.3 Face Matching Through Deep Features......Page 134
5.4 Face Processing Using Deep Features......Page 135
6 Experimental Results......Page 136
6.1 Evaluation Rules......Page 137
6.2 Comparison of Experimental Results of Existing Facial Models......Page 138
7 Further Discussions......Page 139
7.1 Other Recognition Issues......Page 140
7.2 Open Research Questions......Page 142
8 Conclusion......Page 143
References......Page 144
1 Introduction......Page 147
2 Related Work......Page 149
3 Cost-Based Intelligent Mechanism......Page 151
4 Touch-Dynamics-Based Authentication Scheme......Page 153
5.1 Cloud-Based Scheme......Page 154
5.3 Session Identification......Page 155
6.1 Study Methodology......Page 156
6.3 Evaluation Results......Page 157
7 Discussion......Page 160
References......Page 161
1 Introduction......Page 166
2 Related Work......Page 168
2.1 Ear Recognition in Constrained Conditions......Page 169
2.2 Ear Recognition in Unconstrained Conditions......Page 170
3 COM-Ear: A Deep Constellation Model for Ear Recognition......Page 171
3.2 Overview of COM-Ear......Page 172
3.2.2 The Local Processing Path......Page 173
3.2.3 Combining Information......Page 174
3.4 Implementation Details......Page 175
4.1 Experimental Datasets......Page 176
4.2 Performance Metrics......Page 177
4.3 Training Details......Page 178
4.4 Ablation Study......Page 179
4.5.1 Comparison with Competing Methods......Page 180
4.5.2 Comparison with Results from the 2017 Unconstrained Ear Recognition Challenge (UERC)......Page 181
4.6 Robustness to Occlusions......Page 184
4.7 Qualitative Evaluation......Page 186
References......Page 191
1 Introduction......Page 196
2.1 Attribute Recognition for Person Retrieval Using Handcrafted Features......Page 199
2.2 Deep Convolutional Neural Network (DCNN)......Page 200
2.3 Attribute Recognition for Person Retrieval Using Deep Features......Page 201
3 Person Retrieval System Using Deep Soft Biometrics......Page 204
3.1 Height Estimation Using Camera Calibration......Page 205
3.2 Torso Color Detection......Page 207
3.4.1 AlexNet Training for Color Classification......Page 208
4.1 Dataset Overview......Page 209
4.2 Evaluation Metric......Page 211
4.3 Qualitative and Quantitative Results......Page 213
4.3.1 True Positive Cases of Person Retrieval......Page 214
4.3.2 Challenges in Person Retrieval......Page 215
5 Conclusion and Future Work......Page 217
References......Page 218
1 Introduction......Page 220
2 A Spectral Biometric System: Design Perspective......Page 223
3 Spectral Bands......Page 225
4 Biometric Trait: Face......Page 226
4.1 Databases for Face......Page 227
4.1.2 The Hong Kong Polytechnic University Hyperspectral Face Database (HK PolyU-HSFD)......Page 228
4.1.5 CASIA HFB Database......Page 229
4.2.1 Face Recognition Using Convolutional Neural Networks (CNNs)......Page 230
4.2.2 NIRFaceNet......Page 231
5 Biometric Trait: Iris......Page 232
5.1.1 The IIT Delhi Iris Database......Page 233
5.1.4 University of Notre Dame Dataset......Page 234
5.1.6 Cross Sensor Iris and Periocular Database......Page 236
5.2.2 Deep Learning on IMP Database......Page 237
6 Biometric Trait: Palmprint......Page 238
6.1.1 PolyU Multispectral Palmprint Database......Page 239
6.2.1 PCANet Deep Learning for Palmprint Verification......Page 240
7 Conclusion and Future Work......Page 242
References......Page 243
1 Introduction......Page 249
2 Motivation of Our Work......Page 250
3.2 Blockchain Technology in Vehicle......Page 251
4 BBC-Based Credit Environment for IV Data Sharing......Page 253
4.2 Network Enabled Connected Devices......Page 254
4.3 BBC-Supported Intelligent Vehicles......Page 255
5 Privacy Issues in Biometric Blockchain......Page 257
6 Conclusion......Page 258
References......Page 259
1 Introduction......Page 261
2 Deep Learning......Page 263
3 Optical Flow......Page 264
3.1 Traditional Methods......Page 265
3.1.2 Patch Based Methods......Page 266
3.1.3 Patch Based with Variational Refinement......Page 267
3.2.1 Development of DL Based Optical Flow Estimation......Page 269
3.2.2 Other Important DL Based Methods......Page 274
3.4 Optical Flow Datasets......Page 277
3.6 Hybrid Methods......Page 279
3.6.1 Feature Based......Page 280
3.6.2 Domain Understanding......Page 281
3.7 Multi-Frame Methods......Page 282
4.1 Flow Estimation Benchmarks and Performance Assessment......Page 283
4.3 Biometrics Applications of Optical Flow......Page 284
References......Page 285
1 Introduction......Page 292
2.2 Iris PAD......Page 295
3 Data-Driven Methods for Presentation Attack Detection......Page 298
3.1 Face PAD......Page 300
3.2 Iris PAD......Page 301
3.3 Fingerprint PAD......Page 303
4.1 Architecture and Filter Optimization......Page 305
5 Challenges, Open Questions, and Outlook......Page 307
Appendix 1: Datasets and Research Work......Page 310
References......Page 311
Index......Page 315


Unsupervised and Semi-Supervised Learning Series Editor: M. Emre Celebi

Richard Jiang · Chang-Tsun Li · Danny Crookes · Weizhi Meng · Christophe Rosenberger, Editors

Deep Biometrics

Unsupervised and Semi-Supervised Learning Series Editor M. Emre Celebi, Computer Science Department, Conway, AR, USA

Springer’s Unsupervised and Semi-Supervised Learning book series covers the latest theoretical and practical developments in unsupervised and semi-supervised learning. Titles – including monographs, contributed works, professional books, and textbooks – tackle various issues surrounding the proliferation of massive amounts of unlabeled data in many application domains and how unsupervised learning algorithms can automatically discover interesting and useful patterns in such data. The books discuss how these algorithms have found numerous applications including pattern recognition, market basket analysis, web mining, social network analysis, information retrieval, recommender systems, market research, intrusion detection, and fraud detection. Books also discuss semi-supervised algorithms, which can make use of both labeled and unlabeled data and can be useful in application domains where unlabeled data is abundant, yet it is possible to obtain a small amount of labeled data. Topics of interest include:

– Unsupervised/Semi-Supervised Discretization
– Unsupervised/Semi-Supervised Feature Extraction
– Unsupervised/Semi-Supervised Feature Selection
– Association Rule Learning
– Semi-Supervised Classification
– Semi-Supervised Regression
– Unsupervised/Semi-Supervised Clustering
– Unsupervised/Semi-Supervised Anomaly/Novelty/Outlier Detection
– Evaluation of Unsupervised/Semi-Supervised Learning Algorithms
– Applications of Unsupervised/Semi-Supervised Learning

While the series focuses on unsupervised and semi-supervised learning, outstanding contributions in the field of supervised learning will also be considered. The intended audience includes students, researchers, and practitioners.

More information about this series at http://www.springer.com/series/15892

Richard Jiang • Chang-Tsun Li • Danny Crookes • Weizhi Meng • Christophe Rosenberger, Editors

Deep Biometrics

Editors

Richard Jiang, School of Computing and Communication, InfoLab21, Lancaster University, Lancaster, UK
Chang-Tsun Li, School of Information Technology, Deakin University, Waurn Ponds, VIC, Australia
Danny Crookes, School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK
Weizhi Meng, Technical University of Denmark, Kgs. Lyngby, Denmark
Christophe Rosenberger, GREYC Research Lab, ENSICAEN, Caen, France

ISSN 2522-848X ISSN 2522-8498 (electronic) Unsupervised and Semi-Supervised Learning ISBN 978-3-030-32582-4 ISBN 978-3-030-32583-1 (eBook) https://doi.org/10.1007/978-3-030-32583-1 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Biometrics, accompanied by the rapid rise of machine learning and computing, has become one of the most widely used new technologies in today’s digital economy. Interestingly, this modern technology can be dated back to nearly 4000 years ago, when the Babylonian Empire legislated the use of fingerprints to protect a legal contract against forgery and falsification by having the fingerprints impressed into the clay tablet on which the contract had been written. Nowadays, biometrics can be seen in a wide range of successful applications in biometric banking, internet of things, cloud computing, cybersecurity, medical biometrics, healthcare biometrics, and others. It has been estimated that the global biometric technology market size is likely to reach USD 59.31 billion by 2025.

Following the boom in biometric technology, new challenges have appeared. In particular, with the increasing reliance on the Internet and mobile applications in smart cities, the volume of biometric data is increasing exponentially, which demands new techniques to tackle the challenges of biometric big data. This rapid development has become a great motivator for much wider and deeper research into biometrics.

The terminology “Deep Biometrics” coined in this book refers to three aspects of the current research trend. First, with the great success of Deep Learning, a range of new techniques are now available for biometrics, with more robust performance in the task of biometric verification. In particular, deep neural networks (DNNs) have a capability to learn from large datasets, and it is likely that more training data will make them more accurate. This special attribute is a blessing for biometric big data, enabling biometric technology to benefit from the increase in data instead of suffering from it. In our book, we report new research progress in this direction, in chapters “Using Age Information as a Soft Biometric Trait for Face Image Analysis”, “Person Re-identification with Soft Biometrics Through Deep Learning”, “Atypical Facial Landmark Localisation with Stacked Hourglass Networks: A Study on 3D Facial Modelling for Medical Diagnosis”, “Anti-Spoofing in Face Recognition: Deep Learning and Image Quality Assessment-Based Approaches”, “Deep Learning for Biometric Face Recognition: Experimental Study on Benchmark Data Sets”, “Deep Learning Models for Face Recognition: A Comparative Analysis”, “Constellation-Based Deep Ear Recognition”, “Person Retrieval in Surveillance Videos Using Deep Soft Biometrics”, and “Deep Spectral Biometrics: Overview and Open Issues”.

Second, while biometrics has been widely applied to identify a specific human or even animal, some fundamental questions are not yet clearly answered. For example, why does each human or animal have a unique biometric appearance? Is this genetically defined from birth or is it formed during lifetime? If it is genetically defined, can any other body parts other than faces and fingerprints play the role of biometrics? In this book, we try to include a wider investigation into these deeper scientific enquiries. In chapter “Atypical Facial Landmark Localisation with Stacked Hourglass Networks: A Study on 3D Facial Modelling for Medical Diagnosis”, a deep model is developed to describe the possible underlying links with medical or genetic causes; in chapter “Constellation-Based Deep Ear Recognition”, the human ear is investigated as a new type of biometric; and in chapter “Developing Cloud-Based Intelligent Touch Behavioral Authentication on Mobile Phones”, touch behaviour is exploited as a new biometric method for mobile users.

Third, with the successful development of biometric technology, many new applications have been successfully exploited. Chapter “Biometric Blockchain: A Secure Solution for Intelligent Vehicle Data Sharing” exploits a new type of blockchain, namely biometric blockchain, to enable the previously anonymous protocol with human identities. Chapter “Optical Flow Estimation with Deep Learning, a Survey on Recent Advances” attempts to combine optical flow and deep learning into this topic. Chapter “Developing Cloud-Based Intelligent Touch Behavioral Authentication on Mobile Phones” investigates a cloud-based mobile biometric application. Chapter “The Rise of Data-Driven Models in Presentation Attack Detection” reports a biometric method for protecting data from attacks over cloud platforms.

The target audience for this book includes graduate students, engineers, researchers, scholars, forensic scientists, police forces, criminal solicitors, IT practitioners and developers who are interested in security and privacy-related issues on biometrics.

The editors would like to express their sincere gratitude to all distinguished contributors who have made this book possible and the group of reviewers who have offered insightful comments to improve the quality of each chapter. A dedicated team at Springer Publishing has offered professional assistance to the editors from inception to final production of the book. We thank them for their painstaking efforts at all stages of production. We also thank our readers for picking up this book and sharing in the exciting research advances in biometrics.

Bailrigg, Lancaster, UK

Richard Jiang

Contents

Using Age Information as a Soft Biometric Trait for Face Image Analysis . . . . . 1
    Haoyi Wang, Victor Sanchez, Wanli Ouyang, and Chang-Tsun Li

Person Re-identification with Soft Biometrics Through Deep Learning . . . . . 21
    Shan Lin and Chang-Tsun Li

Atypical Facial Landmark Localisation with Stacked Hourglass Networks: A Study on 3D Facial Modelling for Medical Diagnosis . . . . . 37
    Gary Storey, Ahmed Bouridane, Richard Jiang, and Chang-Tsun Li

Anti-Spoofing in Face Recognition: Deep Learning and Image Quality Assessment-Based Approaches . . . . . 51
    Wael Elloumi, Aladine Chetouani, Tarek Ben Charrada, and Emna Fourati

Deep Learning for Biometric Face Recognition: Experimental Study on Benchmark Data Sets . . . . . 71
    Natalya Selitskaya, S. Sielicki, L. Jakaite, V. Schetinin, F. Evans, M. Conrad, and P. Sant

Deep Learning Models for Face Recognition: A Comparative Analysis . . . . . 99
    Arindam Chaudhuri

Developing Cloud-Based Intelligent Touch Behavioral Authentication on Mobile Phones . . . . . 141
    Zhi Lin, Weizhi Meng, Wenjuan Li, and Duncan S. Wong

Constellation-Based Deep Ear Recognition . . . . . 161
    Dejan Štepec, Žiga Emeršič, Peter Peer, and Vitomir Štruc

Person Retrieval in Surveillance Videos Using Deep Soft Biometrics . . . . . 191
    Hiren J. Galiyawala, Mehul S. Raval, and Anand Laddha

Deep Spectral Biometrics: Overview and Open Issues . . . . . 215
    Rumaisah Munir and Rizwan Ahmed Khan

Biometric Blockchain: A Secure Solution for Intelligent Vehicle Data Sharing . . . . . 245
    Bing Xu, Tobechukwu Agbele, and Richard Jiang

Optical Flow Estimation with Deep Learning, a Survey on Recent Advances . . . . . 257
    Stefano Savian, Mehdi Elahi, and Tammam Tillo

The Rise of Data-Driven Models in Presentation Attack Detection . . . . . 289
    Luis A. M. Pereira, Allan Pinto, Fernanda A. Andaló, Alexandre M. Ferreira, Bahram Lavi, Aurea Soriano-Vargas, Marcos V. M. Cirne, and Anderson Rocha

Index . . . . . 313

Using Age Information as a Soft Biometric Trait for Face Image Analysis

Haoyi Wang, Victor Sanchez, Wanli Ouyang, and Chang-Tsun Li

1 Introduction

Biometrics aims to determine the identity of an individual by leveraging the users’ physiological or behavioural attributes [23]. Physiological attributes refer to the physical characteristics of the human body, such as the face, iris and fingerprint. Behavioural attributes, on the other hand, indicate the particular patterns of a person’s behaviour, including gait, voice and keystroke dynamics. Among all these biometric attributes, the face is the most commonly used one due to its accessibility and the fact that face-based biometric systems require little cooperation from the subject. Besides the identity information, other ancillary information such as age, race and gender (often referred to as soft biometrics) can also be retrieved from the face. Soft biometrics is the set of traits that provide some information to describe individuals, but do not have the capability to discriminate identities due to their lack of distinctiveness and permanence [22]. Although soft biometric traits alone cannot distinguish among individuals, they can be used in conjunction with the identity information to boost the recognition or verification performance or be leveraged in
other scenarios. For example, locating persons-of-interest based on a combination of soft biometric traits by using surveillance footage. Compared to traditional biometrics, soft biometrics has the following merits. First, when the identity information is not available, soft biometrics can generate human-understandable descriptions to track the person-of-interest, such as in the 2013 Boston bombings [24]. Second, as the data abuse issue becomes more and more severe in the information age, using soft biometric traits to capture subjects’ ancillary information can preserve their identity while achieving the expected goals. For example, companies can efficiently recommend merchandises by merely knowing the age or the gender of their potential customers. Third, collecting soft biometric traits do not require the participation of the subject, which makes them easy to compute. Among all the soft biometric traits (age, gender, race, etc.) that can be obtained from face images, in this chapter, we focus on the age as it attracts the most attention from the research community, and can be used in various real-life applications. Specifically, the age-related face image analysis encompasses three areas: estimating the age (age estimation), synthesising younger or elder faces (age synthesis), and identifying or verifying a person across a time span (age-invariant face recognition). As to their real-life applications, the age estimation models can be widely embedded into the security control and surveillance monitoring applications. For example, such systems can run age estimation algorithms to prevent teenagers from purchasing alcohol and tobacco from vending machines or access adult-exclusive content on the Internet. The age synthesis models can be used, for example, to predict the outcome of cosmetic surgeries and generate special visual effects on characters of video games and films [12]. The age-invariant face recognition models can be used to efficiently track persons-of-interest like suspects or missing children over a long time span. Although the age-oriented face image analysis models can be used in a variety of applications, due to the underlying conditions of the individuals, such as their upbringing environment and genes, there are still several issues that remain unsolved. We will discuss these issues in the next section. After Krizhevsky et al. [25] demonstrated the robustness of the deep convolutional neural network (CNN) [26, 27] on the ImageNet dataset [10], CNN-based models have been widely deployed in computer vision and biometrics tasks. Some well-known CNN architectures are AlexNet [25], VGGNet [47], ResNet [17] and DenseNet [21]. In this chapter, we only focus on the CNN-based models for agerelated face image analysis and discuss their novelties and limitations. To provide a clear layout, we present the three areas of age-related face image analysis in individual sections. For each area, we first introduce its basic concepts, the available datasets and the evaluation methods. Then, we present a comprehensive review of recently published deep learning based methods. Finally, we discuss the future research trends by discussing the unaddressed issues in the existing deep learning based methods.

2 Age Estimation

As the name suggests, the purpose of age estimation is to estimate the real age (the number of years accumulated since birth) of an individual. The predicted age is mainly deduced from the age-specific features extracted by the feature extractor. Since CNNs are powerful tools for extracting features, state-of-the-art age estimation methods are CNN-based. A simple block diagram of a deep learning based age estimation model can be found in Fig. 1.

The first step in a deep learning based age estimation model is face detection and alignment, as the input image can contain objects other than the face and a large amount of background. This step can be achieved by either a traditional computer vision algorithm, like the histogram of oriented gradients (HOG) filter, or a state-of-the-art face preprocessing model, like a deep cascaded multi-task framework [57]. After the face is cropped from the original image and normalised (the mean value is subtracted), it is fed into the CNN backbone to estimate the age. In order to attain a good performance, the CNN is often designed to employ one or more loss functions to optimise its parameters. We will see later in this section that recent age estimation models either involve advanced loss functions or change the network architecture to improve performance.
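To make the pipeline concrete, the sketch below shows the estimation stage only, assuming the face has already been detected and aligned upstream (e.g. by a HOG-based detector or an MTCNN-style network). The ResNet-18 backbone and the 0–100 class range are illustrative assumptions, not a specific published model.

```python
# Minimal sketch of the age-estimation stage, under the assumptions stated above.
import torch
import torch.nn as nn
from torchvision import models

NUM_AGES = 101  # ages 0..100 treated as classes (an assumption for illustration)

class AgeEstimator(nn.Module):
    def __init__(self, num_ages=NUM_AGES):
        super().__init__()
        self.backbone = models.resnet18()  # generic CNN backbone, randomly initialised
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_ages)

    def forward(self, face):
        return self.backbone(face)         # logits over age classes

def estimate_age(model, aligned_face, mean_value):
    x = (aligned_face - mean_value).unsqueeze(0)   # mean subtraction, add batch dim
    with torch.no_grad():
        logits = model(x)
    return int(logits.argmax(dim=1))               # simplest possible read-out

# Toy usage: a random tensor stands in for a real 224x224 aligned face crop.
model = AgeEstimator()
face = torch.rand(3, 224, 224)
print(estimate_age(model, face, face.mean()))
```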

2.1 Datasets for Age Estimation

Among all the age-oriented datasets, the MORPH II dataset [44] is the most broadly used to evaluate age estimation models. This dataset contains more than 55,000 face images from about 13,000 subjects, with ages ranging from 16 to 77 and an average age of 33. Each image in the MORPH II dataset is associated with identity, age, race and gender labels. The second most commonly used dataset to evaluate age estimation models is the FG-NET [9] dataset, which contains 1002 images from

Fig. 1 A simplified diagram of a deep learning based age estimation model. Since we are only interested in the face region, the face should be located and aligned from the original image before fed into the CNN model. Illustration by Tian Tian


Table 1 Most commonly used datasets to evaluate age estimation models

Dataset     # Images  # Subjects  Age range  Noise-free label  Mugshot
MORPH II    55,134    13,618      16–77      Yes               Yes
FG-NET      1002      82          0–69       Yes               No
CACD        163,446   2000        16–62      No                No
IMDB-WIKI   523,051   20,284      0–100      No                No

82 subjects. However, due to the limited number of images, the FG-NET dataset is usually only used during the evaluation phase. Since the training of CNN-based models requires a large number of training samples, to meet this requirement, two large-scale age-oriented datasets have been built, the Cross-Age Celebrity Dataset (CACD) [7] and the IMDB-WIKI dataset [45]. The CACD contains more than 160,000 face images from 2000 individuals with ages ranging from 16 to 62. The IMDB-WIKI dataset contains 523,051 face images (460,723 images from IMDB and 62,328 images from Wikipedia) from 20,284 celebrities. However, both datasets contain noisy (incorrect) labels. The details of these four datasets are tabulated in Table 1.

2.2 Evaluation Metrics for Age Estimation Models

There are two evaluation metrics commonly used for age estimation models. The first one is the mean absolute error (MAE), which measures the average absolute difference between the predicted age and the ground truth:

\mathrm{MAE} = \frac{\sum_{i=1}^{M} e_i}{M},    (1)

where e_i is the absolute error between the predicted age \hat{l}_i and the input age label l_i for the i-th sample. The denominator M is the total number of testing samples. The other evaluation metric is the cumulative score (CS), which measures the percentage of images that are correctly classified within a certain range:

\mathrm{CS}(n) = \frac{M_n}{M} \times 100\%,    (2)

where M_n is the number of images whose predicted age \hat{l}_i is in the range of [l_i − n, l_i + n], and n indicates the number of years.
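As a quick reference, both metrics can be computed directly from arrays of predicted and ground-truth ages; the NumPy sketch below follows Eqs. (1) and (2) (the variable names and the toy example are illustrative only).

```python
# Sketch of the MAE and CS(n) metrics from Eqs. (1) and (2).
import numpy as np

def mae(pred_ages, true_ages):
    pred_ages, true_ages = np.asarray(pred_ages), np.asarray(true_ages)
    return np.abs(pred_ages - true_ages).mean()

def cumulative_score(pred_ages, true_ages, n):
    pred_ages, true_ages = np.asarray(pred_ages), np.asarray(true_ages)
    within_n = np.abs(pred_ages - true_ages) <= n   # |l_i - l̂_i| <= n years
    return within_n.mean() * 100.0                  # percentage of images

# Toy example: MAE = 2.25, CS(3) = 75.0 for these predictions.
print(mae([25, 35, 41, 60], [24, 30, 40, 62]))
print(cumulative_score([25, 35, 41, 60], [24, 30, 40, 62], n=3))
```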


2.3 Deep Learning Based Age Estimation Methods

Due to the appearance differences among different images of the same individual, extracting age-specific features and predicting the precise age can be onerous. Owing to the extraordinary capability of CNNs for feature extraction, Wang et al. [49] were the first to employ a CNN to tackle the age estimation problem. In [49], the authors design a two-layer CNN to extract the age-specific features and use manifold learning algorithms (support vector regression (SVR) and support vector machines (SVMs)) to compute the final output. Their results show a dramatic improvement on the MORPH II dataset compared to the methods that use traditional machine learning [6, 13, 56].

As aforementioned, recent deep learning based attempts for age estimation can be classified into two categories. The first category is about improving the accuracy by leveraging customised loss functions rather than conventional classification loss functions, such as the cross-entropy loss. The second category boosts the estimation performance by modifying the network architecture of a plain CNN model. We first review the recent age estimation works based on these two categories. Then, we discuss some works that involve multi-task learning frameworks to learn age information along with other tasks.

2.3.1 Customised Loss Functions for Age Estimation

Traditionally, the age estimation problem can be treated as a multi-class classification problem [39] or a regression problem [37]. Rothe et al. [45] propose a formulation that combines regression and classification for this particular task. Since age estimation usually involves a large number of classes (approximately 50–100) and based on the fact that the discretisation error becomes smaller for the regressed signal when the number of classes becomes larger, they compute the final output value by using the following equation:

E(O) = \sum_{i=1}^{n} p_i y_i,    (3)

where O is the output from the final layer of the network after a softmax function, p_i is its i-th entry, y_i is the discrete year representing the i-th class and n indicates the number of classes. Evaluation results demonstrate that this method outperforms both conventional regression and classification in the ChaLearn LAP 2015 apparent age estimation challenge [11] and other benchmarks.

Recent solutions for age estimation have shown that there is an ordinal relationship among ages and leveraged this relationship to design customised loss functions. The ordinal relation indicates that the age of an individual increases as time elapses, since ageing is a non-stationary process. Specifically, in [31], the authors construct a label ordinal graph based on a set of quadruplets from training batches and use
a hinge loss to force the topology of this graph to remain constant in the feature space. On the other hand, Niu et al. [37] treat the age estimation problem as an ordinal regression problem [29]. The ordinal regression is a type of classification method which transforms the conventional classification into a series of simpler binary classification subproblems. In [37], each binary classification subproblem is used to determine whether the estimated age is younger or elder than a specific age. To this end, the authors replace the final output layer with n binary classifiers, where n equals the number of classes. Let us assume that there are N samples \{x_i, y_i\}_{i=1}^{N}, where x_i is the i-th input image and y_i is the corresponding age label, and T binary classifiers (tasks). The loss function to optimise the multi-output CNN can then be formulated as:

E_m = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \lambda^t \, \mathbf{1}\{o_i^t = y_i^t\} \, w_i^t \log\big(p(o_i^t \mid x_i, W^t)\big),    (4)

where o_i^t indicates the output of the t-th binary linear layer, y_i^t indicates the label for the t-th task of the i-th input and w_i^t indicates the weight of the i-th image for the t-th task. Moreover, W^t is the weight parameter for the t-th task, and λ^t is the importance coefficient of the t-th task. Chen et al. [8] take a step further by training separate networks for each age group so that each network can learn specific features for the target age group rather than sharing the common features as in [37]. Experiments show that this separate training strategy leads to a significant performance gain on the MORPH II dataset under both evaluation metrics.

Li et al. [30] also consider the ordinal relation among ages in their work. However, instead of applying the age estimation model on the entire dataset, they take the different ageing pattern of different races and genders into consideration and leverage the domain adaptation methodology to tackle the problem. As stated in their paper, it is difficult to collect and label sufficient images of every population (one particular race or gender) to train the network. Therefore, an age estimation model that is trained on the population with an insufficient number of images would have lower accuracy than models trained on other populations. In their work, they first train an age estimation model under the ranking based formulation on the source population (the population with sufficient images). Then, they fine-tune the pre-trained model on the target population (the population with a limited number of images) by adopting a pairwise loss function to align the age-specific features of the two populations. The loss function used for feature alignment is:

\sum_{i=1}^{N^s} \sum_{j=1}^{N^t} \big\{ 1 - l_{ij} \big( \eta - d(\hat{x}_i^s, \hat{x}_j^t) \big) \cdot \omega(y_i^s, y_j^t) \big\},    (5)

where \hat{x}_i^s and \hat{x}_j^t are the high-level features extracted from the network, y_i^s and y_j^t are the labels of the images from the source and target populations, respectively. d(·) is the Euclidean distance. η and ω(·) are a predefined threshold value and a
weighting function, respectively. l_{ij} is set to 1 if y_i^s = y_j^t and to −1 otherwise. The basic idea behind this function is that when the two images have the same age label, the model tries to minimise:

d(\hat{x}_i^s, \hat{x}_j^t) - 1,    (6)

which reduces the Euclidean distance between the two features. When the two images have different labels, i.e. y_i^s ≠ y_j^t, the model tries to minimise:

\frac{3 - d(\hat{x}_i^s, \hat{x}_j^t)}{\omega(y_i^s, y_j^t)},    (7)

where ω(y_i^s, y_j^t) is a number smaller than one. This pushes the two features away from each other with a large distance value. In addition, the distance value is proportional to the age difference between the two images.

Another research trend based on customised loss functions is to involve joint loss functions to optimise the age estimation model. Current works that involve joint loss functions include [20] and [40]. Hu et al. [20] study the problem where the labelled data are not sufficient. In that work, the authors use the Gaussian distributions as the labels rather than specific numbers, which allows the model to learn the similarity between adjacent ages. Since the labels are distributions, they use the Kullback–Leibler (KL) divergence to minimise the dissimilarity between the output probability and the label. The KL divergence can be formulated as:

D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}[\log(P) - \log(Q)],    (8)

where P and Q are two distributions. Besides the KL divergence, their model also involves an entropy loss and a cross-entropy loss. The entropy loss is used to make sure the output probability only has one peak since an image can only be associated with one specific age. The cross-entropy loss is used to consider the age difference between images for the non-labelled datasets. Moreover, for the non-labelled datasets, their model accepts two images as input simultaneously. For example, for two images a and b, where a is K years younger than b, then the age of a should not be larger than K. For the image a, the authors split the output layer into two parts, the first part is the neurons with the indices 0 to K, and the second part is the neurons with the indices K to M, where M is the total number of classes. Based on the aforementioned assumption, the sum of the values in the second part should be 0 while the sum of the values in the first part should be a positive number. The authors treat this problem as a binary classification problem and use the crossentropy loss to minimise the probability error. Pan et al. [40] also use the Gaussian distribution to represent the age label. In addition, it proposes a mean-variance loss to penalise the mean and variance value of the predicted age distribution. The mean-variance loss is used alongside the classification loss to optimise the model, which currently achieves the best performance on the MORPH II dataset and the FG-NET dataset under the MAE metric.
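To illustrate these two ideas, the sketch below builds a Gaussian label distribution and a KL loss in the spirit of [20], and a mean-variance penalty in the spirit of [40]. It assumes single-sample, 1-D logits and an arbitrary sigma/weighting; the exact formulations and hyperparameters in the original papers differ.

```python
# Illustrative sketches (not the authors' exact formulations) of a KL loss
# against a Gaussian label distribution [20] and a mean-variance penalty [40].
import torch
import torch.nn.functional as F

def gaussian_label(age, num_ages=101, sigma=2.0):
    ages = torch.arange(num_ages, dtype=torch.float32)
    dist = torch.exp(-0.5 * ((ages - age) / sigma) ** 2)
    return dist / dist.sum()                       # soft label over age classes

def kl_label_loss(logits, age):
    log_p = F.log_softmax(logits, dim=-1)          # predicted log-probabilities
    q = gaussian_label(float(age), logits.shape[-1])
    return F.kl_div(log_p, q, reduction="sum")     # KL(q || p), as computed by F.kl_div

def mean_variance_loss(logits, age, lam=0.2):
    p = F.softmax(logits, dim=-1)
    ages = torch.arange(logits.shape[-1], dtype=torch.float32)
    mean = (p * ages).sum()
    var = (p * (ages - mean) ** 2).sum()
    return (mean - age) ** 2 + lam * var           # penalise biased mean and wide spread
```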


Other works worth noting that also use customised loss functions are [33] and [18]. Liu et al. [33] consider both the ordinal relation among ages and the age distribution, and involve metric learning to cluster the age-specific features in the feature domain. On the other hand, He et al. [18] adopt the triplet loss [46] from the conventional face recognition task and use it for age estimation.
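For reference, a triplet loss of the generic kind borrowed from face recognition can be written in a few lines; this is a sketch (the margin value and the anchor/positive/negative sampling scheme are assumptions), not the exact configuration used in [18].

```python
# Generic triplet loss sketch: anchor/positive share an age (or age group),
# the negative comes from a distant age.
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_ap = F.pairwise_distance(anchor, positive)   # pull same-age features together
    d_an = F.pairwise_distance(anchor, negative)   # push different-age features apart
    return F.relu(d_ap - d_an + margin).mean()
```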

2.3.2 Modifying the Network Architecture for Age Estimation

Instead of using plain CNN models (a stack of convolutional layers), some works modify the network architecture to design efficient age estimation models, which is another trending research topic to boost the estimation performance. Yi et al. [55] design a multi-column CNN for age estimation. They take the facial attributes (the eyes, nose, mouth, etc.) into consideration and train several sub-networks for each attribute. All the features extracted from different attributes are then fused before the final layer. Yi et al. [55] is also one of the earliest works that uses a CNN for age estimation. Recently, Wang et al. [50], inspired by advances in Neuroscience [5], have designed the fusion network for age estimation. Neuroscientist have discovered that when the primate brain is processing the facial information, different neurons respond to different facial features [5]. Based on this discovery, the authors intuitively assume that the accuracy of the age estimation problem may be largely improved if the CNN learns from age-specific patches. Specifically, their model takes the face and several age-specific facial patches as successive inputs. The aligned face, which provides most of the information, is the primary input that is fed into the lowest layer to have the longest learning path. The selected agespecific patches are subsequently fed into the CNN, in a sequential manner. The patch selection is based on the AdaBoost algorithm. Moreover, the input feeding scheme at the middle-level layers can be viewed as shortcut connections that boost the flow of the age-specific features. The architecture of their proposed model can be found in Fig. 2. Taheri and Toygar [48] also fuse the information during the learning process. They design a fusion framework to fuse the low-level features, the middle-level features and the high-level features from a CNN to estimate the age.

2.3.3 Age Estimation with Multi-Task Learning

Another challenging research area is multi-task learning, which combines age estimation with other facial attribute classification problems or with face recognition. Multi-task learning is a learning scheme that can learn several tasks simultaneously, which allows the network to learn the correlation among all the tasks and saves training time and computational resources.


Fig. 2 The architecture of the fusion network in [50]. The selected patches (P1 to P5) are fed to the network sequentially as the secondary learning source. The input of patches can be viewed as shortcut connections that enhance the learning of age-specific features

Levi and Hassner [28] first design a three-layer CNN to classify both the age and the race. Recently, Hsieh et al. [19] design a CNN with ten layers for age estimation, gender classification and face recognition. Results show that this joint learning scheme can boost the performance of all three tasks. Similarly, Ranjan et al. [43] propose an all-in-one face analyser which can detect and align faces, detect smiles, and classify age, gender and identity simultaneously. They use a pre-trained network for face recognition and fine-tune it using the target datasets. Authors argue that the network pre-trained for the face recognition task can capture the fine-grained details of the face better than a randomly initialised one. Each subnetwork used for each task is then branched out from the main path based on the level of features on which they depend. Experimental results demonstrate a robust performance on all the tasks. Lately, Han et al. [16] also involve age estimation in a multi-task learning scheme for the face attribute classification problem. Different from the aforementioned works, they group attributes based on their characteristics. For example, since the age is an ordinal attribute, it is grouped with other ordinal attributes like the hair length. Rather than sharing the high-level features among all the attributes, each group of attributes has independent high-level features. Results of the aforementioned methods on the MORPH II dataset are tabulated in Table 2. The results are only reported based on the MAE metric since some of the works do not involve the CS metric. Note that although some works have reported better results by using a pre-trained network, for a fair comparison, we do not include those in the table.

Table 2 State-of-the-art age estimation results on the MORPH II dataset

Method                    Result
Yi et al. [55]            3.63
Niu et al. [37]           3.27
Rothe et al. [45]         3.25
Liu et al. [31]           3.12
Han et al. [16]           3.00
Chen et al. [8]           2.96
Liu et al. [33]           2.89
Taheri and Toygar [48]    2.87
Wang et al. [50]          2.82
Li et al. [30]            2.80
Hu et al. [20]            2.78
He et al. [18]            2.71
Pan et al. [40]           2.51

The results are based on the MAE metric (the lower, the better)
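As a minimal illustration of the multi-task schemes discussed above, the sketch below shares one backbone between an age head and a gender head and sums the per-task losses. The layer sizes, task set and loss weights are illustrative assumptions, not any of the cited architectures [16, 19, 28, 43].

```python
# Sketch of a shared-backbone multi-task network (age estimation + gender
# classification), under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class MultiTaskFaceNet(nn.Module):
    def __init__(self, num_ages=101):
        super().__init__()
        backbone = models.resnet18()
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                  # shared feature extractor
        self.backbone = backbone
        self.age_head = nn.Linear(feat_dim, num_ages)
        self.gender_head = nn.Linear(feat_dim, 2)

    def forward(self, x):
        feat = self.backbone(x)
        return self.age_head(feat), self.gender_head(feat)

def joint_loss(age_logits, gender_logits, age, gender, w_age=1.0, w_gender=1.0):
    # Weighted sum of per-task classification losses.
    return w_age * F.cross_entropy(age_logits, age) + \
           w_gender * F.cross_entropy(gender_logits, gender)
```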

2.4 Future Research Trends on Age Estimation

Although deep learning based age estimators have achieved much better results than models that use traditional machine learning methods, there are still some issues that have not been addressed yet. First, existing age-oriented datasets like the MORPH II dataset and the FG-NET dataset involve other variations like pose, illumination, expression (PIE) and occlusion. With these unexpected factors, extracting age-specific features is onerous. Alnajar et al. [1] show that expression can downgrade the performance of age estimation models, and propose a graphical model to tackle the expression-invariant age estimation problem. Such a disentangled age estimation problem has not yet been studied with a CNN, which could be a possible future research trend. Another possible topic is to build large-scale noise-free datasets. Recent datasets for face recognition have several million training samples [4, 15]. However, the largest noise-free dataset for age estimation (the MORPH II dataset) has only 40,000–50,000 images for training based on different data partition strategies. Therefore, a larger noise-free dataset is needed to help boost the age estimation performance further.

3 Age Synthesis

Compared to age estimation, age synthesis has not gained much attention from the research community yet. Age synthesis methods aim to generate elder or younger faces by rendering facial images with natural ageing or rejuvenating effects.
Fig. 3 A simplified block diagram of an age synthesis model. An age synthesis model usually comprises two processes: the ageing process and the rejuvenating process. Illustration by Tian Tian

The synthesis is usually conducted between age categories (e.g. the 20s, 30s, 40s) rather than specific ages (e.g. 22, 25, 29), since there is no noticeable visual change of a face over a several-year span. A simplified block diagram of an age synthesis model can be found in Fig. 3. In Fig. 3, the generative model is usually an adversarial autoencoder (AAE) [34] or a generative adversarial network (GAN) [14] in deep learning based methods. The original GAN, which was introduced by Goodfellow et al., is capable of generating realistic images by using a minimax game. There are two components in the original GAN: a generator used to generate expected outputs and a discriminator used to discriminate the real images from the fake (generated) ones. The loss function used in the original GAN is:

V(D, G) = \min_G \max_D \; \mathbb{E}_{x \sim P_{data}(x)} \log[D(x)] + \mathbb{E}_{z \sim P_z} \log[1 - D(G(z))],    (9)

where D and G, respectively, denote the discriminator and generator learning functions; and x and z, respectively, denote the real data and the input noise. In this model, the discriminator usually converges faster than the generator due to the saturation problem in the log loss. Several variations have been introduced to tackle this problem, including the Wasserstein GAN (WGAN) [3], the f-GAN [38] and the Least Squares GAN (LSGAN) [35]. Since the age synthesis models also require age information for the training phase, they can also rely on the datasets mentioned in Sect. 2.1 for training and evaluation. The most broadly used datasets to evaluate age synthesis models are the MORPH II dataset, the CACD and the FG-NET dataset. Typically, the MORPH II dataset and the CACD are used for both training and evaluation, and the FG-NET dataset is only involved in the evaluation phase due to its limited number of samples.


3.1 Evaluation Methods for Age Synthesis Models

Although age synthesis methods have attracted important attention from the research community, several challenges make the synthesis process hard to achieve. First, age synthesis benchmark datasets like the CACD involve other variations like the PIE and occlusion. With these unexpected factors, extracting age-specific features is onerous. Second, existing datasets do not have enough images covering a wide age range for each subject. For example, the MORPH II dataset only captures a time span of 164 days, on average, which may make the learning of long-term personalised ageing and rejuvenating features an unsupervised task. Third, the underlying conditions of the individuals, such as their upbringing environment and genes, make the whole synthesis process a difficult prediction task.

Based on these aforementioned challenges, researchers have established two criteria to measure the quality of synthesised faces. One is the synthesis accuracy, under which synthesised faces are fed into an age classification model to test whether the faces have been transformed into the target age category. Another criterion is the identity permanence, which relies on face verification algorithms to test whether the synthesised face and the original face belong to the same person [54].

3.2 Deep Learning Based Age Synthesis Methods

With the increasing popularity of deep learning, several age synthesis models have been proposed using various network architectures. Antipov et al. [2] first leverage a conditional GAN [36] to synthesise elderly faces. In their work, the authors first pre-train an autoencoder-shaped generator to reconstruct the original input. During the pre-training, they add an identity-preserving constraint on the latent features to force the identity information to remain constant during the transformation. The identity-preserving constraint is an L2 norm which can be formulated as:

Z^{*}_{IP} = \operatorname{argmin} \, \| FR(x) - FR(\bar{x}) \|,    (10)

where x is the input image, \bar{x} is the reconstructed image, and FR(·) is a pre-trained face recognition model [46] used to extract identity-specific features. After pre-training the generator, they fine-tune the network by using the age labels as conditions. Zhang et al. [58] also use the conditional adversarial learning scheme to synthesise elder faces by using a conditional adversarial autoencoder. Different from [2], they do not use a pre-trained face recognition model. Instead, they implement an additional discriminator to discriminate the latent features that belong to different subjects. Therefore, their model can be trained end-to-end.
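A minimal sketch of the identity-preserving constraint of Eq. (10) is given below: an L2 penalty between face-recognition features of the input and the reconstruction. The `face_embedder` is an assumed stand-in for a frozen, pre-trained recognition network (e.g. a FaceNet-style model [46]); it is not part of the original formulation's code.

```python
# Sketch of the identity-preserving constraint of Eq. (10), assuming a frozen
# pre-trained face-recognition embedder `face_embedder`.
import torch

def identity_preserving_loss(face_embedder, x, x_recon):
    with torch.no_grad():
        target_feat = face_embedder(x)        # FR(x), kept fixed as the target
    recon_feat = face_embedder(x_recon)       # FR(x̄), gradients flow to the generator
    return torch.norm(recon_feat - target_feat, p=2, dim=-1).mean()
```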


Wang et al. [52] recently propose the Identity-Preserving Conditional GAN (IPCGAN). They use a similar strategy to [2], which tries to minimise the distance between the identity-specific features of the input and the output in the feature space. To increase the synthesis accuracy, they pre-train an age estimator to estimate the age of the generated face and use the gradient from this pre-trained model to optimise the latent features through backpropagation. In this way, the latent features can learn more accurate age information. Yang et al. [54] use a GAN with a pyramid-shaped discriminator for age synthesis. The pyramid-shaped discriminator can discriminate multi-level age-specific features extracted from a pre-trained age estimator, while conventional discriminators can only discriminate the high-level feature from the images. Following the previous works, they employ a pre-trained face recognition model to preserve the identity information. Experimental results show that their method can generate realistic images with rich ageing and rejuvenating characteristics.

It is worth noting that both [52] and [54] leverage the GAN loss of the LSGAN. In the original GAN, when the distribution of the real data and the generated data are separated from each other, the gradient of the Jensen–Shannon divergence vanishes. LSGAN replaces the log loss of the original GAN by the L2 loss. The optimisation in the LSGAN can be seen as minimising the Pearson χ² divergence, which efficiently solves the saturation problem in the original GAN loss while converging much faster than other distance metrics, such as the Wasserstein distance. Taking the ageing process as an example, the loss functions in LSGAN are:

L_D = \mathbb{E}_{x \sim P_{old}(x)}[(D(x) - 1)^2] + \mathbb{E}_{x \sim P_{young}(x)}[D(G(x))^2],    (11)

L_G = \mathbb{E}_{x \sim P_{young}(x)}[(D(G(x)) - 1)^2],    (12)

where L_D is used to optimise the discriminator and L_G is used to optimise the generator. Examples of ageing results of [54] can be found in Fig. 4. The authors divide the data into four categories according to the following age ranges: 30−, 31–40, 41–50 and 51+. In the figure, the left entry of each set of images is the original face from the dataset and the other three images are the generated results.
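The least-squares objectives of Eqs. (11)–(12) can be sketched directly for the ageing direction: real "old" faces are pushed towards a discriminator score of 1, generated faces towards 0, while the generator drives its outputs towards 1. D and G below are placeholders for any discriminator and generator modules; this is a generic illustration, not the training code of [52] or [54].

```python
# Sketch of the LSGAN objectives in Eqs. (11)-(12), ageing direction only.
import torch

def lsgan_d_loss(D, G, x_old, x_young):
    real_score = D(x_old)
    fake_score = D(G(x_young).detach())           # stop gradients into the generator
    return ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()

def lsgan_g_loss(D, G, x_young):
    fake_score = D(G(x_young))
    return ((fake_score - 1) ** 2).mean()
```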

3.3 Future Research on Age Synthesis

The most important topic that none of the above works cover is standardising the evaluation methods of age synthesis models. Early attempts [2, 58] mainly use subjective evaluation methods by taking surveys. Recent works [52, 54] evaluate their models based on the two criteria mentioned in Sect. 3.1, but they use different evaluation models. Specifically, Yang et al. [54] use a commercial face recognition and age estimation tool, while Wang et al. [52] use their pre-trained face recognition
Fig. 4 Ageing results of [54]. The first two rows are obtained on the CACD and the bottom two rows are obtained on the MORPH II dataset

and age estimation model. Such differences make related works hard to compare, which may hinder the development of further research. Moreover, from the previous section, we can see that it is common to use a pre-trained face recognition model or an age estimation model to guide the training process. However, those models may be noisy. According to [52], the age estimation accuracy of their age estimator is only about 30%. Due to the fact that the classification error is high (the classifier is noisy), the gradient for the age information is not accurate. The performance can then be boosted by developing other methods to guarantee the synthesis accuracy and keep the identity information simultaneously. New methods could also make the whole training process end-toend instead of pre-training several separate networks, which can save training time and computational resources.

4 Age-Invariant Face Recognition

Although the accuracy of conventional face recognition models (which do not explicitly consider intra-class variations, such as the pose, illumination and expression variations among the images of the same individual) is relatively high [42, 46], age-invariant face recognition (AIFR) is still a challenging task.

The datasets commonly used for evaluation of AIFR models are the MORPH II dataset and the FG-NET dataset. Moreover, the CACD-VS, which is a noise-free dataset derived from the CACD for cross-age face verification, is also used for AIFR. The CACD-VS contains 2000 positive cross-age image pairs and 2000 negative pairs. In addition, researchers also test their AIFR models on conventional face datasets such as the Labeled Faces in the Wild (LFW) dataset to demonstrate the generalisation ability of their models.


The evaluation criteria for AIFR models are the same as those for the conventional face recognition models, which are the recognition accuracy and the verification accuracy.

4.1 Deep Learning Based Age-Invariant Face Recognition Methods

Different from conventional face recognition methods, which need to consider only the inter-class variation (the appearance and feature difference among different subjects), AIFR models also need to consider the intra-class variation, which is the age difference among the images of the same subject. Wen et al. [53] is the first work that involves a CNN for AIFR. In this work, the authors propose the latent feature fully connected layer (LF-FC) and the latent identity analysis (LIA) to extract the age-invariant identity-specific features. The LIA is formulated as:

v = \sum_{i=1}^{d} U_i x_i + \bar{v},    (13)

where U_i is the corresponding matrix in which the columns span the subspace of different variations that need to be learned, x_i is the normalised latent variables from the CNN, and \bar{v} is the mean of all the facial features. The output v is the set of age-invariant features. As stated in [53], each set of facial features can be decomposed into different components based on different supervised signals. Therefore, Eq. (13) can be rewritten as:

v = U_{id} x_{id} + U_{ag} x_{ag} + U_{e} x_{e} + \bar{v},    (14)

where U_{id} x_{id} represents the identity-specific component used to achieve AIFR, U_{ag} x_{ag} represents the age-specific component which encodes the age variation, and U_{e} x_{e} represents the noise component. The authors then use the expectation-maximization (EM) algorithm to learn the parameters of the LIA. Note that the LIA is only used to optimise the linear layer in the network, i.e. the LF-FC layer. Parameters in the convolutional layers are optimised by using the stochastic gradient descent (SGD) algorithm. Since the convolutional layers and the LF-FC layer are trained to learn different features (the convolutional layers learn the conventional facial features, and the LF-FC layer learns the age-invariant features), the authors use a coupled learning scheme to optimise the network. Concretely, when optimising the convolutional layers, they freeze the LF-FC layer (fix the parameters), and when optimising the LF-FC layer, they freeze the convolutional layers.
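The alternating-freezing part of this coupled scheme can be sketched with ordinary PyTorch parameter freezing. The sketch below only illustrates the freezing mechanics on a generic model with assumed `conv_layers` and `lf_fc` attribute names; the LIA/EM update itself, which [53] uses for the LF-FC layer, is not reproduced.

```python
# Sketch of the coupled training scheme: update the conv layers with the
# LF-FC-style layer frozen, then swap (assumed module names, generic loss).
import torch

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def coupled_conv_step(model, batch, labels, opt_conv, loss_fn):
    # Phase 1: optimise convolutional layers with SGD while lf_fc is frozen.
    set_requires_grad(model.lf_fc, False)
    set_requires_grad(model.conv_layers, True)
    opt_conv.zero_grad()
    loss = loss_fn(model(batch), labels)
    loss.backward()
    opt_conv.step()
    # Phase 2 (not shown): freeze conv_layers and update lf_fc, e.g. via the
    # latent identity analysis / EM procedure described in [53].
    set_requires_grad(model.lf_fc, True)
```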


Table 3 State-of-the-art results of various AIFR models on the MORPH II dataset

Method              #Test subjects   Accuracy
Wen et al. [53]     10,000           97.51%
Wang et al. [51]    10,000           98.55%
Zheng et al. [59]   3000             98.13%
Wang et al. [51]    3000             98.67%

Table 4 State-of-the-art results of various AIFR models on the CACD-VS dataset

Method              Accuracy
Wen et al. [53]     98.5%
Wang et al. [51]    99.2%

Table 5 State-of-the-art results of various AIFR models on the LFW dataset

Method              Accuracy
Wen et al. [53]     99.1%
Wang et al. [51]    99.4%

Zheng et al. [59] propose the age estimation guided convolutional neural network (AE-CNN) for AIFR. The basic idea of this work is to obtain age-specific features from an age estimation loss and remove them from the global facial features; the removal is performed by subtraction. Recently, Wang et al. [51] propose the orthogonal embedding CNN, in which the global features from the last fully connected layer are decomposed into two components: the age-specific component (features) x_age and the identity-specific component (features) x_id. Instead of considering the global features as a linear combination of x_age and x_id, they model these two components in an orthogonal manner, inspired by the A-Softmax [32]. The state-of-the-art results on three benchmarks can be found in Tables 3, 4 and 5; the reported numbers are accuracies in percentage. By using an advanced architecture (a ResNet-like model) and a customised loss function, Wang et al. [51] achieve the best performance on the MORPH II, CACD-VS and LFW datasets.
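To give a rough feel for an orthogonal decomposition of this kind, the sketch below splits a deep feature vector into its norm and its unit direction and treats them as the age-related and identity-related parts, respectively; this is an illustrative simplification, not the exact formulation of [51].

```python
import numpy as np

def orthogonal_decompose(x, eps=1e-12):
    """Split a deep feature vector into a radial (scalar) and an angular component.

    Illustrative assumption: the norm carries age-related information while the
    unit direction carries identity-related information.
    """
    x = np.asarray(x, dtype=np.float64)
    x_age = np.linalg.norm(x)          # radial component (scalar)
    x_id = x / (x_age + eps)           # angular component (unit vector)
    return x_age, x_id

# Recognition would then compare only the identity components, e.g. via the
# cosine similarity of x_id across two images.
```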

4.2 Future Research Trends on Age-Invariant Face Recognition

Although recent AIFR models can attain good results, these results could be further improved if larger age-oriented datasets were available for training and testing. Instead of building such datasets from the ground up, age synthesis methods can be used to enlarge and augment existing datasets by generating images of each subject at different ages or in different age groups. As a result, the training process could benefit from more training samples, and higher accuracy could be achieved. According to [41], there are two types of approaches for AIFR. One is the generative approach, in which synthesised faces are generated to match the target age, and recognition is then performed based on the synthesised faces. The other is the discriminative approach, which aims to discriminate faces at different ages by discovering the hidden relations among ages. Existing deep learning based methods belong to the second category. Therefore, the benefits of using a deep learning based generative approach are twofold: it enlarges the existing datasets and it tackles the problem from a different perspective.

5 Conclusions

Age is the most commonly used soft biometric trait in computer vision and biometrics tasks. Age-oriented models can be employed in a variety of real-life applications. However, due to the complexity of the ageing pattern and the diversity among individuals, age-related face image analysis remains challenging. In this chapter, we divided age-related face image analysis into three areas based on their applications: age estimation, age synthesis and age-invariant face recognition. We discussed each area in detail by first presenting its main concepts and the datasets commonly used for evaluation. Then, we discussed recent research works and analysed the remaining issues and unsolved problems. We also presented possible future research topics. For age estimation, researchers currently tackle the problem from two different angles: existing works either design customised loss functions to compute the estimated age or modify a basic CNN architecture. Important issues in this area are disentangling other unexpected variations, such as pose, illumination, expression (PIE) and occlusion, and constructing large-scale noise-free datasets to further boost estimation performance. For age synthesis, researchers often adopt GANs or AAEs to generate aged or rejuvenated faces from input faces. However, existing works do not have a unified approach to evaluate their models; therefore, an evaluation standard needs to be proposed. For age-invariant face recognition, researchers usually design models to discover the hidden relations among ages. In other words, existing works follow a discriminative approach, thus opening opportunities for the development of models that follow a generative approach.

References

1. F. Alnajar, Z. Lou, J.M. Álvarez, T. Gevers, et al., Expression-invariant age estimation, in BMVC (2014)
2. G. Antipov, M. Baccouche, J.-L. Dugelay, Face aging with conditional generative adversarial networks, in 2017 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2017), pp. 2089–2093
3. M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in International Conference on Machine Learning (2017), pp. 214–223


4. Q. Cao, L. Shen, W. Xie, O.M. Parkhi, A. Zisserman, Vggface2: a dataset for recognising faces across pose and age, in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (IEEE, Piscataway, 2018), pp. 67–74 5. L. Chang, D.Y. Tsao, The code for facial identity in the primate brain. Cell 169(6), 1013–1028 (2017) 6. K.-Y. Chang, C.-S. Chen, Y.-P. Hung, Ordinal hyperplanes ranker with cost sensitivities for age estimation, in 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, Piscataway, 2011), pp. 585–592 7. B.-C. Chen, C.-S. Chen, W.H. Hsu, Cross-age reference coding for age-invariant face recognition and retrieval, in European Conference on Computer Vision (Springer, Cham, 2014), pp. 768–783 8. S. Chen, C. Zhang, M. Dong, J. Le, M. Rao, Using ranking-CNN for age estimation, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 9. T. Cootes, A. Lanitis, The FG-NET aging database (2008). http://www-prima.inrialpes.fr/ FGnet/ 10. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009 (IEEE, Piscataway, 2009), pp. 248–255 11. S. Escalera, J. Fabian, P. Pardo, X. Baró, J. Gonzalez, H.J. Escalante, D. Misevic, U. Steiner, I. Guyon, ChaLearn looking at people 2015: apparent age and cultural event recognition datasets and results, in Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 1–9 12. Y. Fu, G. Guo, T.S. Huang, Age synthesis and estimation via faces: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 32(11), 1955–1976 (2010) 13. X. Geng, Z.-H. Zhou, K. Smith-Miles, Automatic age estimation based on facial aging patterns. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2234–2240 (2007) 14. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems (2014), pp. 2672–2680 15. Y. Guo, L. Zhang, Y. Hu, X. He, J. Gao. MS-Celeb-1M: a dataset and benchmark for largescale face recognition, in European Conference on Computer Vision (Springer, Cham, 2016), pp. 87–102 16. H. Han, A.K. Jain, F. Wang, S. Shan, X. Chen, Heterogeneous face attribute estimation: a deep multi-task learning approach. IEEE Trans. Pattern Anal. Mach. Intell. 40(11), 2597–2609 (2018) 17. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778 18. Y. He, M. Huang, Q. Miao, H. Guo, J. Wang, Deep embedding network for robust age estimation, in 2017 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2017), pp. 1092–1096 19. H.-L. Hsieh, W. Hsu, Y.-Y. Chen, Multi-task learning for face identification and attribute estimation, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, Piscataway, 2017), pp. 2981–2985 20. Z. Hu, Y. Wen, J. Wang, M. Wang, R. Hong, S. Yan, Facial age estimation with age difference. IEEE Trans. Image Process. 26(7), 3087–3097 (2017) 21. G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in CVPR, vol. 1 (2017), p. 3 22. A.K. Jain, S.C. Dass, K. Nandakumar, Soft biometric traits for personal recognition systems, in Biometric Authentication (Springer, Berlin, 2004), pp. 731–738 23. A.K. Jain, A.A. 
Ross, K. Nandakumar, Introduction to Biometrics (Springer Science & Business Media, New York, 2011) 24. J.C. Klontz, A.K. Jain, A case study on unconstrained facial recognition using the Boston marathon bombings suspects. Michigan State University, Technical Report, 119(120), 1 (2013)


25. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012), pp. 1097– 1105 26. Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989) 27. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436 (2015) 28. G. Levi, T. Hassner, Age and gender classification using convolutional neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2015), pp. 34–42 29. L. Li, H.-T. Lin, Ordinal regression by extended binary classification, in Advances in Neural Information Processing Systems (2007), pp. 865–872 30. K. Li, J. Xing, C. Su, W. Hu, Y. Zhang, S. Maybank, Deep cost-sensitive and order-preserving feature learning for cross-population age estimation, in IEEE International Conference on Computer Vision (2018) 31. H. Liu, J. Lu, J. Feng, J. Zhou, Ordinal deep feature learning for facial age estimation, in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (IEEE, Piscataway, 2017), pp. 157–164 32. W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, SphereFace: deep hypersphere embedding for face recognition, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1 (2017), p. 1 33. H. Liu, J. Lu, J. Feng, J. Zhou, Label-sensitive deep metric learning for facial age estimation. IEEE Trans. Inf. Forensics Secur. 13(2), 292–305 (2018) 34. A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders (2015), arXiv preprint arXiv:1511.05644 35. X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, S.P. Smolley, Least squares generative adversarial networks, in 2017 IEEE International Conference on Computer Vision (ICCV) (IEEE, Piscataway, 2017), pp. 2813–2821 36. M. Mirza, S. Osindero, Conditional generative adversarial nets (2014), arXiv preprint arXiv:1411.1784 37. Z. Niu, M. Zhou, L. Wang, X. Gao, G. Hua, Ordinal regression with multiple output CNN for age estimation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4920–4928 38. S. Nowozin, B. Cseke, R. Tomioka, f-GAN: Training generative neural samplers using variational divergence minimization, in Advances in Neural Information Processing Systems (2016), pp. 271–279 39. G. Ozbulak, Y. Aytar, H.K. Ekenel, How transferable are CNN-based features for age and gender classification?, in 2016 International Conference of the Biometrics Special Interest Group (BIOSIG) (IEEE, Piscataway, 2016), pp. 1–6 40. H. Pan, H. Han, S. Shan, X. Chen, Mean-variance loss for deep age estimation from a face, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 5285–5294 41. U. Park, Y. Tong, A.K. Jain, Age-invariant face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 947–954 (2010) 42. O.M. Parkhi, A. Vedaldi, A. Zisserman, et al., Deep face recognition, in BMVC, vol. 1 (2015), p. 6 43. R. Ranjan, S. Sankaranarayanan, C.D. Castillo, R. Chellappa, An all-in-one convolutional neural network for face analysis, in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (IEEE, Piscataway, 2017), pp. 17–24 44. K. Ricanek, T. Tesafaye, Morph: a longitudinal image database of normal adult ageprogression, in 7th International Conference on Automatic Face and Gesture Recognition, 2006. 
FGR 2006 (IEEE, Piscataway, 2006), pp. 341–345 45. R. Rothe, R. Timofte, L. Van Gool, Deep expectation of real and apparent age from a single image without facial landmarks. Int. J. Comput. Vis. 126(2–4), 144–157 (2018)


46. F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 815–823 47. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition (2014), arXiv preprint arXiv:1409.1556 48. S. Taheri, Ö. Toygar, On the use of DAG-CNN architecture for age estimation with multi-stage features fusion. Neurocomputing 329, 300–310 (2019) 49. X. Wang, R. Guo, C. Kambhamettu, Deeply-learned feature for age estimation, in 2015 IEEE Winter Conference on Applications of Computer Vision (WACV) (IEEE, Piscataway, 2015), pp. 534–541 50. H. Wang, X. Wei, V. Sanchez, C.-T. Li, Fusion network for face-based age estimation, in 2018 25th IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2018), pp. 2675–2679 51. Y. Wang, D. Gong, Z. Zhou, X. Ji, H. Wang, Z. Li, W. Liu, T. Zhang, Orthogonal deep features decomposition for age-invariant face recognition (2018), arXiv preprint arXiv:1810.07599 52. Z. Wang, X. Tang, W. Luo, S. Gao, Face aging with identity-preserved conditional generative adversarial networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 7939–7947 53. Y. Wen, Z. Li, Y. Qiao, Latent factor guided convolutional neural networks for age-invariant face recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4893–4901 54. H. Yang, D. Huang, Y. Wang, A.K. Jain, Learning face age progression: A pyramid architecture of GANs (2017), arXiv preprint arXiv:1711.10352 55. D. Yi, Z. Lei, S.Z. Li, Age estimation by multi-scale convolutional network, in Asian Conference on Computer Vision (Springer, Berlin, 2014), pp. 144–158 56. Y. Zhang, D.-Y. Yeung, Multi-task warped Gaussian process for personalized age estimation, in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, Piscataway, 2010), pp. 2622–2629 57. K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016) 58. Z. Zhang, Y. Song, H. Qi, Age progression/regression by conditional adversarial autoencoder, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (2017) 59. T. Zheng, W. Deng, J. Hu, Age estimation guided convolutional neural network for ageinvariant face recognition, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017), pp. 12–16

Person Re-identification with Soft Biometrics Through Deep Learning

Shan Lin and Chang-Tsun Li

1 Introduction

Person re-identification, also known as person Re-ID, is the task of recognising and continuously identifying the same person across multiple non-overlapping cameras in a surveillance system. Typical methods for automatic person identification are usually based on people's hard biometric traits such as fingerprints, irises or faces. However, in a video surveillance system, target individuals are generally captured at a distance in an uncontrolled environment. Such settings make it difficult to obtain these hard biometric traits due to the low resolution of the camera sensors, occlusion of the subjects, etc. [14]. Most of the existing research on person Re-ID focuses on extracting local view-invariant features of a person and learning a discriminative distance metric for similarity analysis. Soft biometrics such as gender, age, hairstyle or clothing are mid-level semantic descriptions of a person which are invariant to illumination, viewpoint and pose. Hence, in recent years, soft biometrics have been used in conjunction with identity information as auxiliary information to aid many Re-ID methods. Moreover, these semantic attribute labels bridge the gap between how machines and people recognise and identify human beings. By integrating soft biometrics, existing person Re-ID systems can be extended to

S. Lin () University of Warwick, Coventry, UK e-mail: [email protected] C.-T. Li School of Information Technology, Deakin University, Waurn Ponds, VIC, Australia e-mail: [email protected]; [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_2


a broader range of applications, such as text-to-person retrieval and image-to-description conversion. Many soft biometric attribute assisted person Re-ID methods have been proposed previously, but most of them are based on traditional machine learning approaches with handcrafted features. In [7–9], attributes are classified by SVMs from low-level descriptors and integrated in the metric learning step of person Re-ID. Su et al. [17] proposed a low-rank attribute embedding which embeds the binary attributes into a continuous attribute space based on local feature representations. Khamis et al. [5] developed a method based on attribute consistency for the same person and proposed a triplet loss for the attribute features. All these works are based on the small VIPeR and PRID datasets with only 21 binary attributes. In 2014, Deng et al. [1] released a large-scale pedestrian attribute dataset, PETA, which includes images from multiple person Re-ID datasets. However, the PETA dataset does not contain an adequate number of training images for deep learning based person Re-ID methods. Only after Lin et al. [10] released the attribute annotations for the two largest Re-ID datasets, Market-1501 [25] and DukeMTMC-reID [27], did soft biometrics start to be integrated into deep person Re-ID methods. However, the size of the annotated attribute set for Market-1501 and DukeMTMC-reID is still relatively limited compared with the PETA dataset. In this chapter, we first list the person Re-ID datasets with soft biometric labels and introduce the performance evaluation metrics for the person Re-ID task and the attribute recognition task. Then, we present recent soft biometrics based or assisted deep person Re-ID methods from three perspectives: supervised, semi-supervised and unsupervised learning. Finally, we discuss the unaddressed problems of soft biometrics in person Re-ID and outline potential future work.

2 Datasets and Evaluation

2.1 Datasets with Soft Biometrics Annotations

Currently, there are approximately 30 publicly available person Re-ID datasets. However, only a small portion of them come with soft biometric attribute annotations. Table 1 lists the broadly used person Re-ID datasets that come with soft biometric attributes. In [7], Layne et al. annotated the oldest person Re-ID dataset, VIPeR, with 15 binary attributes including gender, hairstyle and some attire attributes. Later, in [8], they increased the 15 attributes to 21 by introducing dark and light colour labels for hair, shirt and pants, and extended the attribute annotations to other datasets such as PRID [4] and GRID [12]. Their PRID and GRID annotations are limited to the cross-camera identities (200 identities for PRID and 250 identities for GRID). In 2014, Deng et al. released the pedestrian attribute recognition dataset PETA [1]. It consists of 19,000 images selected from multiple surveillance datasets and provides detailed soft biometric annotations with 61 binary labels and 11 different colour labels for four different body regions.


Table 1 Commonly used person Re-ID datasets with soft biometric attributes

Datasets             # ID    # Images   # Attributes   Attributes annotation
VIPeR1 [2]           632     1264       15             15 binary attributes [7]
VIPeR2 [2]           632     1264       21             21 binary attributes [8]
PRID [4]             934     24,541     21             Annotated for the 200 shared cross-camera identities [8]
GRID [12]            1025    1275       21             Annotated for the 250 shared cross-camera identities [8]
PETA [1]             –       19,000     105            61 binary attributes and 11 colours of four different regions [1]; includes multiple Re-ID datasets: VIPeR 1264 images, PRID 1134 images, GRID 1275 images, CUHK 4563 images
Market-1501 [25]     1501    32,217     30             9 binary attributes [10], 4 different age groups, 8 upper body colours, 9 lower body colours
DukeMTMC-reID [27]   1812    36,441     23             8 binary attributes [10], 8 upper body colours, 7 lower body colours

VIPeR1 is version 1 of the attribute annotations proposed in 2012; VIPeR2 indicates the second version of the attribute annotations with additional labels. The PETA dataset contains attributes from multiple Re-ID datasets.

With a total of 105 attributes, the PETA dataset is one of the richest annotated datasets for pedestrian attribute recognition. As the PETA dataset includes many popular person Re-ID datasets such as VIPeR, PRID, GRID and CUHK, it has been integrated as soft biometric traits into several person Re-ID approaches [13, 16, 18, 19]. With the rapid development of deep learning approaches for person Re-ID, the scales of VIPeR, PRID and GRID are too small for training deep neural networks. The two recently released datasets, Market-1501 [25] and DukeMTMC-reID [27], with a decent number of IDs and bounding boxes, provide a good amount of data for training and testing deep Re-ID models. In [10], Lin et al. annotated these two large datasets with 30 and 23 attributes, respectively. All 1501 identities in Market-1501 have been annotated with nine binary attributes, one of four age groups, and colours for the upper and lower body parts. The DukeMTMC-reID dataset has been annotated with eight binary attributes together with upper and lower body colours. The detailed statistics of the datasets can be found in Table 1.


2.2 Evaluation Metrics

The cumulative matching characteristics (CMC) curve is the most common metric used for evaluating person Re-ID performance. This metric is adopted since Re-ID is intuitively posed as a ranking problem, where each image in the gallery is ranked based on its comparison to the probe. The probability that the correct match appears at a rank equal to or less than a particular value is plotted against the size of the gallery set [2]. Due to the slow training time of deep learning models, the CMC curve comparisons for recent deep Re-ID methods are often simplified to comparing only the Rank-1, 5, 10 and 20 retrieval rates. However, the CMC curve evaluation is valid only when there is a single ground-truth match for each given query image. Recent datasets such as Market-1501 and DukeMTMC-reID usually contain multiple ground truths for each query image. Therefore, Zheng et al. [25] proposed the mean average precision (mAP) as a new evaluation metric. For each query image, the average precision (AP) is calculated as the area under its precision-recall curve; the mean of the average precisions (mAP) reflects the overall recall of the person Re-ID algorithm. The performance of current person Re-ID methods is usually examined by combining the CMC curve for retrieval precision evaluation and mAP for recall evaluation. For person Re-ID methods using soft biometrics, the biometric attribute recognition accuracy is also evaluated to show that the proposed methods effectively learn and utilise the given attribute information. For a comprehensive analysis of attribute prediction, the soft biometric evaluation metrics usually include the classification accuracy of each individual attribute and the average recognition rate over all attributes.
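To make the two metrics concrete, the sketch below computes the Rank-k retrieval rate and the mAP from per-query ranked match lists; the variable names and conventions (e.g. skipping queries without a ground-truth match) are our own assumptions rather than a specific toolkit's implementation.

```python
import numpy as np

def rank_k_accuracy(ranked_matches, k):
    """Fraction of queries whose first correct match appears within the top k.

    ranked_matches: one boolean array per query; element j is True when the
    j-th ranked gallery image is a correct match for that query.
    """
    hits = [np.asarray(m, dtype=bool)[:k].any() for m in ranked_matches]
    return float(np.mean(hits))

def mean_average_precision(ranked_matches):
    """Mean of the per-query average precision (area under the PR curve)."""
    aps = []
    for m in ranked_matches:
        m = np.asarray(m, dtype=bool)
        if not m.any():
            continue  # query with no ground-truth match in the gallery
        ranks_of_hits = np.nonzero(m)[0] + 1            # 1-based ranks of correct matches
        precision_at_hits = np.cumsum(m)[m] / ranks_of_hits
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps)) if aps else 0.0
```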

2.3 Supervised Deep Re-ID with Soft Biometrics Attributes

The visual appearance of a person can be easily affected by variations in illumination, posture and viewpoint across different cameras. Soft biometrics, as semantic mid-level features, are invariant across cameras and provide relevant information about a person's identity. By integrating soft biometrics into person Re-ID models, a deep person Re-ID network should obtain more robust view-invariant feature representations. Constructing new neural network structures which can fuse the identity information with the attribute information is a crucial step for fully supervised deep person Re-ID.

2.3.1 Supervised Attribute Assisted Verification Based Person Re-ID

Fig. 1 Illustration of attribute-complementary Re-ID net (ACRN)

Triplet loss is one commonly used loss for the person Re-ID task. By inputting an anchor image together with a positive sample and a negative sample, the similarity

between the positive pair should be high, while the similarity between the negative pair should be low. Since images of the same person should share the same soft biometric attributes, the triplet loss can also be applied to the attribute features. Based on this attribute consistency, Schumann et al. [15] proposed the Attribute-Complementary Re-ID Net (ACRN) architecture, which combines the image based triplet loss with an attribute based triplet loss. An overview of the ACRN architecture is illustrated in Fig. 1. The attribute recognition model is pre-trained on the PETA dataset. The attribute predictions for the three input images are then used to compute the attribute branch triplet loss. Summing it with the triplet loss from the raw image branch, the overall loss function for ACRN can be expressed as follows:

L_{ACRN} = \frac{1}{N} \sum_{i} \Big( d_i^{f,p} - d_i^{f,n} + m + \gamma \big( d_i^{att,p} - d_i^{att,n} \big) \Big),   (1)

where d_i^{f,p} = \| f_i^a - f_i^p \|_2^2 and d_i^{f,n} = \| f_i^a - f_i^n \|_2^2 denote the distances between the anchor-positive image pair and the anchor-negative image pair, and d_i^{att,p} = \| att_i^a - att_i^p \|_2^2 and d_i^{att,n} = \| att_i^a - att_i^n \|_2^2 denote the corresponding distances between the predicted attribute representations. Performance comparisons with other methods can be found in Table 3. With the additional attribute branch triplet loss, there is a 2–4% increase in both Rank-1 retrieval accuracy and mAP.
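A minimal sketch of such a combined image-plus-attribute triplet objective is given below; the margin, weighting and the clamping at zero follow the usual triplet-loss convention and are assumptions here, not the exact values or form used in [15].

```python
import numpy as np

def acrn_style_triplet_loss(f_a, f_p, f_n, att_a, att_p, att_n, m=0.5, gamma=0.1):
    """Combined image + attribute triplet loss over a batch of N triplets.

    f_*   : (N, D) deep image features for anchor / positive / negative
    att_* : (N, A) predicted attribute vectors for anchor / positive / negative
    m and gamma are illustrative values, not those used in the original work.
    """
    d_img_p = np.sum((f_a - f_p) ** 2, axis=1)
    d_img_n = np.sum((f_a - f_n) ** 2, axis=1)
    d_att_p = np.sum((att_a - att_p) ** 2, axis=1)
    d_att_n = np.sum((att_a - att_n) ** 2, axis=1)
    # Clamping at zero is the usual triplet-loss convention (an assumption here).
    per_triplet = np.maximum(0.0, d_img_p - d_img_n + m + gamma * (d_att_p - d_att_n))
    return per_triplet.mean()
```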

Fig. 2 Illustration of Attribute-Person Recognition (APR) network

2.3.2 Supervised Attribute Assisted Identification Based Person Re-ID

Unlike the triplet loss based methods, which treat person Re-ID as a verification task, Lin et al. [10] recast person Re-ID training as an identification task. They utilise classification losses for learning people's identities and attributes. With the soft biometric annotations of the two largest person Re-ID datasets, Market-1501 and DukeMTMC-reID, their proposed method simultaneously learns the attributes and the ID from the same feature maps of a backbone CNN. An overview of the Attribute-Person Recognition (APR) network is shown in Fig. 2. Each input image in the APR network passes through a ResNet-50 backbone feature extractor, and the extracted feature maps are used for both ID classification and attribute recognition. For K identities in training, the cross-entropy loss is used for ID classification:

L_{ID} = - \sum_{k=1}^{K} \log(p(k)) \, q(k),   (2)

where p(k) is the predicted probability and q(k) is the ground truth. The attribute prediction uses M softmax losses, each of the form:

L_{att} = - \sum_{j=1}^{m} \log(p(j)) \, q(j).   (3)

As the APR network is trained for both attribute prediction and identity classification, the overall loss function is the weighted summation of the two losses:

L_{APR} = \lambda L_{ID} + \frac{1}{M} \sum_{i=1}^{M} L_{att_i}.   (4)


By using only classification losses for the ID and the attributes, the APR network is much easier to train, with quick and smooth convergence compared with the triplet loss based network. The overall performance, shown in Table 3, is on a par with that of the triplet loss based ACRN method.
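The joint ID-plus-attribute classification objective of Eqs. (2)–(4) can be sketched in PyTorch as follows; the head structure, layer sizes and weighting are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class APRStyleHead(nn.Module):
    """Joint ID + attribute classification head on shared backbone features.

    A rough sketch of the idea in [10]; the number of attribute classifiers and
    the weighting factor lam are illustrative assumptions.
    """
    def __init__(self, feat_dim, num_ids, attr_class_counts, lam=1.0):
        super().__init__()
        self.id_fc = nn.Linear(feat_dim, num_ids)
        self.attr_fcs = nn.ModuleList(nn.Linear(feat_dim, c) for c in attr_class_counts)
        self.ce = nn.CrossEntropyLoss()
        self.lam = lam

    def forward(self, features, id_labels, attr_labels):
        # features: (B, feat_dim); attr_labels: (B, M) integer class per attribute
        id_loss = self.ce(self.id_fc(features), id_labels)
        attr_losses = [self.ce(fc(features), attr_labels[:, m])
                       for m, fc in enumerate(self.attr_fcs)]
        attr_loss = torch.stack(attr_losses).mean()  # average over the M attributes
        return self.lam * id_loss + attr_loss
```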

3 Semi-supervised Person Re-ID with Soft Biometrics

Although the Market-1501 dataset and the DukeMTMC-reID dataset are labelled with 30 and 23 attributes, respectively, the dimensionality of these soft biometric labels is far smaller than the 105 attributes of the PETA dataset. On the other hand, the PETA dataset does not provide enough training identities for deep learning approaches. If the PETA attributes could be transferred to a new dataset, they could provide more detailed auxiliary information for the person Re-ID task.

3.1 Semi-supervised Cross-Dataset Attribute Learning via Co-training

To address this problem, Su et al. [18] proposed the Semi-supervised Deep Attribute Learning (SSDAL) algorithm. The SSDAL method uses a co-training strategy which extends the attributes of a given labelled dataset to a different unlabelled person Re-ID dataset by utilising the ID information from the new dataset. The SSDAL algorithm can be summarised in three stages:
1. Training on the PETA pedestrian attribute dataset (excluding the target dataset), which is labelled with soft biometric attributes.
2. Fine-tuning on a large dataset labelled only with person IDs, using a triplet loss on the predicted attributes, with triplets formed from the identity labels.
3. Updating the model by predicting attribute labels for the combined dataset and performing a final fine-tuning on it.

A detailed illustration of the three stages of SSDAL can be found in Fig. 3, and Fig. 4 shows the attribute classification accuracy at the different stages. The SSDAL strategy gives a 2–3% increase over the baseline recognition accuracy. However, the person Re-ID matching in SSDAL is purely based on the soft biometric labels that are semi-supervised learned from the PETA dataset. As a result, SSDAL yields a relatively poor Re-ID performance, as shown in Table 3.


Fig. 3 Illustration of semi-supervised deep attribute learning (SSDAL) network (Stage 1: fully supervised dCNN training; Stage 2: fine-tuning using the attribute triplet loss; Stage 3: final fine-tuning on the combined dataset)

Fig. 4 Attribute classification accuracy (%) at the different training stages (Stage 1, Stage 1&2, Stage 1&3, SSDAL) on the VIPeR, PRID and GRID datasets


3.2 Unsupervised Cross-Dataset Person Re-ID via Soft Biometrics

Most recent person re-identification (Re-ID) models follow supervised learning frameworks which require a large number of labelled matching image pairs collected from video surveillance cameras. However, a real-world surveillance system usually consists of hundreds of cameras, and manually tracking and annotating persons across these cameras is extremely expensive and impractical. One way to solve this issue is to transfer a pre-trained model learned from existing available datasets to a new surveillance system. As unlabelled images can easily be obtained from the new CCTV system, this can be considered an unsupervised cross-dataset transfer learning problem. In recent years, some unsupervised methods have been proposed to extract view-invariant features and measure the similarity of images without label information [6, 20, 21, 24]. These approaches only analyse the unlabelled datasets and generally yield poor person Re-ID performance due to the lack of strong supervised tuning and optimisation. Another approach to solving the scalability issue of Re-ID is unsupervised transfer learning via a domain adaptation strategy. Unsupervised domain adaptation methods leverage labelled data in one or more related source datasets (also known as source domains) to learn models for unlabelled data in a target domain. Since the identity labels of different Re-ID datasets are non-overlapping, the soft biometric attributes become alternative shared domain knowledge between datasets. For example, the same set of attributes, such as gender, age group or the colour/texture of the outfit, can be used as universal attributes for pedestrians across different datasets.

3.2.1 Attribute Consistency Adaptation Method

One way to utilise the soft biometric attributes in cross-dataset adaptation is to create a distance metric that quantifies how well the model fits a given domain. Wang et al. [22] proposed the Transferable Joint Attribute-Identity Deep Learning (TJ-AIDL) method and introduced the Identity Inferred Attribute (IIA). By reducing the discrepancy between the identity inferred attributes and the actual soft biometric attributes, the pre-trained model can be adapted to a different dataset. The overall architecture of TJ-AIDL is depicted in Fig. 5. The TJ-AIDL method proposes two separate branches for simultaneously learning people's identity features and appearance attributes. For training the identity branch, the softmax cross entropy is utilised as the loss function, defined as

L_{id} = - \frac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \log\big( p_{id}(I_i^s, y_i^s) \big),   (5)


Fig. 5 Transferable joint attribute-identity deep learning architecture

where p_{id}(I_i^s, y_i^s) specifies the predicted probability on the ground-truth class y_i^s of the training image I_i^s, and n_{bs} denotes the batch size. For training the attribute branch, attribute recognition is considered a multi-label classification task; thus, the sigmoid cross entropy loss function is used:

L_{att} = - \frac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \sum_{j=1}^{m} \Big( a_{i,j} \log\big(p_{att}(I_i, j)\big) + (1 - a_{i,j}) \log\big(1 - p_{att}(I_i, j)\big) \Big),   (6)

where a_{i,j} and p_{att}(I_i, j) define the ground-truth label and the predicted classification probability of the j-th attribute class for the training image I_i. The deep feature extracted from the identity branch is then fed into a reconstruction auto-encoder (the IIA encoder-decoder) in order to transform it into a low-dimensional space (the IIA space) that is matchable with the attribute counterpart. The reconstruction loss L_{rec} is the mean square error (MSE) between the reconstructed output and the ground-truth input. The concise feature representation e_{IIA} extracted from the hidden layer is aligned and regularised with the prediction distribution from the attribute branch by the MSE loss:

L_{ID\text{-}transfer} = \| e_{IIA} - \tilde{p}_{att} \|^2.   (7)

For ease of alignment, an additional sigmoid cross entropy is applied to e_{IIA} with the pseudo attribute prediction from the attribute branch:

L_{attr,IIA} = \| e_{IIA} - \tilde{p}_{att} \|^2.   (8)

The overall loss for the IIA encoder-decoder is the weighted summation of L_{attr,IIA}, L_{rec} and L_{ID-transfer}. In order to transfer the identity knowledge to the


Fig. 6 Attribute consistency between two branches. (a) Attribute consistency in the source domain. (b) Attribute consistency in the target domain

attribute branch, the identity knowledge transfer loss L_{ID-transfer} is also added to the learning of the attribute classifier. Supervised learning of the TJ-AIDL model should achieve a small discrepancy between the Identity Inferred Attribute (IIA) representation and the actual attributes. However, the model trained on the source dataset may not exhibit the same attribute consistency on the target dataset, as illustrated in Fig. 6. By reducing the discrepancy between the two branches on the target dataset images, the source-trained TJ-AIDL model can be adapted to the target dataset in an unsupervised manner.
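A minimal sketch of this adaptation signal is shown below: on unlabelled target images, the identity-inferred-attribute embedding is pushed towards the attribute branch predictions. The branch and encoder interfaces are hypothetical placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def iia_adaptation_loss(id_branch, attr_branch, iia_encoder, target_images):
    """Unsupervised adaptation signal in the spirit of TJ-AIDL (a sketch only):
    on unlabelled target images, reduce the discrepancy between the
    identity-inferred-attribute embedding and the attribute predictions."""
    with torch.no_grad():
        p_att = torch.sigmoid(attr_branch(target_images))    # pseudo attribute scores
    e_iia = iia_encoder(id_branch(target_images))             # identity-inferred attributes
    return F.mse_loss(e_iia, p_att)
```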

3.2.2 MMD Based Feature Alignment Adaptation Method

Another way to utilise the soft biometric attributes for cross-dataset adaptation is to align the attribute feature distributions between the source and the target datasets. Lin et al. [11] proposed the Multi-task Mid-level Feature Alignment (MMFA) network, which learns feature representations from the source dataset while simultaneously aligning the attribute distributions to the target dataset based on the Maximum Mean Discrepancy (MMD). An overview of the MMFA architecture is illustrated in Fig. 7. The MMFA network is fundamentally a siamese neural network structure: the source and target images, as the two inputs, pass through two networks with shared weights. A global max-pooling layer extracts the most sensitive feature maps from the last convolutional layer of the ResNet-50 network, and the resulting feature vectors are forwarded to multiple independent fully connected layers for identity classification and attribute recognition.

Fig. 7 Multi-task Mid-level Feature Alignment (MMFA) architecture

Similar to TJ-AIDL, the MMFA model also utilises the softmax cross entropy for ID classification and the sigmoid cross entropy for attribute classification:

L_{id} = - \frac{1}{n_S} \sum_{i=1}^{n_S} \log\big( p_{id}(h_{S,i}^{id}, y_{S,i}) \big),   (9)

L_{attr} = - \frac{1}{M} \frac{1}{n_S} \sum_{m=1}^{M} \sum_{i=1}^{n_S} \Big( a_{S,i}^{m} \log\big(p_{attr}(h_{S,i}^{attr_m}, m)\big) + (1 - a_{S,i}^{m}) \log\big(1 - p_{attr}(h_{S,i}^{attr_m}, m)\big) \Big),   (10)

where p_{id}(h_{S,i}^{id}, y_{S,i}) is the predicted probability on the identity features h_{S,i}^{id} with the ground-truth label y_{S,i}, and p_{attr}(h_{S,i}^{attr_m}, m) is the predicted probability for the m-th attribute features h_{S,i}^{attr_m} with ground-truth label a_{S,i}^{m}. Since the IDs of pedestrians are not commonly shared between two datasets, the soft biometric attributes become a good alternative set of shared domain labels for cross-dataset Re-ID adaptation.


As each individual attribute has its own fully connected (FC) layer, the feature vectors obtained from these FC layers, {H_S^{attr_1}, ..., H_S^{attr_M}} and {H_T^{attr_1}, ..., H_T^{attr_M}}, can be considered as the features of the corresponding attributes. Therefore, by aligning the distribution of every attribute's features between the source and the target datasets, the model can be adapted to the target dataset in an unsupervised manner. The MMFA model utilises the Maximum Mean Discrepancy (MMD) measure [3] to calculate the feature distribution distance for each attribute. The final loss for the attribute distribution alignment is the mean MMD distance over all attributes:

L_{AAL} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{MMD}\big( H_S^{attr_m}, H_T^{attr_m} \big)^2.   (11)

However, the 30 and 23 attributes provided by the current Market-1501 and DukeMTMC-reID datasets are insufficient to represent all mid-level features of the data. Thus, MMFA also introduces a mid-level deep feature alignment loss L_{MDAL} to reduce the discrepancy between the outputs of the last global max-pooling layer (H_S, H_T):

L_{MDAL} = \mathrm{MMD}(H_S, H_T)^2.   (12)

RBF characteristic kernels with selected bandwidths \alpha = 1, 5, 10 are the kernel functions used in all MMD losses:

k\big( h_{S,i}^{attr_m}, h_{T,j}^{attr_m} \big) = \exp\Big( -\frac{1}{2\alpha} \big\| h_{S,i}^{attr_m} - h_{T,j}^{attr_m} \big\|^2 \Big).   (13)

The overall loss is the weighted summation of all the losses mentioned above:

L_{all} = L_{id} + \lambda_1 L_{attr} + \lambda_2 L_{AAL} + \lambda_3 L_{MDAL}.   (14)
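For reference, a biased estimate of the squared MMD between two sets of attribute features with the RBF kernel of Eq. (13) can be computed as in the sketch below; summing over the bandwidth bank {1, 5, 10} is one common convention and is an assumption here.

```python
import numpy as np

def rbf_kernel(x, y, alpha):
    """k(x, y) = exp(-||x - y||^2 / (2 * alpha)) for every pair of rows of x and y."""
    sq_dists = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2.0 * x @ y.T
    return np.exp(-sq_dists / (2.0 * alpha))

def mmd_squared(source_feats, target_feats, alphas=(1.0, 5.0, 10.0)):
    """Biased estimate of the squared Maximum Mean Discrepancy between two
    feature sets (rows are samples), summed over a small bank of bandwidths."""
    total = 0.0
    for a in alphas:
        k_ss = rbf_kernel(source_feats, source_feats, a).mean()
        k_tt = rbf_kernel(target_feats, target_feats, a).mean()
        k_st = rbf_kernel(source_feats, target_feats, a).mean()
        total += k_ss + k_tt - 2.0 * k_st
    return total
```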

4 Performance Comparison

The attribute recognition accuracy is a good measure of how well the attribute information is integrated into the person Re-ID system. Table 2 shows the mean attribute recognition accuracy for the ACRN and APR approaches. The ACRN model trained on the PETA dataset achieves a mean recognition accuracy of 84.61% over all 105 attributes. The APR model also accomplishes over 85% recognition rate with the 23/30 attributes. However, the SSDAL method did not train on the entire attribute set of the PETA dataset; its recognition accuracies at the different training stages can be found in Fig. 4.

Table 2 Mean attribute recognition accuracy (mA) for ACRN and APR

        PETA     Market-1501   DukeMTMC-reID
ACRN    84.61%   –             –
APR     –        88.16%        86.42%


Table 3 Performance comparisons with state-of-the-art unsupervised person Re-ID methods

                                     VIPeR    PRID     Market-1501        DukeMTMC-reID
Metric (%)                           Rank-1   Rank-1   Rank-1   mAP       Rank-1   mAP
Supervised without soft biometric
  GAN [27]                           –        –        79.3     56.0      67.7     47.13
  PIE [26]                           –        –        78.1     56.2      –        –
Supervised
  ACRN                               –        –        83.6     62.6      72.6     52.0
  APR                                –        –        84.3     64.7      70.7     51.9
Semi-supervised
  SSDAL                              37.9     20.1     39.4     19.6      –        –
Unsupervised
  TJ-AIDL^Duke                       35.1     34.8     58.2     26.5      –        –
  MMFA^Duke                          36.3     34.5     56.7     27.4      –        –
  TJ-AIDL^Market                     38.5     26.8     –        –         44.3     23.0
  MMFA^Market                        39.1     35.1     –        –         45.3     24.7

The superscripts Duke and Market indicate the source dataset on which the model is trained.

The unsupervised cross-dataset methods TJ-AIDL and MMFA did not report attribute recognition rates, due to the different sets of attribute labels used for the different datasets. Table 3 shows a detailed comparison of supervised methods with and without soft biometrics, semi-supervised attribute transfer and unsupervised cross-dataset transfer. By integrating the attribute information, the supervised methods with attributes usually outperform the deep supervised methods without them by 4–6%. The semi-supervised method is based on the transferred attribute features only; it gives a relatively weak Re-ID performance due to the lack of local feature information. The unsupervised cross-dataset person Re-ID methods such as TJ-AIDL and MMFA show promising performances compared with the fully supervised methods.

5 Existing Issues and Future Trends

The existing Market-1501 and DukeMTMC-reID datasets have a limited set of soft biometric attributes; the current 23/30 attributes cannot easily distinguish most of the people in these two datasets. To fully exploit soft biometric information for person Re-ID, a large-scale and richly annotated person Re-ID dataset is definitely needed. However, soft biometric attributes are expensive to annotate, especially for large-scale datasets. Unfortunately, recently released datasets such as MSMT17 [23] do not come with soft biometric annotations. The second problem in soft biometrics is the lack of a standardisation guideline for annotating the attributes. In Market-1501, there are eight upper body colours and nine lower body colours; in DukeMTMC-reID, the annotated colours for the upper body and lower body become 8 and 7.


The age group separation definitions also differ between the PETA dataset and Market-1501, and the clothing types are inconsistent between PETA, Market-1501 and DukeMTMC-reID. If all the datasets followed the same annotation guideline, soft biometric attributes could be used and evaluated in cross-dataset or multi-dataset person Re-ID scenarios. Soft biometrics should not be limited to boosting person Re-ID performance or being leveraged for cross-dataset adaptation: how to utilise the existing attributes and extend them to natural language description based person retrieval will be an interesting topic for future research.

6 Conclusion

Soft biometric attributes are useful auxiliary information for the person Re-ID task. In supervised deep person Re-ID models, soft biometric information can improve the person Re-ID performance by 2–4%. When transferring a Re-ID model from one camera system to another, soft biometric attributes can be used as shared domain knowledge for the domain adaptation task. In addition, attributes can also be transferred from a richly annotated dataset to an unlabelled Re-ID dataset via a semi-supervised deep co-training strategy. The performance of cross-dataset Re-ID model adaptation and attribute transfer is still far from optimal; there is ample space for improvement and massive potential in these applications. Currently, there are only a few person Re-ID datasets with soft biometric information, and the soft biometric attributes are very limited and inconsistent across different datasets. How to effectively obtain soft biometrics and establish an annotation guideline will be meaningful research for the person Re-ID problem.

References

1. Y. Deng, P. Luo, C.C. Loy, X. Tang, Pedestrian attribute recognition at far distance, in ACM International Conference on Multimedia (ACM MM) (2014)
2. D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, in International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), vol. 3 (2007), pp. 41–47
3. A. Gretton, K. Fukumizu, Z. Harchaoui, B.K. Sriperumbudur, A fast, consistent kernel two-sample test, in Advances in Neural Information Processing Systems (NIPS) (2009)
4. M. Hirzer, C. Beleznai, P.M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in Scandinavian Conference on Image Analysis (SCIA) (2011)
5. S. Khamis, C.H. Kuo, V.K. Singh, V.D. Shet, L.S. Davis, Joint learning for attribute-consistent person re-identification, in European Conference on Computer Vision Workshops (ECCVW) (2014)
6. E. Kodirov, T. Xiang, S. Gong, Dictionary learning with iterative Laplacian regularisation for unsupervised person re-identification, in British Machine Vision Conference (BMVC) (2015)
7. R. Layne, T.M. Hospedales, S. Gong, Person re-identification by attributes, in British Machine Vision Conference (BMVC). British Machine Vision Association (2012)


8. R. Layne, T.M. Hospedales, S. Gong, Attributes-based re-identification, in Person Reidentification (Springer, London, 2014), pp. 93–117 9. R. Layne, T.M. Hospedales, S. Gong, Re-id: hunting attributes in the wild, in British Machine Vision Conference (BMVC). British Machine Vision Association (2014), pp. 1–1 10. Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Y. Yang, Improving person re-identification by attribute and identity learning (2017), arXiv preprint 11. S. Lin, H. Li, C.t. Li, A.C. Kot, M.l.F. Alignment, Unsupervised, F.O.R.: Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification, in British Machine Vision Conference (BMVC) (2018) 12. C.C. Loy, T. Xiang, S. Gong, C. Change, L. Tao, X. Shaogang, C.C. Loy, T. Xiang, S. Gong, Time-delayed correlation analysis for multi-camera activity understanding. Int. J. Comput. Vis. 90, 106–129 (2010) 13. T. Matsukawa, E. Suzuki, Person re-identification using CNN features learned from combination of attributes, in International Conference on Pattern Recognition (ICPR) (2016) 14. D.A. Reid, S. Samangooei, C. Hen, M.S. Nixon, A. Ross, Soft biometrics for surveillance: an overview, in Handbook of Statistics (Elsevier, Oxford, 2013) 15. A. Schumann, R. Stiefelhagen, Person re-identification by deep learning attributecomplementary information, in Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017) 16. Z. Shi, T.M. Hospedales, T. Xiang, Transferring a semantic representation for person reidentification and search, in Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, Piscataway, 2015) 17. C. Su, F. Yang, S. Zhang, Q. Tian, Multi-task learning with low rank attribute embedding for person re-identification, in International Conference on Computer Vision (ICCV) (2015) 18. C. Su, S. Zhang, J. Xing, W. Gao, Q. Tian, Deep attributes driven multi-camera person reidentification, in European Conference on Computer Vision (ECCV) (2016) 19. C. Su, S. Zhang, J. Xing, W. Gao, Q. Tian, Multi-type attributes driven multi-camera person re-identification. Pattern Recogn. 75, 77–89 (2018) 20. H. Wang, S. Gong, T. Xiang, Unsupervised learning of generative topic saliency for person re-identification, in British Machine Vision Conference (BMVC) (2014) 21. H. Wang, X. Zhu, T. Xiang, S. Gong, Towards unsupervised open-set person re-identification, in International Conference on Image Processing (ICIP) (2016) 22. J. Wang, X. Zhu, S. Gong, W. Li, Transferable joint attribute-identity deep learning for unsupervised person re-identification, in Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 23. L. Wei, S. Zhang, W. Gao, Q. Tian, Person transfer GAN to bridge domain gap for person re-identification, in Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 24. H.X.X. Yu, A. Wu, W.S.S. Zheng, Cross-view asymmetric metric learning for unsupervised person re-identification, in International Conference on Computer Vision (ICCV) (2017) 25. L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: a benchmark, in International Conference on Computer Vision (ICCV) (2015) 26. L. Zheng, Y. Huang, H. Lu, Y. Yang, Pose invariant embedding for deep person re-identification (2017), arXiv preprint 27. Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by GAN improve the person reidentification baseline in vitro, in International Conference on Computer Vision (ICCV) (2017)

Atypical Facial Landmark Localisation with Stacked Hourglass Networks: A Study on 3D Facial Modelling for Medical Diagnosis

Gary Storey, Ahmed Bouridane, Richard Jiang, and Chang-Tsun Li

1 Introduction

The task of landmark localisation is well established within the domain of computer vision and widely applied within a variety of biometric systems. Biometric systems for person identification commonly apply facial [1–7], ear [8] and hand [9] landmark localisation; Fig. 1 shows examples of these landmark localisation variations. The landmark localisation task can be described as predicting n fiducial landmarks for a given target image. The human face is one common target for landmark localisation, where semantically meaningful facial landmarks such as the eyes, nose, mouth and jaw line are predicted. The purpose of the landmark localisation task within a biometric system pipeline is to aid the feature extraction process from which identification can be predicted. Generally, two types of features are extracted, geometry-based and texture features. Geometry-based features use the landmark locations directly as features, for example ratio distances between the landmarks [1]. Texture features instead use the predicted landmarks as local guides for feature extraction from specific facial locations. It is key that the landmark localisation performed is accurate in order to reduce poor feature extraction and therefore potential system errors.
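As a small illustration of the geometry-based features mentioned above, ratios of inter-landmark distances can be computed as in the following sketch; the landmark indices and chosen pairs are arbitrary examples, not those used in [1].

```python
import numpy as np

def ratio_distance_features(landmarks, pairs):
    """Compute ratios of inter-landmark distances as simple geometric features.

    landmarks: (n, 2) array of (x, y) landmark coordinates.
    pairs: list of ((i, j), (k, l)) index tuples; each entry yields the ratio
           dist(i, j) / dist(k, l), which is invariant to uniform scaling.
    """
    def dist(a, b):
        return np.linalg.norm(landmarks[a] - landmarks[b])

    return np.array([dist(i, j) / dist(k, l) for (i, j), (k, l) in pairs])

# Hypothetical usage: eye-to-eye distance over nose-to-chin distance.
# features = ratio_distance_features(pts, [((36, 45), (30, 8))])
```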

G. Storey · A. Bouridane Computer and Information Sciences, Northumbria University, Newcastle upon Tyne, UK e-mail: [email protected]; [email protected] R. Jiang () Computing and Communication, Lancaster University, Lancaster, UK e-mail: [email protected] C.-T. Li School of Information Technology, Deakin University, Waurn Ponds, VIC, Australia e-mail: [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_3


Fig. 1 Landmark localisation application examples: (Left)—face, (Centre)—palm, (Right)—ear

Fig. 2 Asymmetrical face examples

The main focus of this chapter is facial landmark localisation, which has a long history of research and is also referred to as face alignment. Research to date can be broadly divided into three categories. Holistic approaches such as Active Appearance Models (AAMs) [10, 11] solve the face alignment problem by jointly modelling appearance and shape. Local expert based methods such as Constrained Local Models (CLMs) [12] learn a set of local expert detectors or regressors [13, 14] and apply shape models to constrain them. The most recent advancements, which have attained state-of-the-art results, apply CNN based architectures with probabilistic landmark locations in the form of heat maps [15]. While these advancements have increased the accuracy and reduced the computational time of the landmark localisation process, challenges still exist. One specific challenge is that of asymmetrical faces [16]. While the majority of the population have typical face structures with small degrees of asymmetry, as shown in Fig. 2, there exists a section of the population who, for a variety of reasons including illness and injury, display atypical facial structure, including a large degree of asymmetry. To enable biometric systems that are universally accessible and do not discriminate against those with atypical face structures due to poor feature extraction, it is important to ascertain the accuracy of landmark localisation methods on this type of facial structure, especially as the public training sets do not contain specific samples of this demographic.


In this chapter a study is presented which evaluates the accuracy of a number of landmark localisation methods on two data sets containing atypical faces. A specific focus on the state-of-the-art stacked hourglass architecture is also documented. The remaining sections of this chapter are structured as follows. First, a brief history of landmark localisation methods is presented in Sect. 2. Section 3 provides a detailed overview of the stacked hourglass architecture in general and of the Face Alignment Network (FAN) method [17] applied specifically for facial landmark localisation. The evaluation is presented in Sect. 4, which highlights the accuracy of each method on the data sets. Finally, Sect. 5 concludes this chapter and explores future areas of research.

2 Landmark Localisation History

In this section a brief description of historically important landmark localisation methods is presented. The first subsection details non-deep learning based methods which up until recent years were considered state of the art, while the second subsection concentrates on the deep learning based methods from recent literature.

2.1 Traditional Methods

Within the traditional methods, the Active Shape Model (ASM) developed by [18] provided one of the first great breakthroughs applicable to landmark localisation; the same authors followed up this work with an alternative method, namely AAMs [10]. Both methods, while not specifically designed for face landmark localisation, leverage the idea of defining statistically developed deformable models. There are similarities and distinct differences between the methods: while both use a statistically generated model consisting of texture and shape components learnt from a training data set, the texture component and how it is applied in the landmark fitting process are distinct to each method. The shape model is composed through the alignment of the training shapes using a variation of the Procrustes method, which scales, rotates and translates the training shapes so that they are aligned as closely as possible. Principal Component Analysis (PCA) is then carried out, reducing the dimensionality of the features while retaining the variance in the shape data. A mean shape is also generated, which is often used as a starting point for fitting to new images. The ASM is considered to be in the CLM group of methods; these model types use the texture model as local experts trained on texture information taken from a small area around each landmark. The local expert in ASM uses a small set of grey-scale pixel values perpendicular to each landmark, while other CLM techniques use a block of pixels around the landmarks or other feature descriptors such as SIFT [12, 19]. The fitting of the model is carried out via the optimisation of an objective function using the prior shape and the sum of the local experts to guide the alignment process.


AAM differs from the CLM group of methods by using a texture model of the entire face rather than local regions. To create this, all face textures from the training images are warped to a mean shape, transformed to grey scale and normalised to reduce global lighting effects. PCA is then applied to create the texture features. Alignment on an unseen image is carried out by minimising the difference between the textures of the model and the unseen image [10]. Further advancements in accurate and computationally efficient landmark localisation arrived with the application of regression based fitting methods rather than sliding-window based approaches. Regressors also provide detailed information regarding the local texture prediction criteria when compared with the classifier approach, which is a binary prediction of match or no match. [14] proposed a method named Boosted Regression coupled with Markov Networks, in which Support Vector Regression and local appearance based features are applied to predict 22 initial facial landmarks in an iterative manner; Markov Networks are then used to sample new facial locations to which the regressor is applied in the next iteration. Cascaded regression was then applied by [20, 21], in which a cascade of weak regressors progressively reduces the alignment error while providing computationally efficient regression. Different feature types have been applied; for example, [22] produced a face alignment method based upon multi-level regression using ferns and boosting. This was subsequently built upon in [23], where a regression based technique named Robust Cascaded Pose Regression was proposed, which can differentiate between landmarks that are visible and non-visible (occluded) and estimate those facial landmarks that may be covered by another object such as hair or a hand. [13] also applied a regression technique with local binary features and random forests to produce a method that is both accurate and computationally inexpensive, meaning that the algorithm can perform at 3000 fps on a desktop PC and up to 300 fps on mobile devices. The previous methods predicted facial landmarks on faces in limited poses, at most between ±60°. Both the Tree Shape Model (TSM) [24] and PIFA [25] are notable methods which could handle a greater range of face poses. The TSM [24] was unique amongst landmark localisation methods in that it did not use regression or iterative methods for determining landmark positions; instead it used HOG parts to determine locations based upon appearance, and the configuration of all parts was scored to determine the best fit for a face. The final X and Y coordinates of the predicted landmarks are derived from the centre of the bounding box for each specific part detection. [25] proposed PIFA as a significant improvement in dealing with all face poses and determining the visibility of a landmark across poses for up to 21 facial landmarks. This method extended 2D cascaded landmark localisation through the training of two regressors at each layer of the cascade. The first regressor predicts the update for the camera projection matrix, which maps to the pose angle of the face. The second is responsible for updating the 3D shape parameters which determine the 3D landmark positions. Using 3D surface normals, visibility estimates are made based upon the z coordinate; finally, the 3D landmarks are projected onto the 2D plane.
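The shape-model construction described above, Procrustes alignment of the training shapes followed by PCA, can be sketched as below. This is a simplified similarity alignment to a running mean shape rather than a full generalised Procrustes analysis, and the number of retained components is an arbitrary choice.

```python
import numpy as np

def align_to(shape, ref):
    """Similarity-align one (n, 2) shape to a reference shape (Procrustes)."""
    a = shape - shape.mean(0)
    b = ref - ref.mean(0)
    u, _, vt = np.linalg.svd(a.T @ b)
    r = u @ vt                                       # optimal rotation (up to reflection)
    s = np.trace((a @ r).T @ b) / np.trace(a.T @ a)  # optimal scale
    return s * a @ r + ref.mean(0)

def build_shape_model(shapes, n_components=10, iters=5):
    """Align training shapes to a mean shape, then apply PCA to the aligned set."""
    mean = shapes[0]
    for _ in range(iters):
        aligned = np.stack([align_to(s, mean) for s in shapes])
        mean = aligned.mean(0)
    flat = aligned.reshape(len(shapes), -1)
    centred = flat - flat.mean(0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return mean, vt[:n_components]                   # mean shape + PCA shape basis
```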


2.2 Deep Learning Methods The initial deep learning, Convolutional Neural Network (CNN) based landmark localisation methods, while displaying high accuracy, were limited to a very small set of sparse landmarks compared with the earlier traditional methods. A Deep Convolutional Network Cascade was proposed in [26]; this consisted of a three-stage refinement process in which, at each level of the cascade, multiple CNNs were applied to predict the locations of individual landmarks and subsets of landmarks. This method only considered five landmarks, and expanding it to further landmarks is computationally expensive because an individual CNN is used to predict each landmark. [27] applied multi-task learning to landmark localisation, training a single CNN not only on facial landmark locations but also on gender, smile, glasses and pose information. Linear and logistic regression were used to predict the values for each task from shared CNN features. When compared directly with the Deep Convolutional Network Cascade [26], they showed increased landmark accuracy with the significant computational advantage of using a single CNN. A Backbone-Branches Architecture was applied in [28], which outperformed the previous methods in terms of both accuracy and speed for five facial landmarks. This model consists of multiple CNNs: a main backbone network generates low-resolution response maps that identify approximate landmark locations, and branch networks then produce fine response maps over local regions for more accurate landmark localisation. The next generation of deep learning methods expanded on these initial methods, increasing the number of detected landmarks to the commonly used 68. HyperFace applies a multi-task approach which also considers face detection; the idea is that inter-related tasks can strengthen feature learning and reduce over-fitting to a single objective. HyperFace used a single CNN, originally AlexNet, modified by taking features from layers 1, 3 and 5, concatenating them into a single feature set, then passing this through a further convolutional layer prior to the fully connected layers for each task. At the same time, the fully-convolutional network (FCN) [28] emerged as a technique in which, rather than applying regression methods to predict landmark coordinates, predictions are based upon response maps with spatial equivalence to the raw input image. Convolutional and de-convolutional networks are used to generate a response map for each facial landmark; further localisation refinement applying regression was then used in [29–31]. The stacked hourglass model proposed in [32] for human pose estimation, which applies repeated bottom-up then top-down processing with intermediate supervision, has been applied to landmark localisation in a method called the Face Alignment Network (FAN) [15]; this has shown state-of-the-art performance on a number of evaluation data sets. Furthermore, this method expanded detection from 2D to 3D landmarks through the addition of a depth prediction CNN which takes a set of predicted 2D landmarks and generates their depth. At the time of publication, the FAN outperformed previous methods for accurate landmark localisation.


3 Stacked Hourglass Architecture In this section a detailed overview of the stacked hourglass architecture [32] is given. This architecture has proven to be extremely accurate for landmark localisation tasks, both in human pose estimation, where landmarks include the head, knees, feet and hands, and in facial landmark localisation [17, 32]. It also has the potential to generalise well to other types of landmark localisation.

3.1 Hourglass Design The importance of capturing information at every scale across an image was the primary motivation for the design of the hourglass network in [32]. It was originally designed for human pose estimation, where the key components of the human body, such as the head, hands and elbows, are best identified at different scales. The hourglass design provides the capability to capture these features across different scales and to bring them together in the output as pixel-wise predictions. The name hourglass comes from the appearance of the network's downsampling and upsampling layers, shown in Fig. 3. Given an input image, the network initially consists of downsampling convolutional and max pooling layers which compute features down to a very low resolution. During this downsampling, the network branches off prior to each max pooling step and further convolutions are applied to the pre-pooled branches; these are then fed back into the network during upsampling. The purpose of the branching is to capture intermediate features across scales: without these branches the network would behave in the manner shown previously in Fig. 2, where the initial layers learn general features and deeper layers learn more task-specific information, rather than learning features at each scale. Following the lowest level of convolution, the network then upsamples back to the original resolution through nearest neighbour upsampling and element-wise addition of the previously branched features. Each of the cuboids in Fig. 3 is a residual module, also known as a bottleneck block, as shown in Fig. 4. These blocks are the same as those used within the ResNet architectures.
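As an illustration, the following is a minimal PyTorch sketch of a single hourglass module built from bottleneck blocks; the channel counts, recursion depth and block details are simplifying assumptions rather than the exact configuration of [32].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Residual 'bottleneck' block used inside the hourglass (cf. Fig. 4, left)."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        self.conv = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.conv(x)              # identity skip connection

class Hourglass(nn.Module):
    """Recursive hourglass: branch, downsample, recurse, upsample, add."""
    def __init__(self, depth, channels):
        super().__init__()
        self.branch = Bottleneck(channels)   # features kept at this scale
        self.down = Bottleneck(channels)     # applied after max pooling
        self.inner = Hourglass(depth - 1, channels) if depth > 1 else Bottleneck(channels)
        self.up = Bottleneck(channels)       # applied before upsampling

    def forward(self, x):
        branch = self.branch(x)
        low = F.max_pool2d(x, 2)             # downsample by a factor of 2
        low = self.up(self.inner(self.down(low)))
        up = F.interpolate(low, scale_factor=2, mode="nearest")  # nearest neighbour upsampling
        return up + branch                   # element-wise addition of the branched features

# Example: a 4-level hourglass applied to a 256-channel, 64 x 64 feature map.
hg = Hourglass(depth=4, channels=256)
out = hg(torch.randn(1, 256, 64, 64))        # out.shape == (1, 256, 64, 64)
```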

3.2 Stacked Hourglass with Intermediate Supervision The final architecture proposed by [32] took the hourglass design and stacked n hourglasses in an end-to-end fashion, the best performing configuration for human pose estimation being n = 8. Each of these hourglasses has its own independent weight parameters. The purpose of this stacking is to provide a mechanism in which predictions can be evaluated at multiple stages within the overall network.


Fig. 3 Hourglass design: downsampling using convolutional layers followed by upsampling using nearest neighbour interpolation

Fig. 4 Block design: (Left) The basic bottleneck block. (Right) The hierarchical, parallel and multi-scale block of FAN

A key technique in this stacked design is intermediate supervision, in which a heat-map output is generated at the end of each individual hourglass and a Mean Square Error (MSE) loss function is applied to it. This is similar to the iterative processes found in other landmark localisation methods, where each hourglass further refines the features, and therefore the predictions, as they move through the network. Following the intermediate supervision, the heat-map, the intermediate features from the current hourglass and the features from the previous hourglass are added together; to do this, a 1 × 1 convolutional layer is applied to remap the heat-map back into feature space.
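A minimal sketch of this stacking and intermediate supervision, reusing the Hourglass module sketched above; the exact way features, heat-maps and the previous input are merged is a simplification of the published design.

```python
import torch.nn as nn   # reuses the Hourglass module defined in the previous sketch

class StackedHourglass(nn.Module):
    """n hourglasses with an intermediate heat-map head after each one."""
    def __init__(self, n_stacks=4, channels=256, n_landmarks=68):
        super().__init__()
        self.hourglasses = nn.ModuleList(
            [Hourglass(depth=4, channels=channels) for _ in range(n_stacks)])
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, n_landmarks, kernel_size=1) for _ in range(n_stacks)])
        # 1 x 1 convolutions that remap each heat-map back into feature space
        self.remaps = nn.ModuleList(
            [nn.Conv2d(n_landmarks, channels, kernel_size=1) for _ in range(n_stacks - 1)])

    def forward(self, x):
        heatmaps = []                          # one heat-map tensor per hourglass
        for i, (hg, head) in enumerate(zip(self.hourglasses, self.heads)):
            features = hg(x)
            hm = head(features)                # intermediate heat-map prediction
            heatmaps.append(hm)
            if i < len(self.remaps):           # merge for the next hourglass
                x = x + features + self.remaps[i](hm)
        return heatmaps
```

During training, an MSE loss can be applied to every element of the returned list, so that each hourglass receives its own supervision signal.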


3.3 Facial Alignment Network The FAN takes the stacked-hourglass design and trains it for the task of facial landmark localisation. Landmark localisation presents similar challenges to human pose estimation, in that the face landmarks are represented at different local scales within the global context of the face. Architectural changes are made to the network design: the FAN reduces the total number of stacked hourglasses from 8 to 4. The structure of the convolutional blocks is also changed from bottlenecks to a hierarchical, parallel and multi-scale block, which performs three levels of parallel convolution alongside batch normalisation before outputting the concatenated feature map (Fig. 4). It was shown in [17] that, for an equal total number of parameters, this block type outperforms the bottleneck design. The parameters of the 1 × 1 convolutional layers are changed to output heat-maps of dimension H × W × m, where H and W are the height and width of the input volume and m = 68 is the total number of facial landmarks predicted. Training of the FAN was completed using a synthetically expanded version of 300-W [33] named 300-W-LP [34], while the original 300-W was also used to fine-tune the network. Data augmentation was applied during training, employing random flipping, rotation, colour jittering, scale noise and random occlusion. Training used a learning rate of 10−4 with a mini-batch size of 10; at 15-epoch intervals the learning rate was reduced to 10−5 and then to 10−6, and a total of 40 epochs were used to fully train the network. The MSE loss function is used to train the network:

    MSE = \frac{1}{n} \sum_{i=1}^{n} \left\| Y_i - \hat{Y}_i \right\|^2        (1)

where Y_i is the predicted heat-map for the ith landmark and \hat{Y}_i is the ground truth heat-map, consisting of a 2D Gaussian centred on the location of the ith landmark.
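A hedged PyTorch sketch of this loss; the heat-map resolution and the Gaussian width are illustrative assumptions.

```python
import torch

def gaussian_heatmap(height, width, centre, sigma=1.0):
    """Ground-truth heat-map: a 2D Gaussian centred on the landmark location (cy, cx)."""
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    cy, cx = centre
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def heatmap_mse(predicted, target):
    """Eq. (1): mean squared error between predicted and ground-truth heat-maps."""
    return ((predicted - target) ** 2).mean()

# With intermediate supervision, the loss accumulates over every hourglass output:
# loss = sum(heatmap_mse(hm, target_heatmaps) for hm in model(images))
```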

3.4 Depth Network for 3D Landmarks A further extension to the FAN method (Fig. 5) is the capability to extend the 2D facial landmarks to 3D; this is achieved through a second network. This second network takes as input the predicted heat-maps from the 2D landmark localisation together with the face image; the heat-maps guide the network's focus towards the areas of the image from which depth should be predicted. This network is not hourglass based but an adapted ResNet-152, whose input has 3 + N channels, where 3 corresponds to the RGB channels of the image and N = 68 to the heat-maps. The output of the network is N × 1. Training applied 50 epochs using similar data augmentation to the 2D model training, with a learning rate of 10−3 and an L2 loss function.


Fig. 5 Facial alignment network architecture overview
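A brief sketch of how the input to this depth network can be assembled (channel-wise concatenation of the RGB crop and the 68 heat-maps); the spatial resolution and the way the ResNet-152 backbone is adapted here are assumptions, not the published configuration.

```python
import torch
import torchvision

rgb = torch.rand(1, 3, 256, 256)          # face image (resolution is an assumption)
heatmaps = torch.rand(1, 68, 256, 256)    # predicted 2D landmark heat-maps
depth_input = torch.cat([rgb, heatmaps], dim=1)   # shape (1, 3 + 68, 256, 256)

# Adapted ResNet-152: widen the first convolution to 71 input channels and
# replace the classification head with one depth value per landmark (N x 1).
backbone = torchvision.models.resnet152(weights=None)
backbone.conv1 = torch.nn.Conv2d(71, 64, kernel_size=7, stride=2, padding=3, bias=False)
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 68)
depths = backbone(depth_input)            # shape (1, 68): one depth per landmark
```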

4 Evaluation This section presents an evaluation of landmark localisation methods on atypical faces. The evaluation was conducted using PyTorch 0.4 on Windows 10 with an Nvidia GTX 1080 GPU. A key foundation for many end-to-end automated diagnostic pipelines is precise facial landmark localisation; it is common practice to use the detected facial landmarks directly as geometric features or as indicators of areas of interest from which feature extraction can occur. Previous research [16] has highlighted that a number of methods that achieve state-of-the-art accuracy on symmetrical faces do not display the same level of accuracy when the face is asymmetric, as for individuals diagnosed with facial palsy. In this study we expand the previous research to a larger sample size, while also investigating the impact of new deep learning methods in comparison with previous landmark localisation methods. The methods evaluated, in order of publication, are the Tree Shape Model (TSM) [24], the DRMF [35] and the deep learning based Face Alignment Network (FAN) [15]. The evaluation of landmark localisation accuracy uses two separate data sets, both containing images of individuals with varying grades of facial palsy. Data set A consists of 47 facial images with 12 ground truth landmarks each; data set B consists of a further 40 images annotated with 18 ground truth landmarks per image. Normalised Mean Error (NME) with face size normalisation, as described in [15], is used as the evaluation metric. Different landmark localisation methods vary in both the number and the specific locations of the predicted landmarks; a subset of facial landmarks common across all methods is therefore used, allowing a comparative analysis. The cumulative localisation NME errors for data sets A and B are shown in Figs. 6 and 7, respectively. The results show that the deep learning based FAN displays a consistently higher level of accuracy across both data sets. DRMF performs accurate landmark prediction for certain test samples, but specifically in test set B, where there is a high degree of facial asymmetry, there is a percentage of the sample for which the error increases by a substantial amount. Finally, TSM performs poorly in general, and this error grows substantially as the level of facial asymmetry increases.
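A small sketch of the NME metric used here; the face-size normalisation term is approximated as the square root of the ground-truth bounding-box area, which is an assumption rather than the exact protocol of [15].

```python
import numpy as np

def nme(pred, gt):
    """pred, gt: (m, 2) arrays of landmark coordinates for one face."""
    face_size = np.sqrt(np.prod(gt.max(axis=0) - gt.min(axis=0)))  # bounding-box proxy
    per_landmark = np.linalg.norm(pred - gt, axis=1)               # Euclidean error per point
    return per_landmark.mean() / face_size
```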


Fig. 6 Cumulative localisation error distribution from Facial Palsy test set A

Fig. 7 Cumulative localisation error distribution from Facial Palsy test set B


Analysing the per-landmark NME error for a specific selection of landmarks, as shown in Fig. 8, the results show that while FAN and DRMF have a similar level of accuracy for the eye and nose landmarks, the mouth, which has the largest range of asymmetrical deformation, is where the deep learning based FAN excels. Figure 9 provides a visual example of the landmark localisation output; this highlights the capability of the FAN to fit landmarks to the face, and specifically to the mouth region, with a high level of accuracy compared with previous techniques.

Fig. 8 Normalised mean error per landmark: (Top)—Facial Palsy test set A, (Bottom)—Facial Palsy test set B, (a)—TSM 99 part shared, (b)—DRMF, (c)—FAN

Fig. 9 Landmark localisation fitting example for each evaluated method. (Left)—FAN, (Centre)— DRMF, (Right)—TSM


5 Conclusion The focus of this chapter was to study how accurately current landmark localisation methods predict landmarks on atypical faces. It was found that, of the methods evaluated, only the state-of-the-art FAN could accurately predict facial landmarks, especially the difficult mouth landmarks, which show a higher degree of atypical appearance. The stacked hourglass architecture and its derivative, the FAN, prove to be high-performing methods for landmark localisation, with the potential to be applied to other landmark localisation tasks such as the ear and the hand.

References 1. A. Juhong, C. Pintavirooj, Face recognition based on facial landmark detection, in 2017 10th Biomedical Engineering International Conference (BMEiCON), (IEEE, Piscataway, 2017), pp. 1–4 2. R. Jiang et al., Emotion recognition from scrambled facial images via many graph embedding. Pattern Recogn. 67, 245–251 (2017) 3. R. Jiang et al., Face recognition in the scrambled domain via salience-aware ensembles of many kernels. IEEE Trans. Inf. Forensics Secur. 11(8), 1807–1817 (2016) 4. R. Jiang et al., Privacy-protected facial biometric verification via fuzzy forest learning. IEEE Trans. Fuzzy Syst. 24(4), 779–790 (2016) 5. R. Jiang et al., Multimodal biometric human recognition for perceptual human–computer interaction. IEEE Trans. Syst. Man Cybern. Part C 40(5), 676 (2010) 6. R. Jiang et al., Face recognition in global harmonic subspace. IEEE Trans. Inf. Forensics Secur. 5(3), 416–424 (2010) 7. G. Storey et al., 3DPalsyNet: a facial palsy grading and motion recognition framework using fully 3D convolutional neural networks, in IEEE access, (IEEE, Piscataway, 2019) 8. Y. Zhou, S. Zaferiou, Deformable models of ears in-the-wild for alignment and recognition, in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), (IEEE, Piscataway, 2017), pp. 626–633 9. E. Yörük, H. Duta˘gaci, B.Ã. Sankur, Hand biometrics. Image Vis. Comput. 24(5), 483–497 (2006) 10. T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 681–685 (2001) 11. R. Gross, I. Matthews, S. Baker, Active appearance models with occlusion. Image Vis. Comput. 24(6), 593–604 (2006) 12. D. Cristinacce, T. Cootes, Automatic feature localisation with constrained local models. Pattern Recogn. 41(10), 3054–3067 (2008) 13. S. Ren, X. Cao, Y. Wei, J. Sun, Face alignment at 3000 FPS via regressing local binary features, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, Piscataway, 2014), pp. 1685–1692 14. M. Valstar, B. Martinez, X. Binefa, M. Pantic, Facial point detection using boosted regression and graph models, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (IEEE, Piscataway, 2010), pp. 2729–2736 15. A. Bulat, G. Tzimiropoulos, How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks), in 2017 IEEE International Conference on Computer Vision (ICCV), (IEEE, Piscataway, 2017), pp. 1021–1030 16. G. Storey, R. Jiang, A. Bouridane, Role for 2D image generated 3D face models in the rehabilitation of facial palsy. Healthc. Technol. Lett. 4(4), 145–148 (2017)


17. A. Bulat, G. Tzimiropoulos, Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources, in 2017 IEEE International Conference on Computer Vision (ICCV), vol. 3, (IEEE, Piscataway, 2017), pp. 3726–3734 18. T. Cootes, B. Er, J. Graham, An introduction to active shape models, in Image Processing and Analysis, (Cengage Learning, Boston, 2000), pp. 223–248 19. S. Milborrow, F. Nicolls, Active shape models with SIFT descriptors and MARS, in Proceedings of the 9th International Conference on Computer Vision Theory and Applications, (SciTePress, Setúbal, 2014), pp. 380–387 20. P. Dollár, P. Welinder, P. Perona, Cascaded pose regression, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (IEEE, Piscataway, 2010), pp. 1078–1085 21. S. Zhu, C. Li, C.C. Loy, X. Tang, Face alignment by coarse-to-fine shape searching, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 07–12-June, (IEEE, Piscataway, 2015), pp. 4998–5006 22. X. Cao, Y. Wei, F. Wen, J. Sun, Face alignment by explicit shape regression. Int. J. Comput. Vis. 107(2), 177–190 (2014) 23. X.P. Burgos-Artizzu, P. Perona, P. Dollar, Robust face landmark estimation under occlusion, in 2013 IEEE International Conference on Computer Vision, (IEEE, Piscataway, 2013), pp. 1513–1520 24. D. Ramanan, X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in 2012 IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, Piscataway, 2012), pp. 2879–2886 25. A. Jourabloo, X. Liu, Pose-invariant face alignment via CNN-based dense 3D model fitting. Int. J. Comput. Vis. 124(2), 187–203 (2017) 26. Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point detection, in 2013 IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, Piscataway, 2013), pp. 3476–3483 27. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning, in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8694, (LNCS, Berlin, 2014), pp. 94–108 28. Z. Liang, S. Ding, L. Liang, Unconstrained facial landmark localization with backbone-branches fully-convolutional networks. arXiv:1507.03409 [cs]. 1, (2015) 29. A. Bulat, Y. Tzimiropoulos, Convolutional aggregation of local evidence for large pose face alignment, in Proceedings of the British Machine Vision Conference 2016, (British Machine Vision Association, Durham, 2016), pp. 1–86 30. H. Lai, S. Xiao, Y. Pan, Z. Cui, J. Feng, C. Xu, J. Yin, S. Yan, Deep recurrent regression for facial landmark detection. IEEE Trans. Circuits Syst. Video Technol. 28(5), 1144–1157 (2015) 31. S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, A. Kassim, Robust Facial Landmark Detection Via Recurrent Attentive-Refinement Networks (Springer, Cham, 2016), pp. 57–72 32. A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, in Computer Vision – ECCV 2016, (Springer, Cham, 2016), pp. 483–499 33. C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, M. Pantic, 300 faces in-the-wild challenge: the first facial landmark localization challenge, in 2013 IEEE International Conference on Computer Vision Workshops, (IEEE, Piscataway, 2013), pp. 397–403 34. X. Zhu, Z. Lei, X. Liu, H. Shi, S.Z. Li, Face alignment across large poses: A 3D solution, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, Piscataway, 2015) 35. A. Asthana, S. Zafeiriou, S. Cheng, M. Pantic, Robust discriminative response map fitting with constrained local models, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (IEEE, Piscataway, 2013), pp. 3444–3451

Anti-Spoofing in Face Recognition: Deep Learning and Image Quality Assessment-Based Approaches Wael Elloumi, Aladine Chetouani, Tarek Ben Charrada, and Emna Fourati

1 Introduction Facial recognition is a biometric modality that is increasingly used for both identification and authentication purposes. Thanks to its convenience in today's digital age, this technology is more and more integrated into the daily lives of consumers. Moreover, since it only requires a front-facing camera, it has paved the way for selfie-based authentication on smartphones, leading to its adoption in many critical applications such as mobile banking, payment apps, and border control. However, the overall security of face recognition-based systems remains the major concern of users. The detection of spoofing attacks, also called direct attacks or presentation attack detection (PAD), remains a challenging task that has gained the attention of the research community. Spoofing attacks consist of compromising the biometric sensor by presenting fake biometric data of a valid user; the main face spoofing attacks are perpetrated using photographs or videos of the valid user. Furthermore, forged 3D masks can also be reproduced easily and affordably, making it even more challenging to combat such a variety of spoofing techniques. In this chapter, we propose two fast and non-intrusive PAD solutions for face recognition systems: the first is based on Image Quality Assessment (IQA) and motion cues, while the second proposes a multi-input architecture combining a pre-trained CNN model and the local binary patterns (LBP) descriptor. Our methods are well suited to real-time mobile applications as they take into consideration both reliable robustness and low complexity.

W. Elloumi () · T. B. Charrada · E. Fourati Worldline Company, Blois, France e-mail: [email protected]; [email protected]; [email protected] A. Chetouani PRISME Laboratory, University of Orléans, Orléans, France e-mail: [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_4


Moreover, they are privacy-compliant, as the user retains full control of his biometric data, with storage and matching performed on the user's device. The remainder of this chapter is organized as follows. In Sect. 2, we present the related work on anti-spoofing techniques, with a focus on IQA methods and deep learning approaches. Sections 3 and 4 detail the different stages and results of the IQA and CNN-based approaches, respectively. A discussion of the two proposed methods is presented in Sect. 5. Finally, Sect. 6 concludes the chapter and discusses further enhancements.

2 Related Work Depending on their biometric system level, biometric anti-spoofing methods may be classified into three groups: sensor level (hardware-based), feature level (software-based), or score level (software- and hardware-based) [1]. In this study, we focus only on software-based face techniques, which are integrated in the feature extraction module after the acquisition of a biometric sample. They include dynamic, static, and multimodal countermeasures. Dynamic anti-spoofing techniques analyze motion over a face video sequence to detect physiological signs of liveness such as eye blinking [2], eye movements [3], or lip movements [4]. Motion cues may also rely on the subtle relative movements between facial parts [5], the overall motion correlation between the face and the background regions [6], or motion estimation to detect planar media-produced attacks such as prints or screens [7]. Other dynamic approaches analyze the 3D structure of the face [8, 9], or simply prompt a specific action request that needs the cooperation of the user (challenge), such as a facial expression [10], head rotation [9], or mouth movement [4]. However, requiring user collaboration is not suited to the trend towards non-intrusive biometric authentication systems. Static methods focus on a single instance of the face instead of the video data. These techniques are generally faster and less intrusive than dynamic approaches, as they run in the background and thus do not involve user cooperation. The key idea is that a fake face is likely to have lower image quality than a genuine one. Therefore, most static methods are based on face appearance analysis, i.e., texture analysis-based techniques, using texture information or image quality measures (IQMs). The former may rely on descriptors such as local binary patterns (LBP), the Fourier spectrum, derivatives of Gaussians (DoG), and Histograms of Oriented Gradients (HOG). Other descriptors were exploited in [11] to analyze joint color-texture information in different color spaces. However, the main drawback of these approaches is the poor generalization of the features, as they can easily overfit the training data for a specific illumination or imagery condition [12]. The latter, i.e., IQMs, rely on the presence of quality artifacts in the attack media, often related to the differences between the reflectance properties of the exposed objects and genuine faces.


Existing IQMs may be divided into three groups: no-reference (NR), full-reference (FR), and reduced-reference (RR) metrics. NR-IQMs calculate a general assessment of one image based on a priori knowledge such as spatial analysis or pre-trained statistical models. In contrast, FR-IQMs compare the given image to an undistorted reference, also called a "pristine image"; this comparison can be related to error sensitivity, structural similarity, the mutual information between both images, or the human visual system. RR-IQMs use only partial information from the reference image. Because of the lower accuracy of static methods, some works have combined several IQMs to detect spoofing attacks. Using a parameterization of 25 IQMs, Galbally et al. [13] proposed a binary classification system to detect spoofing attacks for 3 biometric modalities (iris, fingerprint, and face recognition). Costa-Pazo et al. [14] thereafter studied and compared two presentation attack detection methods: the first is based on a subset of 18 reproducible IQMs from [13], while the second is a texture-based approach using Gabor jets. Deep learning architectures have also recently been explored for PAD. Feng et al. [15] proposed a multi-cues architecture that combines Shearlet-based image quality, face motion, and scene motion cues; they used a pre-trained layer-wise sparse autoencoder, fine-tuned with a softmax classifier and labeled data using backpropagation. Other approaches leverage deep convolutional neural networks (D-CNN) to detect features of morphed face images. Lucena et al. [16] used transfer learning in a CNN to generate bottleneck features. For that, their face anti-spoofing network (FASNet) uses the VGG-16 architecture pre-trained on the ImageNet database, with the top layers of VGG-16 changed to perform the binary classification for the anti-spoofing task. Raghavendra et al. [17] introduced a method based on pre-trained D-CNNs to detect both digital and print-scanned morphed face images. This approach fused features from two pre-trained D-CNN networks (AlexNet and VGG19), fine-tuned independently on digital and print-scanned morphed face images; the fused features were then classified using the probabilistic collaborative representation classifier (P-CRC). Rehman et al. [18] proposed a continuous data-randomization technique for deep models using the VGG11 architecture; their key contribution is a training algorithm that optimizes both convergence time and model generalization. Multimodal techniques have also been explored as a countermeasure to spoofing attacks. They rely on several biometric modalities for stronger authentication, such as the combination of face and voice [19], the correlation between voice and lip motions [4, 20, 21], or the combination of face, voice, and iris [22, 23]. However, such solutions may entail challenges such as inconvenience, cost, or a degraded user experience. In this chapter, we present and compare two static face anti-spoofing approaches: the first is based on IQA and the second combines a pre-trained CNN model and the LBP descriptor. Both developed methods will be detailed and compared to existing approaches.


3 Image Quality Assessment-Based Approach The proposed method follows three major stages: given a video of the user, we start by extracting specific frames using motion-based cues; IQMs are then computed on these frames; finally, the obtained values are used as inputs to a classifier with two outputs, real or fake face. These stages are detailed in the following sections.

3.1 Frame Extraction Since most IQA metrics compute quality measures on images rather than videos, the first step consists of extracting the frames on which the IQM-based features are calculated. Analogous works generally extract all the frames from the given video [13, 14]. In this study, in order to optimize the computation time and improve the frame comparison step (in case of slight modifications between adjacent frames), we propose to focus only on some specific frames. For this purpose, some authors proposed to extract half of the frames per second of each video [16]. More generally, an intuitive alternative would be to extract frames at a fixed time interval I for each input video. In [24], the authors report classification results using different values of I; a value of I = 72, yielding 4 frames per video, was chosen as the most efficient from a computation time perspective without degrading the accuracy of their classifier. Here, we propose to select the most relevant frames according to motion. The major motivation behind the use of motion cues is their intrinsic relevance in genuine samples compared to fake ones. For instance, micro-facial expressions are highlighted in [25] using a motion magnification pre-processing step, thus enhancing the performance of face spoofing attack detection; moreover, their features are partly based on motion estimation using optical flow. The latter, also called optic flow, provides an estimate of the 2D velocities of all visible points. Although it is widely used in motion estimation tasks, it is relatively heavy to compute. In [6], the authors suggest an approach that is more adequate for our scope of application, especially for print attacks. They outline a high correlation between the total movements of the face and of the background in the case of presentation attacks; in contrast, real samples exhibit a remarkably lower correlation between the movements of these two regions of interest. In our approach, we address this aspect by extracting a given frame (I2) only if it yields a minimum face-VS-background motion (FBM) compared to the previously extracted one (I1). The first frame of the video is considered to be the first baseline of comparison. The calculation of FBM is shown in Eq. (1), where th is a threshold used in the motion function given in Eq. (2). The motion along a region of interest (RoI) is measured by the normalized number of the region's pixels for which the difference of intensity values between the two images exceeds the threshold th; δ is the Dirac delta function, D is the pixel-wise difference between the two frames along the given RoI, and S_D is the number of its pixels.


have used “CascadeObjectDetector” provided by the vision toolbox of Matlab based on the Viola-Jones algorithm for face detection. This method has proven to ensure satisfying results in terms of speed and accuracy [26]. The background corresponds here to the subtraction of the detected RoI (face part) from the whole image. It should be noted that th and FBM _ min values were set up after several empirical tests on Replay Mobile Database and are fixed, respectively, to 15 and 1.01. F BM =

Motion (F ace, I 1, I 2, th) Motion (Background, I 1, I 2, th) 

Motion (RoI, I 1, I 2, th) =

x,y

δ (D (x, y) − th) SD

(1)

(2)

Figure 1 represents an example of 2 consecutively extracted frames. In this particular case, the blinking action yields a relevant movement of the face compared to the background. Nevertheless, this is not the only scenario where the FBM exceeds the considered threshold; in fact, different subjects exhibit different behaviors regarding actions like blinking, smiling, etc. During our experiments, we noticed that natural actions such as breathing or swallowing sometimes confine the FBM to small values, as the throat is considered to be a part of the background. This indicated that we should not rely solely on the restricted RoI provided by the Viola-Jones algorithm. We have thus extended the face region in both vertical directions, as shown by the plotted bounding box, to include the neck and the hair as assets for potentially relevant small movements. Thanks to this selective frame extraction, we exploit liveness-related motion cues while reducing the computational cost. Indeed, the processing of the motion estimation of the face, the background, and the ratio between them (FBM) only takes about 0.0044 s, which represents 0.62% of the computation time for the considered sample. Therefore, the additional computational cost related to motion processing is negligible compared to the efficiency gained by computing quality measures on specific frames.

Fig. 1 Motion-based frame extraction
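A hedged Python/OpenCV sketch of this motion-based frame selection (the chapter's implementation uses Matlab); the threshold values follow the text (th = 15, FBM_min = 1.01), while the Haar-cascade detector and the omission of the vertical box extension are simplifying assumptions.

```python
import cv2
import numpy as np

TH, FBM_MIN = 15, 1.01
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")  # Viola-Jones detector

def motion(prev_roi, curr_roi, th=TH):
    """Eq. (2): normalised count of pixels whose intensity difference exceeds th."""
    diff = np.abs(prev_roi.astype(np.int16) - curr_roi.astype(np.int16))
    return float(np.count_nonzero(diff > th)) / diff.size

def fbm(prev_gray, curr_gray, face_box):
    """Eq. (1): face-VS-background motion ratio between two frames."""
    x, y, w, h = face_box
    mask = np.zeros(prev_gray.shape, dtype=bool)
    mask[y:y + h, x:x + w] = True                       # face region (not extended here)
    face_motion = motion(prev_gray[mask], curr_gray[mask])
    background_motion = motion(prev_gray[~mask], curr_gray[~mask])
    return face_motion / max(background_motion, 1e-6)   # guard against division by zero

def select_frames(frames):
    """Keep a frame only if its FBM w.r.t. the last kept frame reaches FBM_MIN."""
    kept = [frames[0]]                                   # first frame is the baseline
    baseline = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray)
        if len(faces) > 0 and fbm(baseline, gray, faces[0]) >= FBM_MIN:
            kept.append(frame)
            baseline = gray
    return kept
```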


3.2 IQM Calculation In this study, we propose to exploit FR- and NR-IQMs to detect spoofing attacks. For FR metrics, the full availability of the reference image is required. This represents a challenge for biometric authentication, where only the storage of a template or a model of the original biometric data is allowed. In some analogous face anti-spoofing works, such as [27] and [13], the authors circumvent this limitation by computing the quality between the original reference image and a smoothed version of it. This approach relies on the assumption that real and fake samples react differently to the quality loss produced by the blurring effect, and this difference therefore brings discriminative power to the system for distinguishing between genuine and synthetic face appearances. In our model, FR metrics are used differently: we propose to use them to compute the similarity between each pair of consecutively extracted frames. To the best of our knowledge, this approach has not been considered in previous anti-spoofing studies. The idea behind comparing the frames of the same video is to highlight their motion-based differences, since these differences are likely to be more relevant within genuine videos than in fake ones. As illustrated in Fig. 2, for a set of N frames extracted from a given video, we characterize each of the last N − 1 frames by its comparison with its previous frame, according to the considered FR metrics. The use of NR-IQMs is simpler since they only need a single image, so the NR-IQMs are calculated on this same set of N − 1 frames. The final feature vector is composed of the combination of both sets of extracted values (FR and NR). It should be noted that when testing our method, we have considered both all-frame and motion-based frame extractions; in both cases, the aforementioned approach is applied to the considered frame set.

Fig. 2 Feature-vector construction for a set of N frames extracted from a given video, combining FR- and NR-IQMs
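A small sketch of this feature-vector construction; PSNR and SSIM stand in for the FR measures and two trivial statistics stand in for the NR measures of the full set described in Sect. 3.3, and grayscale uint8 frames are assumed.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fr_iqms(prev_frame, frame):
    """Full-reference measures: compare a frame to the previously extracted one."""
    return [peak_signal_noise_ratio(prev_frame, frame),
            structural_similarity(prev_frame, frame)]

def nr_iqms(frame):
    """No-reference measures computed on the frame alone (placeholders for BRISQUE, BIQI, ...)."""
    return [float(frame.mean()), float(frame.std())]

def build_feature_vectors(frames):
    """One vector per frame for the last N - 1 extracted frames (cf. Fig. 2)."""
    vectors = []
    for prev_frame, frame in zip(frames[:-1], frames[1:]):
        vectors.append(np.array(fr_iqms(prev_frame, frame) + nr_iqms(frame)))
    return np.stack(vectors)               # shape: (N - 1, n_features)
```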


3.3 Selected Image Quality Measures (IQMs) The study of Galbally et al. [13] proposed a set of 25 IQMs to detect spoofing attacks. In [14], the authors considerably improved the classification results using 18 of these IQMs, namely MSE, PSNR, AD, SC, NK, MD, LMSE, NAE, SNRv, RAMDv, MAS, MAMS, SME, GME, GPE, SSIM, VIF, and HLFI; the latter is the only NR-IQM among this subset. We have used these 18 IQMs as our starting point, and simulated other publicly available IQMs listed in Table 1. Among these 6 metrics, we have only kept those that represent a good tradeoff between accuracy and computation time. Figure 3 compares their per-image feature-computation time and their individual performance when considering 10 frames per video for the training and testing samples. The time was measured on a 64-bit Windows 7 PC with a 2.50 GHz processor and 16 GB of RAM, running Matlab R2017a. The 3 fastest IQMs ensure an acceptable performance. HIGRADE-1 and Robust BRISQUE yield lower error rates, but they are relatively slow. NIQE gives the worst performance among these 6 methods. In order to ensure a compromise between speed and accuracy, we have selected BRISQUE, GM-LOG-BIQA, and BIQI, in addition to the initial 18-IQM set used in [14]. It is worth mentioning that for the BIQI metric, we have only considered the first-stage values; thus, each image is characterized by the statistical model of 18 wavelet coefficients. The GM-LOG-BIQA metric yields 40 features for each image sample, contributing to the classification model.

Table 1 List of IQMs simulated in our set of experiments in addition to the 18 IQMs used in [14]

#  Metric                                                                                  Acronym          Ref
1  Blind/Referenceless Image Spatial QUality Evaluator                                     BRISQUE          [28, 29]
2  Gradient-Magnitude map and Laplacian-of-Gaussian based Blind Image Quality Assessment   GM-LOG-BIQA      [30]
3  Blind Image Quality Index                                                               BIQI             [31, 32]
4  Naturalness Image Quality Estimator                                                     NIQE             [33, 34]
5  Robust BRISQUE index                                                                    Robust BRISQUE   [35]
6  HDR Image GRADient based Evaluator-1                                                    HIGRADE-1        [36]

Fig. 3 Comparison of 6 IQMs in terms of computation time (in sec) of one image’s features and LDA classification error rates, using 10 frames per video


Fig. 4 General diagram of our IQM-based method: training data, motion-based frame extraction, NR and FR IQM computation, and classification as real or fake

For the BRISQUE metric, we have only considered the 18 feature values yielded in the first stage. The computed values are then used as inputs to classify each selected frame as real or fake. Note that the classification model is built only on the training set, without any overlap with the test set. A majority rule is applied to the frame predictions in order to generate a decision for the whole video. The different stages of the proposed approach are summarized in Fig. 4.
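A small sketch of this per-video decision rule; the scikit-learn LDA classifier stands in for the Matlab implementation used in the chapter, and the label convention (1 = real, 0 = fake) is an assumption.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def video_decision(classifier, frame_features):
    """frame_features: (n_frames, n_iqm_features) vectors of one video."""
    frame_predictions = classifier.predict(frame_features)   # 1 = real, 0 = fake
    return int(np.mean(frame_predictions) >= 0.5)             # majority rule over the frames

# Usage (X_train, y_train are per-frame IQM vectors and labels, assumed available):
# clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
# video_label = video_decision(clf, iqm_features_of_one_video)
```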

3.4 Experimental Results This section is devoted to the evaluation of the robustness of our method on publicly available databases, using the evaluation protocol described in Sect. 3.4.2.

3.4.1 Datasets

Replay Attack The Replay Attack database [37], released by the Idiap Research Institute, contains 1300 videos of 50 identities. In addition to real-access samples, it includes three types of attack: photos printed on paper, videos displayed on an iPhone 3GS screen, and videos displayed on a high-resolution screen using an iPad. For each attack type, samples were generated according to two scenarios, depending on whether the attack device/paper is hand-held or fixed on a support. The videos, with a minimum duration of 9 s, were recorded using a 13-inch MacBook at a resolution of 320 × 240, under different lighting conditions: controlled (the background of the scene is uniform and the scene is illuminated by a fluorescent lamp) and adverse (the background of the scene is non-uniform and the scene is illuminated by daylight). Figure 5 shows some frames of the captured videos. Replay Mobile The Replay Mobile database [14] was also released by the above-cited institute. It consists of 1190 videos of 40 subjects. Likewise, Replay Mobile has been split into subsets using the same structure as Replay Attack, and it also includes different lighting conditions.


Fig. 5 Samples from Replay Attack database. In the top row, samples from the controlled scenario. In the bottom row, samples from the adverse scenario. Columns from left to right show examples of real access, printed photograph, mobile phone, and tablet attacks

Fig. 6 Samples from Replay Mobile database. Top row: samples from attack attempts captured on a smartphone. Bottom row: samples captured on a tablet. Columns, from left to right, show examples of matte screen-light on, matte screen-light off, print-light on, and print-light off, respectively

The videos were recorded using an iPad Mini 2 and an LG G4 smartphone, at a resolution of 1280 × 720 and for at least 10 s. Figure 6 shows examples of the attack attempts. Since all the videos were recorded and displayed using mobile devices, this database is particularly well adapted to our framework, as the targeted implementation of our face anti-spoofing system is a mobile application.

3.4.2 Evaluation Protocol

For performance evaluation, we use the false acceptance rate (FAR), i.e., the percentage of fake samples classified as real; the false rejection rate (FRR), denoting the opposite case; and the half total error rate (HTER), the average of the two, as an overall performance indicator.


For frame classification, two classifiers were considered: the Matlab implementation of linear discriminant analysis (LDA), with the discrimination type set to pseudo-linear, and the LIBSVM release [38] of the support vector machine (SVM) with a radial basis function (RBF) kernel and a parameter γ = 1.5.
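A brief sketch of the evaluation metrics defined above; the label convention (1 = real access, 0 = attack) is an assumption.

```python
import numpy as np

def far_frr_hter(y_true, y_pred):
    """FAR, FRR and HTER in percent (labels: 1 = real access, 0 = attack)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    far = 100.0 * np.mean(y_pred[y_true == 0] == 1)   # attacks accepted as real
    frr = 100.0 * np.mean(y_pred[y_true == 1] == 0)   # real accesses rejected
    return far, frr, (far + frr) / 2                  # HTER is their average
```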

3.4.3 Results on Replay Attack Database

Using the previously described protocol for feature combination, we have obtained the error rates on the Replay Attack database detailed in Table 2, where "All" and "Motion" denote the use of all frames and the restriction to the selected ones, respectively. Except for the last two columns (HTER-v), all error rates are calculated on a per-frame basis. Questions may arise here about the fairness of comparing the "All" and "Motion" approaches, as the corresponding performance indicators do not rely on the same number of samples. We have therefore also reported the video-based error rates (HTER-v), where a result is associated with each video by applying a majority rule to the occurrences of real and fake predictions within its frames. The rows labelled "Motion" correspond to the motion-based frame subset. We recall that the HTER-f obtained in [14], based on the initial 18-feature set on all frames and using LDA classification, is 9.78%. The slight difference between this value and the ones we have obtained for the same feature set on all frames might be due to the difference in the feature calculation approach: as explained in Sect. 3.2, they apply a Gaussian filter to the given image in order to obtain a smoothed version of it, and the original and smoothed images are compared by the FR methods, whereas we compute the FR-IQMs between the frames of the video, highlighting motion cues that are more likely to be relevant in genuine samples. The most interesting finding in these results is that our method outperforms by far the initial 18-feature set classification, as the global error rate declines from around 9% to much less than 1% for LDA classification. In contrast, the SVM results are generally less suited to our testing protocol. We also note that although the lowest error rate in Table 2 (i.e., an HTER of 0.01%) is only reached using our set of features on all the frames, we achieve a very close HTER on the motion-based frames (i.e., 0.02%). This compromise is favored by the reduction of the computational cost: on average over all the training and testing video samples of the Replay Attack and Replay Mobile databases, we extract 134 (out of 375) and 172 (out of around 300) frames per video, respectively. Besides, this slight degradation in the frame-based errors does not affect the accuracy of the overall prediction for our method, as it correctly classifies all the test-set videos of Replay Attack for both the "All" and "Motion" protocols. This is not the case for the initial 18-IQM set, which confirms the efficiency of our method.


Table 2 LDA and SVM classification results on Replay Attack database

IQM              Video frames   FAR            FRR            HTER-f         HTER-v
                                LDA    SVM     LDA    SVM     LDA    SVM     LDA    SVM
Initial 18 IQM   All            2.21   0.00    16.68  24.22   9.44   12.11   6.25   8.33
                 Motion         3.16   0.00    16.41  26.36   9.78   13.18   7.92   9.34
Final 21 IQM     All            0.00   0.00    0.01   24.22   0.01   12.11   0.00   8.33
                 Motion         0.00   0.00    0.05   26.36   0.02   13.18   0.00   9.34

Table 3 LDA and SVM classification results on Replay Mobile database

IQM              Video frames   FAR            FRR            HTER-f         HTER-v
                                LDA    SVM     LDA    SVM     LDA    SVM     LDA    SVM
Initial 18 IQM   All            3.26   0.00    8.15   36.23   5.71   18.11   4.13   18.21
                 Motion         1.25   0.00    9.71   31.85   5.48   15.92   5.75   18.05
Final 21 IQM     All            0.05   0.00    3.65   36.23   1.85   18.11   1.65   18.21
                 Motion         0.07   0.00    1.66   31.85   0.86   15.92   0.79   18.05

3.4.4 Results on Replay Mobile Database

The classification error rates on the Replay Mobile database, detailed in Table 3, confirm the conclusions drawn previously. First, the final 21-IQM set again outperforms our initial baseline for LDA classification, since the overall error rate declines from above 4% to less than 1%. Second, SVM only yields relatively low errors with the initial 18 IQMs. The poor suitability of the SVM classifier to our 3 additional IQMs gives us a stronger motive to use LDA in our final implementation, especially given the relatively high computational cost of SVM compared to LDA. Finally, we also note that the lowest error rate (0.79%) is achieved using our IQM set with the "Motion" protocol, which again confirms the consistency of exploiting motion-based cues in frame extraction and feature calculation, as explained in Sects. 3.1 and 3.2. In a nutshell, the obtained results show that our method outperforms state-of-the-art solutions using the simple and relatively fast LDA classifier. Furthermore, we reduce the computational cost as we compute features on fewer samples, thanks to the selective frame extraction. Although the reduction in the number of frames highly depends on the level of face-VS-background motion of the considered video, the number of frames is approximately halved on average over all the samples of the considered databases. In the next section, we propose to explore deep learning for PAD, as it has proved to yield remarkably interesting results in related work.


4 Deep Learning-Based Approach As shown in Sect. 2, several architectures have been explored for PAD. In the following, we propose a static method that combines a pre-trained CNN model and the LBP descriptor.

4.1 Proposed Scheme The proposed scheme processes two inputs simultaneously, as illustrated in Fig. 7: the raw face images and their corresponding LBP descriptor. Face images pass through a set of convolutional layers which we deploy as feature extractors. We use the same architecture as VGG-16 [39] to retrieve a deep representation from raw faces. VGG is a convolutional neural network developed by the Visual Geometry Group at Oxford. It has 13 convolutional layers assembled into 5 blocks: the first block has 64 filters in each layer, the second 128 filters per layer, the third 256, and the fourth and fifth 512. All convolutional layers use a 3 × 3 kernel size, and each block is followed by a 2 × 2-window max-pooling layer. All activation layers are rectified linear units (ReLU). After these convolutional layers stands one fully connected (FC) layer composed of 256 neurons. The second input of our architecture is the LBP histogram of the frame. LBP [40] has proven to be very successful at preventing both 2D and 3D attacks. Its discriminative properties have been studied thoroughly [41]; we therefore propose to explore its contribution when combined with a deep CNN architecture. For that, the output of the last convolutional layer is concatenated with the LBP histogram of the frame before being passed to the FC layer, as shown in Fig. 7. For the decision-making process, LDA has been used as the classifier.

Fig. 7 Block diagram of the proposed scheme
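A minimal Keras sketch of this two-input scheme; the LBP histogram length (59 bins, assuming uniform LBP with 8 neighbours), the flattening of the convolutional output, the number of frozen layers and the sigmoid head are illustrative assumptions rather than the exact published configuration.

```python
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import VGG16

face_input = Input(shape=(96, 96, 3), name="face")
lbp_input = Input(shape=(59,), name="lbp_histogram")

backbone = VGG16(weights="imagenet", include_top=False, input_tensor=face_input)
for layer in backbone.layers[:11]:      # freeze roughly the first 3 blocks
    layer.trainable = False

features = layers.Flatten()(backbone.output)
merged = layers.Concatenate()([features, lbp_input])    # deep features + LBP histogram
fc = layers.Dense(256, activation="relu")(merged)       # the 256-neuron FC layer
output = layers.Dense(1, activation="sigmoid")(fc)      # real vs. attack

model = Model(inputs=[face_input, lbp_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```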


4.2 Pre-Processing Before proceeding to the training of our model, we carry out some pre-processing steps. For each image, we first detect the face using the OpenFace face detector [42]. Then, using Dlib library algorithms, the extracted face is aligned and centered based on the outer eye and nose positions [43]. Finally, the face image is resized to 96 × 96 pixels. For LBP histogram estimation, the whole face image is used as input.
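A hedged sketch of this pre-processing and of the LBP histogram extraction using dlib and scikit-image (the chapter uses the OpenFace detector; the landmark-model path and the uniform-LBP parameters are assumptions).

```python
import cv2
import dlib
import numpy as np
from skimage.feature import local_binary_pattern

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def preprocess(image_bgr):
    """Detect, align (eyes/nose based) and resize the face to 96 x 96 pixels."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    faces = detector(rgb, 1)
    if not faces:
        return None
    landmarks = predictor(rgb, faces[0])
    return dlib.get_face_chip(rgb, landmarks, size=96)    # aligned 96 x 96 crop

def lbp_histogram(face_rgb):
    """59-bin histogram of non-rotation-invariant uniform LBP (P=8, R=1) on the whole face."""
    gray = cv2.cvtColor(face_rgb, cv2.COLOR_RGB2GRAY)
    lbp = local_binary_pattern(gray, P=8, R=1, method="nri_uniform")
    hist, _ = np.histogram(lbp, bins=59, range=(0, 59), density=True)
    return hist
```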

4.3 Training Our training process is composed of two phases: a training step to fine-tune the pre-trained VGG model, and a joint-training step to train our final classifier. As training from scratch is time consuming and requires a lot of data, we used the VGG-16 model trained on the ImageNet dataset as a starting point. We then fine-tuned the model to adapt its weights to our datasets, freezing the first 3 blocks of VGG-16 and training the remaining layers, as illustrated in Fig. 8. The model was trained using Keras [44] with the following parameters:

• Optimization function: Adam with Beta1 = 0.9, Beta2 = 0.99
• Momentum: 0.1
• Learning rate: 0.0001
• Batch size: 100
• Input size: 96 × 96 × 3

In order to generalize the model and prevent overfitting, data augmentation was applied through random rotations, translations, and horizontal/vertical flips. Note that the employed datasets are already split into training and test sets by their authors. Finally, we joined the output of our trained CNN model and the corresponding LBP histograms to train the classifier.
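A hedged Keras sketch of this augmentation applied to the face-image input; the exact rotation and translation ranges are assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,        # random rotation (range assumed)
    width_shift_range=0.1,    # random horizontal translation
    height_shift_range=0.1,   # random vertical translation
    horizontal_flip=True,
    vertical_flip=True)

# Usage on the face images (face_images, labels are assumed to be available):
# batches = augmenter.flow(face_images, labels, batch_size=100)
```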

Fig. 8 Training procedure: the face image passes through frozen and trainable convolutional blocks, the resulting features are concatenated with the LBP histogram, and the combined vector feeds the FC256 layer and a sigmoid output


4.4 Experimental Results

4.4.1 Datasets

We evaluated our method on three publicly available databases. In addition to the Replay Attack and Replay Mobile datasets, we also tested our method on the CASIA face anti-spoofing (CASIA-FASD) database [45]. The latter contains 600 video clips of 50 genuine subjects, with fake faces made from high-quality recordings of the genuine faces. The key point is that this database considers a wider variety of attacks. Furthermore, different qualities are considered, namely low, normal, and high. It provides three types of attack: warped photo attack, cut photo attack, and video attack, as shown in Fig. 9.

4.4.2 Results

Our evaluation is based on the same metrics used for the assessment of our IQM-based approach, namely the false acceptance rate, the false rejection rate, and the half total error rate. In this section, we first compare the results obtained using two different classifiers (LDA and sigmoid). Then, we discuss the impact of adding LBP and compare the results with the state of the art.

Classification We tested two different classifiers: LDA and sigmoid. The sigmoid classifier denotes here the late layers of the initial FASNet architecture. Table 4 presents the classification results obtained for both classifiers on the Replay Attack, Replay Mobile, and CASIA-FASD databases; the table reports video-based results obtained using a majority vote. The results show that the LDA classifier clearly outperforms the sigmoid. Indeed, with the LDA classifier, our proposed approach achieves a perfect classification on the Replay Mobile and CASIA-FASD databases and denies all the imposter attacks on the Replay Attack database.

Fig. 9 Samples of spoofing attack images in the CASIA-FASD database. (a) Printed photo attack; (b) Printed photo with perforated eye regions; (c) Video replay [46]


Table 4 LDA and sigmoid classification results on Replay Attack, Replay Mobile, and CASIA-FASD databases

           Replay Attack         Replay Mobile         CASIA-FASD
           FAR (%)   FRR (%)     FAR (%)   FRR (%)     FAR (%)   FRR (%)
LDA        0         5           0         0           0         0
Sigmoid    0         5           0         0           0.781     8.62

Table 5 The obtained confusion matrices on Replay Attack database without and with LBP

         Without LBP                          With LBP
         Predicted real   Predicted attack    Predicted real   Predicted attack    Total
Real     80               0                   76               4                   80
Attack   4                396                 0                400                 400

Table 6 The obtained confusion matrices on Replay Mobile database without and with LBP

         Without LBP                          With LBP
         Predicted real   Predicted attack    Predicted real   Predicted attack    Total
Real     110              0                   110              0                   110
Attack   0                192                 0                192                 192

Deep Learning VS Deep Learning + LBP Estimation Results In order to better show the contribution of LBP, we present in Tables 5 and 6 the confusion matrices obtained for video classification with and without LBP histogram estimation on the Replay Attack and Replay Mobile databases. On the Replay Mobile database, perfect results were obtained both with and without LBP, while on the Replay Attack database we obtained 4 falsely accepted videos (imposter videos misclassified as real videos of genuine subjects) without LBP histogram estimation. However, when the LBP histograms are combined with the deep features, we obtain 4 falsely rejected videos (real videos of genuine subjects misclassified as imposter videos). So, LBP allows us to deny all the imposter attacks; rejecting genuine videos is less critical than the false acceptances obtained without LBP estimation.

Comparison With State-of-the-Art Methods Table 7 reports a comparison of the HTER of our approach with state-of-the-art deep learning based anti-spoofing methods on the Replay Attack, Replay Mobile, and CASIA-FASD databases. Our method reaches an HTER of 0% on the Replay Mobile and CASIA-FASD databases. On the Replay Attack database, the HTER obtained by our approach (2.5%) is higher than the HTERs of the multi-cues architecture [15] (0%), the FASNet architecture [16] (1.2%), and DDGL [47] (0%).


Table 7 HTERs (%) comparison with other state-of-the-art methods

Method                             Replay Attack   CASIA-FASD   Replay Mobile
DPCNN [48]                         6.1             4.5          –
Multi-cues Integration + NN [15]   0               –            –
FASNet [16]                        1.2             –            –
DDGL [47]                          0               1.3          –
LiveNet [18]                       5.74            –            –
Our approach                       2.5             0            0

Results are obtained from the respective research papers; '–' indicates that the HTER value is not available

Table 8 HTERs (%) of both developed methods

Method       Replay Attack   Replay Mobile
IQM-based    0.00            0.79
Deep-based   2.5             0

However, according to the results presented in Table 4, our method with the LDA classifier denies all the imposter attacks on all three tested databases.

5 Discussion Both approaches were evaluated using different datasets; Table 8 summarizes the obtained performance in terms of HTER. As we can see, our methods achieve good performance: the IQM-based method is better on the Replay Attack dataset, while the deep-based method obtains the best results on the Replay Mobile dataset. The latter integrates LBP features, which allowed all the imposters to be denied perfectly for both datasets (FAR = 0, see Tables 5 and 6). Concerning the computational cost, the IQM-based method needs to extract a number of features, and the IQM extraction step takes around 0.4 s per frame (see Fig. 3). The deep-based method only needs to extract the LBP and the deep features; neither is time consuming, since LBP features can be extracted in real time [49] and the deep features are relatively fast to compute [50] as small patches (96 × 96) are used. It is worth noting that the computation time can be further optimized by using a method that accelerates the computation of CNNs [51] or by employing models dedicated to mobile or embedded systems (MobileNet [52]). Depending on the context and the available computation capacity, either the IQM-based method or the deep-based method will be employed. In our case, we focus on embedded systems, especially PAD in mobile applications; the developed deep-based method will therefore be preferred, since the best results were obtained with this approach (at least on the Replay Mobile database) and its computation cost can be clearly reduced.


6 Conclusions

In this chapter, two anti-spoofing methods were presented. The first is based on the combination of selected IQMs; the underlying idea is that image quality degrades when a fake image or video is presented to the biometric system. The second combines deep and handcrafted (LBP) features, which exploits the capacity of both strategies (automatic and manual features) to detect fake information (images or videos). Both approaches were evaluated on different datasets, and the obtained results outperform state-of-the-art approaches. Moreover, our methods are well suited for real-time mobile applications and are also privacy-compliant.

Although this study has reached its aim, some limitations should be considered. For the IQM-based method, the parameters of the motion-based frame extraction module were set empirically on the Replay Mobile database, as it is the closest to our application framework. In future work, we will test our methods on other databases commonly used in the anti-spoofing literature in order to study the generalisation of our approaches. We will also try to exploit motion information in order to select the most relevant frames; 3D CNN models could be used to leverage motion. Another idea worth exploring is to use CNN-based quality features instead of the handcrafted IQMs.

References 1. A. Anjos, J. Komulainen, S. Marcel, A. Hadid, M. Pietikainen, in Face Anti-Spoofing: Visual Approach, ed. by S. Marcel, M. Nixon, S. Z. Li (Springer, London, 2014), pp. 65–82, Chapter 4 2. G. Pan, Z. Wu, L. Sun, Liveness detection for face recognition, in Recent advances in face recognition (IntechOpen, Rijeka, 2008), pp. 109–124 3. H. Jee, S. Jung, J. Yoo, Liveness detection for embedded face recognition system. Int. J. Comput. Electr., Autom., Control Inf. Eng. 2(6), 2142–2145 (2008) 4. K. Kollreider, H. Fronthaler, M.I. Faraj, J. Bigun, Real-time face detection and motion analysis with application in liveness assessment. IEEE Trans. Inf. Forensics Secur. 2(3), 548–558 (2007) 5. K. Kollreider, H. Fronthaler, J. Bigun, Non-intrusive liveness detection by face images. Image Vision Comput. 27, 233 (2009) 6. A. Anjos, S. Marcel, Counter-measures to photo attacks in face recognition: a public database and a baseline, in 2011 International Joint Conference on Biometrics (IJCB) (2011), pp. 1–7. 7. W. Bao, H. Li, N. Li, W. Jiang, A liveness detection method for face recognition based on optical flow field, in 2009 International Conference on Image Analysis and Signal Processing (IEEE, 2009), pp. 233–236. 8. A. Lagorio, M. Tistarelli, M. Cadoni, C. Fookes, S. Sridharan, Liveness detection based on 3d face shape analysis, in International Workshop on Biometrics and Forensics (IWBF) (IEEE, 2013), pp. 1–4. 9. T. Wang, J. Yang, Z. Lei, S. Liao, S. Z. Li, Face liveness detection using 3d structure recovered from a single camera, in International Conference on Biometrics (ICB) (IEEE, 2013), pp. 1–6 10. E.S. Ng, A.Y.S. Chia, Face verification using temporal affective cues, in IEEE International Conference on Pattern Recognition (ICPR) (IEEE, 2012), pp. 1249–1252


11. Z. Boulkenafet, J. Komulainen, A. Hadid, Face anti-spoofing based on color texture analysis, in IEEE International Conference on Image Processing (ICIP) (IEEE, 2015), pp. 2636–2640 12. T.de. Freitas Pereira, A. Anjos, J.M. De Martino, S. Marcel, Can face anti-spoofing countermeasures work in a real world scenario? in International Conference on Biometrics (ICB) (IEEE, 2013), pp. 1–8 13. J. Galbally, S. Marcel, J. Fierrez, Image quality assessment for fake biometric detection: application to iris, fingerprint, and face recognition. IEEE Trans. Image Process. 23(2), 710– 724 (2014) 14. A. Costa Pazo, S. Bhattacharjee, E. Vazquez Fernandez, S. Marcel, The replay-mobile face presentation-attack database, in International Conference of the Biometrics Special Interest Group (BIOSIG) (IEEE, 2016), pp. 1–7 15. L. Feng, L.M. Po, Y. Li, X. Xu, F. Yuan, T.C.H. Cheung, K.W. Cheung, Integration of image quality and motion cues for face anti-spoofing: a neural network approach. J. Visual Commun. Image Representation 38, 451–460 (2016). https://doi.org/10.1016/j.jvcir.2016.03.019 16. O. Lucena, A. Junior, V. Hugo G Moia, R. Souza, E. Valle, R. De Alencar Lotufo, Transfer learning using convolutional neural networks for face anti-spoofing, in Image Analysis and Recognition. ICIAR 2017. Lecture Notes in Computer Science, ed. by F. Karray, A. Campilho, F. Cheriet, vol. 10317 (Springer, Cham, 2017). 17. R. Raghavendra, K.B. Raja, S. Venkatesh, C. Busch, Transferable deep-CNN features for detecting digital and print-scanned morphed face images, in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (IEEE, 2017), pp. 1822–1830 18. Y.A. Ur Rehman, P. Lai Man, M. Liu, LiveNet: Improving features generalization for face liveness detection using convolution neural networks. Expert Syst. Appl. 108, 159 (2018). https://doi.org/10.1016/j.eswa.2018.05.004 19. H.T. Cheng, Y.H. Chao, S.L. Yeh, C.S. Chen, H.M. Wang, Y.P. Hung, An efficient approach to multimodal person identity verification by fusing face and voice information, in IEEE International Conference on Multimedia and Expo (IEEE, 2005), pp. 542–545 20. J. Komulainen, I. Anina, J. Holappa, E. Boutellaa, A. Hadid, On the robustness of audiovisual liveness detection to visual speech animation, in IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS) (IEEE, 2016), pp. 1–8 21. A. Melnikov, R. Akhunzyanov, K. Oleg, E. Luckyanets, Audiovisual liveness detection, in Image Analysis and Processing—ICIAP 2015. ICIAP 2015. Lecture Notes in Computer Science, ed. by V. Murino, E. Puppo, vol. 9280 (Springer, Cham, 2015) 22. P.H. Lee, L.J. Chu, Y.P. Hung, S.W. Shih, C.S. Chen, H.M. Wang, Cascading multimodal verification using face, voice and iris information, in IEEE International Conference on Multimedia and Expo (IEEE, 2007), pp. 847–850 23. T. Barbu, A. Ciobanu, M. Luca, Multimodal biometric authentication based on voice, face and iris, in E-Health and Bioengineering Conference (EHB) (IEEE, 2015), pp. 1–4 24. Y. Tian, S. Xiang, Detection of video-based face spoofing using LBP and multiscale DCT, in Digital Forensics and Watermarking. IWDW 2016. Lecture Notes in Computer Science, ed. by Y. Shi, H. Kim, F. Perez-Gonzalez, F. Liu, vol. 10082 (Springer, Cham, 2017) 25. S. Bharadwaj, T. Dhamecha, M. Vatsa, R. Singh, Computationally efficient face spoofing detection with motion magnification, in IEEE Conference on Computer Vision and Pattern Recognition Workshops (IEEE, 2013). 
https://doi.org/10.1109/CVPRW.2013.23 26. P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1 (IEEE, 2001), pp. I–511–I–518 27. Z. Akhtar, G. Luca Foresti, Face spoof attack recognition using discriminative image patches. J. Electr. Comput. Eng. 2016, 4721849-1–4721849-1 (2016) 28. A. Mittal, A.K. Moorthy, A.C. Bovik, BRISQUE software release (2011), http:// live.ece.utexas.edu/research/quality/BRISQUE_release.zip 29. A. Mittal, A.K. Moorthy, A.C. Bovik, No reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012). https://doi.org/10.1109/TIP.2012.2214050


30. W. Xue, X. Mou, L. Zhang, Blind image quality prediction using joint statistics of gradient magnitude and Laplacian features. IEEE Trans. Image Process. 23(11), 4850–4862 (2014) 31. A.K. Moorthy, A.C. Bovik, A modular framework for constructing blind universal quality indices. IEEE Signal Process. Lett. 17, 513 (2009) 32. A. K. Moorthy, A. C. Bovik, BIQI software release (2009), http://live.ece.utexas.edu/research/ quality/biqi.zip 33. A. Mittal, R. Soundararajan, A. C. Bovik, NIQE software release (2012), http:// live.ece.utexas.edu/research/quality/niqe.zip 34. A. Mittal, R. Soundararajan, A.C. Bovik, Making a completely blind image quality analyzer. IEEE Signal Process. Lett. 20, 209 (2012) 35. A. Mittal, A.K. Moorthy, A.C. Bovik, Making image quality assessment robust, in Conference Record of the Asilomar Conference on Signals, Systems, and Computers, Monterey, CA (IEEE, 2012), pp. 1718–1722. http://live.ece.utexas.edu/research/quality/robustbrisque_release.zip 36. D. Kundu, D. Ghadiyaram, A.C. Bovik, B.L. Evans, No-reference quality assessment of high dynamic range images, in 50th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 6-9 Nov 2016 (IEEE, 2016). http://users.ece.utexas.edu/~bevans/ papers/2017/crowdsourced/index.html 37. I. Chingovska, A. Anjos, S. Marcel, On the effectiveness of local binary patterns in face antispoofing, in International Conference of Biometrics Special Interest Group (BIOSIG) (IEEE, 2012), pp. 1–7 38. C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines (2001), http:// www.csie.ntu.edu.tw/~cjlin/libsvm 39. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in International Conference on Learning Representations (ICLR) (2015) 40. T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution grayscale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Int. 24(7), 971–987 (2002) 41. T. Ahonen, A. Hadid, M. Pietikäinen, Face recognition with local binary patterns, in Computer Vision - ECCV 2004. ECCV 2004, Lecture Notes in Computer Science, ed. by T. Pajdla, J. Matas, vol. 3021 (Springer, Berlin, 2004), pp. 469–481 42. B. Amos, B. Ludwiczuk, M. Satyanarayanan, in OpenFace: a general-purpose face recognition library with mobile applications. CMUCS-16-118, Technical report (CMU School of Computer Science, 2016) 43. E.K. Davis, Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009) 44. C. François (2015) Keras, https://keras.io/ 45. Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, S. Li, A face antispoofing database with diverse attacks, in 5th IAPR International Conference on Biometrics (ICB) (IEEE, 2012) 46. S.Y. Wang, S.H. Yang, Y.P. Chen, J.W. Huang, Face liveness detection based on skin blood flow analysis. Symmetry 9(12), 305 (2017). https://doi.org/10.3390/sym9120305 47. I. Manjani, S. Tariyal, M. Vatsa, R. Singh, A. Majumdar, Detecting silicone mask based presentation attack via deep dictionary learning. IEEE Trans. Inf. Forensics Security 12(7), 1713–1723 (2017). https://doi.org/10.1109/TIFS.2017.2676720 48. L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, A. Hadid, An original face anti-spoofing approach using partial convolutional neural network, in 6th International Conference on In Image Processing Theory Tools and Applications (IPTA), Oulu, Finland (2016), pp. 1–6 49. M.B. López, A. Nieto, J. Boutellier, J. Hannuksela, O. Silvén, Evaluation of real-time LBP computing in multiple architectures. 
J Real-Time Image Process 13(2), 375–396 (2017) 50. J. Johnson (2017), https://github.com/jcjohnson/cnn-benchmarks#vgg-paper 51. P. Wang, J. Cheng, Accelerating convolutional neural networks for mobile applications, in 24th ACM International Conference on Multimedia (ACM, 2016), pp. 541–545 52. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications (2017). https://arxiv.org/pdf/1704.04861.pdf

Deep Learning for Biometric Face Recognition: Experimental Study on Benchmark Data Sets Natalya Selitskaya, S. Sielicki, L. Jakaite, V. Schetinin, F. Evans, M. Conrad, and P. Sant

1 Introduction

Face recognition is a topic of intensive research at the intersection of pattern recognition and computer vision. Because it is natural for humans and non-invasive, face recognition has earned a leading place in numerous practical applications in information security, surveillance systems, law enforcement, access control, smart cards, etc. The most obvious and reliable techniques for face recognition are based on biometric measurements of the physical characteristics of the human face. More subtle biometric approaches exploit physiological and behavioural characteristics, including electroencephalogram signals and the expression of human emotions [30, 45].

Over three decades of research and development in face recognition, high performance has been achieved in many applications. Nevertheless, recognition methods are still affected by factors such as illumination, facial expression, and pose, as described in [41, 51, 54]. Efficient attempts to solve these problems have been made with Machine Learning (ML) as well as Deep Learning (DL) methods, which extract the hierarchical and semantic structures existing in images [9, 30].

In this light, a key property of ML is to find models capable of explaining the observed data and predicting new unseen samples represented by features which describe the given data, as discussed in [18, 35, 42]. Unfortunately, standard ML methods are not universally effective and still need manual adjustment of learning parameters [17, 33]. The pattern features can vary between data samples so that the

N. Selitskaya () · S. Sielicki · L. Jakaite · V. Schetinin · F. Evans · M. Conrad · P. Sant University of Bedfordshire, School of Computer Science and Technology, Luton, UK e-mail: [email protected]; https://www.beds.ac.uk/departments/computing © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_5


learning parameters have to be changed. As a result, optimal performance often cannot be achieved in the general case [43, 52]. Data preprocessing that captures only informative features significantly boosts ML performance, and the highest performance is usually found when different ML and feature extraction methods are explored in an ad hoc way, as described in [4, 18, 37, 40, 50]. In this regard, techniques such as Bag of Features (BOF), Speeded-Up Robust Features (SURF), and Histogram of Oriented Gradients (HOG) have been intensively explored [1, 34]. Conventional ‘shallow’ ANNs with a predefined structure cannot learn high-level features; feature learning (extraction and selection) can only be done within certain limits [45]. The ‘deep’ structures provided by Convolutional Neural Networks (CNNs), in contrast, have been capable of generating the features needed for solving face recognition problems [26].

In this chapter we experimentally explore face recognition techniques on benchmark data sets such as Extended Yale B, AT&T, Caltech 101, Essex Faces96 and Grimace, JAFFE, Gatech, and the newly created BEDS set. Our study aims to identify the strengths and weaknesses of face recognition algorithms under the different conditions represented by these benchmarks. Specifically, we run experiments with ‘shallow’ and ‘deep’ ML algorithms along with feature extraction methods commonly used in face biometrics. DL-based solutions require optimising a large set of parameters and so can be inefficient within the available computational resources. The experimental results are evaluated and discussed in terms of recognition accuracy, computational requirements, and robustness to variations in facial images. The findings allow us to conclude that, when shallow ML methods are combined with feature extraction techniques, they are competitive with DL solutions while not requiring massive computation.

The following Sect. 2 reviews research already done on the methods and data investigated in this study. Sections 3 and 4 describe the ML methods commonly used for face recognition that are explored in our study. Section 5 examines the hardware requirements needed for effective study in the area, summarises the facial image data sets used in the experiments, outlines the experimental setups, and presents the results of the ML algorithms on the benchmark data sets. Finally, Sects. 6 and 7 discuss and summarise the weak and strong characteristics of the algorithms observed on the given benchmarks.

2 Related Work

In this section we discuss the main feature extraction techniques which can be applied to face recognition problems.


2.1 Bag of Features

The Bag of Features (BOF) model generates a compact representation of large-scale image features in which the relations between the features are discarded. The representation may be seen either as a vector in a linear feature space or as a feature frequency histogram. The basis of the feature space, or the list of buckets of the histogram, also called the term vocabulary or visual word vocabulary, is constructed from the features extracted from the training set. Features extracted from the test set are associated with the nearest vocabulary terms, and a novel image is represented as a linear combination of the vocabulary terms and the frequencies of their occurrence [34]. BOF's popularity was built on its simplicity and the compactness of its spatially orderless collections of quantised local image descriptors, which makes BOF implementations computationally lightweight [34]. BOF has been adopted for computer vision and pattern recognition and, in particular, for face recognition studies [4, 37, 52] and facial emotion recognition [20].

BOF leaves significant freedom in selecting how visual features are quantised. There is, however, a requirement on the feature space: it has to support a concept of neighbourhood. Usually, linear spaces endowed with a metric defined by various distance functions are used to quantify the closeness of features; non-linear spaces can also be used in the general case. One of the fastest and most efficient feature detection and representation techniques for the BOF model, the Scale-Invariant Feature Transform (SIFT), is described in more detail below. Together with its SURF variant, it utilises a linear space of local gradient descriptors (64-dimensional for SURF) with the Euclidean metric defining the neighbouring relations between features. SIFT was applied in [25] for recognition of ageing faces because it relies on feature gradients which change little with age. However, faces in different poses, which could be linked to emotions, were challenging to recognise.

To address facial recognition errors caused by the variety of emotional expressions, the BOF method has been used with SIFT and SURF descriptors with an added spatial dimension that traces the facial region the features came from [20]. This achieved better results than spatially agnostic methods; the study also suggested using emotion recognition decisions to drive a movie recommendation system in future work. Reference [48] proposes another variant, a bag of aesthetic features, as a way of classifying images which could be useful for face recognition.

The computational time of face track retrieval was investigated in [4]. Here, a sparse form of BOF was adopted to retrieve facial images from a video archive; it did not achieve good retrieval times, despite a trade-off in model accuracy, because of the large size of the facial index. Another facial retrieval study [52] adopted BOF for extracting initial low-discriminating local features, combined with an additional multi-reference global feature rating approach with high discriminating power. Unlike the more common unsupervised vocabulary learning approach, a labelled


facial data set and spatially specific grids were used to build visual words of better discriminating quality. Though the retrieval speed was impressive, the authors recommended supervised training to further improve performance. A feature selection methodology alternative to clustering was proposed in [37], where a supervised RBF neural network was used to extract image features; the study used the YouTube Faces, ORL, and Extended Yale data sets. However, because many features were not used during training, an over-fitting problem arose, and CNNs were suggested for future studies.

2.2 Speeded-Up Robust Features

The Speeded-Up Robust Features (SURF) method is a robust and fast algorithm for the local detection, comparison, and representation of invariant image features which is popular in pattern recognition studies [2, 10]. SURF [1] is a further development of the popular SIFT feature detection algorithm mentioned above [28]. It employs simplification, reduction of the feature space dimensionality, and neighbouring pixel aggregation to achieve higher processing rates while keeping or even improving accuracy [36]. In particular, SURF feature detection and extraction methods have been successfully adopted for face recognition challenges such as face detection in the presence of pose [40] and illumination variations, non-alignment and rotation of facial features [10], or applied within hybrid models [19], as well as for localisation [40], which allows for effective generalisation [1, 36], 3D image reconstruction [40], and image registration [3].

Due to their impressive speed, SURF methods have been suggested for computer vision and pattern recognition applications such as online face recognition and banknote image recognition, as shown in [10] and [14]; these methods were capable of handling image features by using alignment and image rotation. In [23], the robustness of the SURF technique to image feature misalignment was studied, and the results indicated satisfactory performance on the ORL and UMIST data sets. Using a 3D facial representation [55, 56], the SURF technique was effective in projecting the facial features because of its robustness to rotation.

Another SURF implementation was investigated for the localisation of face landmarks under non-linear viewpoint and expression changes [40]. During the training phase, 3D face constellation models were built from the SURF feature descriptors, using an ANN to assign descriptors to 16 landmark classes of the 3D model. During the test phase, planar projections of the 3D models were used to match the test faces. Experiments on challenging data sets demonstrated the efficiency and robustness of the technique.

Another implementation of SURF was in a cascade network [24]. The use of SURF features significantly reduced the dimensionality of the model in comparison with wavelet techniques, and consequently reduced the training time. Experiments on the MIT and FDDB facial data sets were satisfactory, especially in terms of the


convergence. However, the training set size was not sufficient to prevent over-fitting of the cascade network, so it is unlikely to work well on new unseen face images. The advantage of SURF is its high processing speed, which is achieved thanks to the low dimensionality (from 64 to 128) of the feature space [10]. This alone is, however, not enough for robust face recognition under variations in illumination as well as in pose.

2.3 Histogram of Oriented Gradients

The Histogram of Oriented Gradients (HOG) is another gradient-based algorithm that extracts local gradient-based features from images. Unlike the sparse feature selection used in many location-unaware BOF implementations, HOG utilises a dense, evenly distributed spatial grid of descriptors. Each descriptor comprises an n-dimensional model of the edges and their orientations in its immediate vicinity. Gradients are calculated over small multi-pixel cells that are joined into cell blocks, over which histograms of gradients are computed. Such histograms can also be viewed as vectors in an n-dimensional rotational-angle space. The cell blocks overlap each other, and gradients are normalised over larger regions containing a number of blocks [6].

In [50], local and global HOG feature classifiers have been mixed with weights defined by their importance estimated in the regions of interest. Global and local feature descriptors are formed by concatenating HOG vectors in a combined high-dimensional space over the whole image or over regions, respectively, with subsequent PCA and LDA dimensionality reduction; Nearest Neighbours was used as the classifier in the resulting spaces. On the benchmark FERET and CAS-PEAL-R1 data sets, better performance was obtained than with other techniques such as the Gabor and Local Binary Pattern (LBP) feature extractors. A drawback of this study, however, was the manual adjustment of parameters. An improvement over manual adjustment was achieved by the adaptive method implemented in [54], a fusion of LBP and HOG features of faces learned by a multiple-kernel adaptive classifier. An interesting aspect is the capability of HOG to handle the edge and orientation aspects of pattern recognition, while LBP was responsible for the texture components of the face. The resulting improvement in performance was not achievable with HOG or LBP alone and was sufficient for face recognition in the ambient environment.

Apart from environmental factors, face images generally contain noise. In [53] an improved HOG method with a dense grid technique was suggested to handle all the facial noise problems at once; it outperformed the conventional Gabor and LBP methods it was compared with. A gap in the study was the lack of a more challenging data set, such as one with facial occlusion, which was recommended for future work. Beyond occlusion noise, face recognition has also been studied [13] under noise caused by compression during archiving and by blurred facial


images. HOG was adopted to provide spatial features used with an SVM, which resulted in improved recognition compared with a conventional HOG transform on the FDDB benchmark data sets. HOG was also studied in [46] beyond its orientation aspect: the CAS-PEAL and FRGS facial databases were used to compare the computational performance obtained with Gabor, LBP, and HOG features, of which HOG demonstrated the highest performance. The problem addressed in that study was achieving a trade-off between speed and accuracy. A further study combined HOG with SVM, as described in [7].

2.4 CNN Features

Deep Learning has recently received much attention for its significant impact on the efficiency of computer vision applications. DL can learn features and tasks directly from data and, as a result, is often referred to as end-to-end learning. The term ‘deep’ refers to the number of hidden layers in the network: the more hidden layers, the deeper the network. Conventional ‘shallow’ ANNs usually have two or three hidden layers, while deep networks can have a few hundred.

A CNN exploits deep neural network structures to build reliable features from local image patches. Because images fed into a CNN share similar structure, a network trained elsewhere can be reused via Transfer Learning: the CNN can be quickly retrained on a data set of interest in order to adjust the network parameters and improve performance. In [5] a CNN was applied to facial recognition based on the orientation of sensors with respect to the facial images; no separate feature extraction was applied, only the CNN itself was designed. Another study adopted a CNN to distinguish fake faces from real faces modelled in a 3D environment [27]; the required regularisation of the CNN was achieved by minimising the number of parameters, as suggested in other multidimensional studies.

3 Feature Extraction Techniques

3.1 Bag of Features

Bag of Features (BOF) is an extension of the technique known as Bag of Words. It uses a feature dictionary to represent the patterns of interest in a given image as a multi-set (or bag) of ordered pairs of features and their numbers of occurrences. The BOF model generates a compact representation of large-scale image features in which the spatial relations between features are usually discarded [34].


A BOF dictionary of a manageable size can be constructed from the large-scale features extracted from the training set images by clustering similar features, as described in [34]. Features extracted from images are associated with the closest dictionary term using Nearest Neighbours or similar techniques in the feature parameter space. Matches between the image features and the dictionary features are normalised and evaluated. The result is a vector representation of the given j-th image I_j in the dictionary feature space as follows:

I_j = {a_1j v_1, a_2j v_2, . . . , a_nj v_n},    (1)

where v_i are the feature dictionary basis vectors, and a_ij are the numbers of occurrences in the image I_j normalised by the total number of features in the image. The BOF technique offers freedom in the selection and representation of the image features; however, the choice of the BOF parameters is crucial for achieving high accuracy, robustness, and speed of recognition, as described in [4, 20, 48].

The Speeded-Up Robust Features (SURF) method is a robust and fast gradient-based sparse algorithm for the local detection, comparison, and representation of invariant image features [2, 10]. SURF, described in [1], is a further development of the exhaustive scale-invariant feature transform (SIFT) detection algorithm proposed in [28]. It provides a computationally efficient reduction of the feature space dimensionality as well as aggregation of neighbouring pixels, keeping or even improving the recognition accuracy [36].

The SURF-SIFT family of algorithms combines two phases: (1) image feature detection and (2) feature description. The feature detection phase uses the convolution of a Gaussian filter, to smooth out feature noise, with a second-order gradient operator to extract edges, corners, blobs, and other candidate structures, among which the most promising ones are selected as maximal extrema that pass the optional noise filters. The process is repeated for multiple image scales. The intensity gradient maximum and its location in an adjacent planar subspace of the scale space are interpolated to extract robust, persistent points of interest. For the rotation-invariant variations of the algorithms, the orientations of the features producing the best gradient extrema are also detected [1].

To avoid sub-sampling recalculations at various scales, the SURF algorithm uses the integral image, which is calculated by summing the intensities of all pixels in the box from the origin to the current point, so that the intensity of any rectangular area can be retrieved in constant time. Instead of Gaussian filters, approximating box filters are used. Following [1, 36], the scale-normalised determinant of the Hessian matrix is used to estimate the second-order gradient:

det(H_approx) = D_xx D_yy − (0.9 D_xy)^2,    (2)

where D_xx is the box filter approximation of the second-order Gaussian derivative at the i-th pixel location x_i = (x_i, y_i) at scale σ of the image I, and D_yy and D_xy are defined analogously. The feature representation phase is built on feature descriptors that characterise the neighbourhood of each interest point at all scales and


their dominant directions. The convolution of the Gaussian filter with first-order gradient measures in various directions over the neighbouring grid components comprises the feature space basis [1]. In particular, the SURF algorithm estimates the Gaussian and Haar wavelet convolutions using box filters in a neighbourhood radius defined by the scale, over a number of rotational steps. The orientation producing the maximum linear sum of the different Haar components is chosen as dominant. Finally, in the chosen direction a 64-dimensional descriptor is built using the Haar components and their absolute values, calculated for each cell of the 4 × 4 grid of the t-th region of the image I:

v_t = {. . . , Σ d_x,i, Σ d_y,i, Σ |d_x,i|, Σ |d_y,i|, . . . },    (3)

where Σ d_x,i is the sum of the Haar wavelet detail responses in the horizontal direction over the pixels of the i-th cell of the region, i ∈ {1, . . . , 16}, Σ |d_x,i| is the sum of the absolute values of these responses, and the subscript y denotes the corresponding vertical Haar wavelet responses, as described in [1].
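A minimal MATLAB sketch of this detection-plus-description pipeline with the Computer Vision Toolbox; the input file name is a placeholder and the default parameters are assumptions rather than the settings used later in the experiments:

  I = rgb2gray(imread('face.jpg'));            % placeholder colour image file
  points = detectSURFFeatures(I);              % Hessian-based interest points, cf. Eq. (2)
  [descriptors, validPoints] = extractFeatures(I, points);   % 64-dimensional vectors, cf. Eq. (3)
  disp(size(descriptors))                      % N x 64, one row per interest point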

3.2 Histogram of Oriented Gradients Feature Descriptor

The Histogram of Oriented Gradients (HOG) is another algorithm which extracts local gradient-based features from images. Unlike the sparse feature selection used in the SURF-SIFT family, HOG utilises an evenly distributed and overlapping spatial grid of descriptors. Each descriptor comprises an n-dimensional model (usually n = 9) of the edges and their orientations in its immediate vicinity. Edges and other points of interest are detected using first-order gradients approximated by Haar-like wavelet filters in n directions between 0° and 180°. Gradients are calculated over small multi-pixel rectangular or radial cells and then convolved with local Gaussian filters. For each cell, an n-bin histogram of gradient votes is calculated. The cell-level histograms are joined into multi-cell block-level histograms; such a histogram can be viewed as a vector in an n-dimensional rotational-angle space. The cell blocks overlap each other, and gradients are normalised over larger regions containing a given number of blocks. For image classification, a soft linear SVM can be used, as described in [6].

According to [46], for a given cell C_t represented as a set of pixels p, the histogram vector v_t = {b_t0, b_t1, · · · , b_t(n−1)} can be calculated as follows:

b_ti = Σ {g_p : p ∈ C_t, θ_p ∈ [iθ_s − θ_s/2, iθ_s + θ_s/2]} / |C_t|,    (4)

where g_p and θ_p are the gradient of the intensity and the direction of the gradient at pixel p, and θ_s is the bin sector angle.


Note that according to [6] the non-Gaussian gradients calculated with [−1, 0, 1] masks provide the best performance of the above technique.
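As an illustration of the descriptor just defined, the following MATLAB sketch extracts HOG features on a dense 8 × 8 grid with 9 orientation bins; the image name and parameter values are assumptions for illustration only, not the exact configuration used in the later experiments:

  I = imread('cameraman.tif');                                   % any grey-scale test image
  [hogFeat, vis] = extractHOGFeatures(I, 'CellSize', [8 8], 'NumBins', 9);
  imshow(I); hold on;
  plot(vis);                                                     % overlay the per-cell oriented-gradient histograms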

4 Artificial Neural Networks and Deep Learning

4.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) have been shown to be efficient in learning feature representations, see e.g. [12, 16]. For example, an ANN can be used for a generalised multivariate piecewise regression f:

f : X ⊂ R^m → Z ⊂ R^n.    (5)

In particular cases, an ANN implements a multivariate linear regression:

y = f(x) = Wx, ∀x ∈ X ⊂ R^m, ∀y ∈ Y ⊂ R^n,    (6)

where W ∈ W ⊂ R^n × R^m is the adjustable coefficient matrix. ANNs with more than one hidden layer are called Deep Neural Networks (DNNs), and the corresponding ML methods are called Deep Learning (DL). The models of the real world which DNNs implement have a large number of learning parameters mapping ANN layers of various architectures, and can therefore emulate reality more accurately; this is typically achieved by introducing a non-linear activation function. The learning parameters W are mapped onto a loss function:

l : W → R,    (7)

where W = W_1 × W_2 × · · · × W_n and n is the number of ANN layers. The aim of ANN training is to minimise the loss function. In the case of linear regression (a one-hidden-layer ANN with a linear activation function), the loss function is

l(W) = (X̂W − Ŷ)^T (X̂W − Ŷ).    (8)

According to [15], it is easy to find an analytic solution for the matrix W:

W = (X̂^T X̂)^(−1) X̂^T Ŷ.    (9)

For general cases of loss function minimisation, numerical solutions have been developed using information about error gradients, see e.g. [12]. Gradients of the loss function are introduced with respect to the learning parameters W and the inputs


x at each layer, making the learning algorithms efficient in terms of approximation accuracy and computational time. The derivative chain rule is the core of the learning algorithms, known as backpropagation, which are used for solving the above minimisation problem [38]. Given an ANN function f with hidden and output layers g and h, the inputs x are mapped onto the outputs z as follows:

f : X ⊂ R^m → Z ⊂ R^n,    (10)

where y = g(x) and z = h(y). The derivative of the output with respect to the input x_i is

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).    (11)

The above can be written in vector form as

∇_x z = J_g(x)^T ∇_y z,    (12)

where J_g is the Jacobian operator. The gradient of the loss function l : Y ⊂ R^k → R with respect to the matrix W can be calculated as described above if W is flattened to a vector w. A similar approach can be used for an arbitrary dimension of W as follows:

∇_W l = J_g(W)^T ∇_y l.    (13)

A gradient descent learning algorithm described in [31] can be implemented as follows:

W_{t+1} = W_t − α ∇_W l + γ (W_t − W_{t−1}),    (14)

where t is the iteration index, α is the learning rate, γ is the momentum coefficient, and ∇_W l is the gradient of the loss function with respect to the learning matrix W.
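To make Eq. (14) concrete, the toy MATLAB sketch below applies the momentum update to a least-squares linear regression; the synthetic data, the loss, and the parameter values are assumptions chosen purely for illustration:

  X = randn(100, 5);                                   % synthetic inputs
  Y = X * [1; -2; 0.5; 3; -1] + 0.1 * randn(100, 1);   % synthetic targets
  W = zeros(5, 1);  Wprev = W;
  alpha = 0.01;  gamma = 0.9;                          % learning rate and momentum
  for t = 1:200
      gradW = 2 * X' * (X * W - Y) / size(X, 1);       % gradient of the squared loss, cf. Eq. (8)
      Wnext = W - alpha * gradW + gamma * (W - Wprev); % the update of Eq. (14)
      Wprev = W;  W = Wnext;
  end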

4.2 Convolutional Neural Networks

When general-purpose DNNs are used for image recognition, the need to address problems such as the high dimensionality of the input and the weak control over feature selection and processing led to the development of more specific DNN architectures. One of the popular DNN architectures for image and signal recognition is the Convolutional Neural Network (CNN). CNN-specific techniques include the use of local receptive fields—local neighbouring pixel patches that


are connected to a few neurons of the next layer. Such an architecture steers the CNN towards extracting locally concentrated features and can be implemented using the Hadamard product of the weight matrix and the receptive field masks, y = (M ⊙ W)x. Similarly to the feature extraction techniques, signals from the local receptive field can be convolved with kernel functions. The convolution operator conv : Z^2 → R of the image I pixel intensity u at the (i, j)-th pixel location with the filter kernel v can be defined as follows:

x = conv(i, j) = (u ∗ v)(i, j) = Σ_{(k,l)∈I} u(i − k, j − l) v(k, l),    (15)

where u : I ⊂ Z^2 → R, v : I ⊂ Z^2 → R, ∀(i, j) ∈ Z^2, ∀(k, l) ∈ Z^2, ∀x ∈ R. To process local features, shared learning parameters can be used instead of the weight matrix, exploiting a Hadamard product with a weight vector, y = (w ⊙ U)^T x. To reduce dimensionality, the deeper layers average the neighbouring nodes and use a pooling function, dropping or down-sampling the number of outputs, according to [12, 22].
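The discrete convolution of Eq. (15) is directly available in MATLAB; the sketch below applies a 3 × 3 averaging kernel, where the kernel choice and the test image are arbitrary assumptions for illustration:

  u = double(imread('cameraman.tif'));   % image intensities u(i, j)
  v = ones(3) / 9;                       % filter kernel v(k, l): a 3x3 averaging box
  x = conv2(u, v, 'same');               % x(i, j) = sum over (k, l) of u(i-k, j-l) v(k, l)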

4.3 Pairwise Neural Network Structures

A multiclass classification problem can often be reduced to a set of binary classifiers. Binary classifiers can be not only simpler, but also more accurate. Similarly to the CNN local receptive fields which ‘encourage’ a particular direction of learning at the ANN architecture level, pairwise receptive fields can be used, with C(C − 1)/2 pairs for C classes [44, 51]. The receptive field mask becomes

M = [ 1 1 0 0 · · · 0 0
      1 0 1 0 · · · 0 0
      · · ·
      1 0 0 0 · · · 1 0
      1 0 0 0 · · · 0 1
      0 1 1 0 · · · 0 0
      0 1 0 1 · · · 0 0
      · · ·             ],    (16)

so that the k-th row selects only the two inputs of the k-th class pair. The outputs of each pairwise classification function are as follows:

(y_ki, y_kj) = f_kij(x_i, x_j) = [ m_ki w_ki   m_kj w_kj ] [ x_i   x_j ]^T,    (17)

where k ∈ {1, 2, · · · , C(C − 1)/2}. The pairwise outputs are summed for each i-th or j-th index by the next ‘voter’ layer, with positive unit coefficients for the i-th class and negative unit coefficients for the j-th class, defined by the ‘voter’ mask

V = [  1  1 · · ·  1  1   0  0 · · ·  0   0 · · ·
      −1  0 · · ·  0  0   1  1 · · ·  1   0 · · ·
       0 −1 · · ·  0  0  −1  0 · · ·  0   1 · · ·
      · · ·                                      ].    (18)

For the transformation function of such a combined pairwise receptive field and voter layer, y = V(M ⊙ W)x, the back-propagated gradient of the learning parameters can be calculated as (where U is a unit matrix)

∇_W l = W ⊙ (x ⊙ U)^T ⊙ (V^T ∇_y l).    (19)
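A hedged MATLAB sketch of how the pairwise mask M of Eq. (16) and the voter matrix V of Eq. (18) can be generated for an arbitrary number of classes; the class count and variable names are assumptions for illustration:

  C = 4;                                   % number of classes (illustrative)
  pairs = nchoosek(1:C, 2);                % the C(C-1)/2 class pairs
  K = size(pairs, 1);
  M = zeros(K, C);                         % pairwise receptive-field mask, Eq. (16)
  V = zeros(C, K);                         % voter mask, Eq. (18)
  for k = 1:K
      i = pairs(k, 1);  j = pairs(k, 2);
      M(k, [i j]) = 1;                     % pair k connects only to inputs i and j
      V(i, k) =  1;                        % positive vote for class i
      V(j, k) = -1;                        % negative vote for class j
  end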

5 Implementations and Experiments

This section describes (1) the hardware and software platforms used for our experiments, (2) the benchmark data sets, and (3) the software packages used in the experiments. Finally, the section describes the obtained results.

5.1 Hardware and Software

The MATLAB Parallel Computing Toolbox provides the capability of parallel computation of Computer Vision Toolbox functions on multiple CPU cores, including the bagOfFeatures function used in the experiments. To utilise the parallel CPU computing capabilities of the Toolbox, a CPU with the maximal number of cores should be used. Note that the functions of the MATLAB Deep Learning Toolbox used in our experiments can be significantly slower on a CPU than on a GPU specialised in matrix computation operations.

To use a GPU efficiently, CPU characteristics such as the supported Peripheral Component Interconnect Express v.3.0 (PCIe) bus bandwidth (number of PCIe lanes multiplied by the lane transfer rate), CPU clock speed, the size of the supported random access memory (RAM), and the memory access bandwidth have to be considered. To fully support the capabilities of a single top-of-the-range consumer GPU in terms of video random access memory (VRAM) size and PCIe bus bandwidth, any basic CPU of the Intel Core i5 seventh-generation class, or a similar AMD CPU, with a 3+ GHz frequency, up to 64 GB of 2-channel RAM, and 16 PCIe lanes, as well as a compatible motherboard chip-set, would be enough. In this study, a budget Intel Core i5-7500 CPU and an ASUS TUF Z270 Mark 2 motherboard were used.


Running DL in MATLAB requires an NVIDIA GPU and the NVIDIA CUDA toolkit. The choice of GPU depends on the VRAM size, the number of pipelines or Compute Unified Device Architecture (CUDA) cores multiplied by the clock frequency, the memory bandwidth, and the integral texture fill rate metric. A consumer-line NVIDIA GPU with 11–12 GB of VRAM, together with 32 GB of system RAM, is enough to hold whole frames being sent back and forth to and from the GPU; this is a minimum for the less resource-expensive ML algorithms such as bagOfFeatures. While 64 GB or more of RAM (if the CPU supports it) is highly recommended, on contemporary operating systems it is also possible to create virtual memory larger than the physical RAM by mapping it to high-performance storage such as an SSD. For this study, 64 GB (4 × 16 GB) of Corsair Vengeance LPX DDR4 2666 MHz RAM and 2 × 500 GB Crucial NVMe PCIe M.2 SSDs were used.

GPUs are the biggest power consumers, drawing up to 250 W under load; CPUs follow, in the 65–250 W range, with the rest of the components consuming single-digit wattages. NVIDIA recommends at least a 500 W power supply for the whole system if at least one DL-grade GPU is used. Therefore, a 750 W EVGA power supply was chosen for this study, so that a second GPU could potentially be accommodated.

Experiments were run in MATLAB 2018, using the Statistics and Machine Learning, Image Processing, Computer Vision System, Deep Learning, and Parallel Computing Toolboxes. To enable access to the GPU, the 410.48 NVIDIA GPU driver, the CUDA 10.0 GPU programming abstraction toolbox, and the cuDNN 7.4.1 Deep Neural Network library were installed.

The benchmark facial image data sets demand a large amount of memory, especially if the experimental hardware configuration has less than 64 GB of RAM. The ‘Java Heap Size’ parameter in the ‘MATLAB—General—Java Heap Memory’ panel should be increased in case of OutOfMemoryError errors, and the ‘MATLAB array size limit’ check-box should be unset in the ‘MATLAB—General—Workspace’ panel. To create or increase virtual memory larger than the actual physical RAM, modern Linux distributions such as Ubuntu use a swap file instead of the old-fashioned swap partitions; a swap file of 256 GB was used in this study. To enable automatic parallel computation, the ‘Use Parallel’ check-box should be ticked in the ‘Computer Vision System Toolbox’ panel, as well as the ‘Automatically create parallel pool’ check-box in the ‘Parallel Computing Toolbox—Parallel Pool’ panel, unless the parallel pool is controlled from the code itself.

5.2 Benchmark Data Sets

The following facial benchmark data sets were used in the experiments with the ML and feature extraction techniques.

The Extended Yale B Face Database (abbreviated ‘Yale B’ in the tables below) contains grey-scale images of 29 individuals with 65 lighting and nine pose


variations, grouped in individual folders. Each individual is represented by 585 images of 640 × 480 pixels [11]. Some images are corrupted; after cleaning the corrupted files and folders, 22 person data sets were left, among them 6 with a reduced number of images (181, 314, 508, 513, 580, 580).

The AT&T Laboratories Cambridge ‘Database of Faces’ (formerly the ORL database of faces, abbreviated ‘AT&T’ in the tables below) contains ten grey-scale images per subject for 40 individuals, with different lighting, poses, and facial expressions including open or closed eyes, grouped in individual folders. Each individual is represented by ten grey-scale images of 92 × 112 pixels [39].

The California Institute of Technology ‘Caltech 101 Faces’ data (abbreviated ‘Caltech’) contain 450 colour images of 30 individuals (3 of them paintings) with variations in background, lighting, face positioning, facial expressions and facial details, and in the number of images per individual; the images are of 856 × 562 pixels [8]. The images were manually grouped into separate folders by individual, and folders containing only one subject were excluded from the study, leaving six folders with 5–6 images per individual and 19 folders with 19–29 images per individual.

The Bedfordshire University student image data (abbreviated ‘BEDS’) contain 1000 images of five individuals with variations in face positioning, facial expressions, and facial details. All images share the same background, are coloured, and have a 640 × 480 pixels dimension.

The University of Essex Face Recognition Data used in our study are abbreviated ‘Faces96’ and ‘Grimace’. The Faces96 data contain 3016 coloured images of 151 individuals, 20 images per individual, of 196 × 196 pixels. They depict people of various racial origins, mainly first-year undergraduate students, in front of a complicated background of glossy posters. The Grimace data contain 360 coloured cropped images of 18 individuals simulating various emotional expressions on a neutral background, of 180 × 200 pixels, as discussed in [47].

The Japanese Female Facial Expression Database (abbreviated ‘JAFFE’) contains 213 grey-scale images of seven facial expressions posed by ten Japanese female models; the images are of 256 × 256 pixels [29].

The Georgia Institute of Technology Face Database (abbreviated ‘Gatech’) contains 750 coloured images of 50 people. For each individual, there are 15 colour images of 640 × 480 pixels, captured in two different sessions to take into account variations in illumination conditions, facial expression, and appearance. In addition, the faces were captured at different scales and orientations, as discussed in [32].
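Since each benchmark is stored one folder per subject, loading and partitioning any of them follows the same pattern in MATLAB; the sketch below is a minimal illustration under that assumption (the folder name is a placeholder):

  imds = imageDatastore('YaleB/', ...
      'IncludeSubfolders', true, 'LabelSource', 'foldernames');  % folder name = subject label
  countEachLabel(imds)                                           % images available per subject
  [trainSet, testSet] = splitEachLabel(imds, 0.8, 'randomized'); % 80/20% training/test split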

5.3 Computational Experiments

This section describes the experiments conducted on the benchmark data sets using the MATLAB toolboxes Computer Vision, Deep Learning, Image Processing,


Parallel Computing, and Statistics and Machine Learning. The PC included an Intel Core i5 CPU, 2-channel 64 GB DDR4 RAM, and a GeForce GTX 1070 Ti GPU. The experiments were run for five ML methods on eight benchmark data sets. Tables 1, 2, 3, 4, and 5 show the median recognition accuracies and the computational times for each method used in these experiments within a threefold cross-validation. For the experiments with the BOF technique, the MATLAB bagOfFeatures and trainImageCategoryClassifier functions were used with default parameters.

Table 1 Training median accuracy of ML methods on the benchmark data sets (Part a)

Method      Caltech   BEDS     AT&T     Yale B
BOF         0.9644    1.0000   1.0000   0.9912
SURF        0.9749    1.0000   1.0000   0.9999
HOG         0.9764    1.0000   1.0000   0.9925
CNN         0.9981    1.0000   0.6906   0.9998
CNN+BOF     0.9956    1.0000   1.0000   1.0000

Table 2 Training median accuracy of ML methods on the benchmark data sets (Part b)

Method      Faces96   Grimace   JAFFE    Gatech
BOF         0.9988    1.0000    1.0000   0.9900
SURF        1.0000    0.9826    1.0000   1.0000
HOG         0.9996    1.0000    1.0000   0.9833
CNN         1.0000    1.0000    0.9128   1.0000
CNN+BOF     0.9996    1.0000    0.9660   1.0000

Table 3 Test median accuracy of ML methods on the benchmark data sets (Part a)

Method      Caltech   BEDS     AT&T     Yale B
BOF         0.6122    1.0000   0.9875   0.9879
SURF        0.5336    1.0000   0.8625   1.0000
HOG         0.5455    1.0000   0.9250   0.9921
CNN         0.9578    1.0000   0.6125   0.9997
CNN+BOF     0.9420    1.0000   0.9250   0.9996

Table 4 Test median accuracy of ML methods on the benchmark data sets (Part b)

Method      Faces96   Grimace   JAFFE    Gatech
BOF         0.9983    1.0000    1.0000   0.9600
SURF        0.9785    0.9306    1.0000   0.9933
HOG         0.9801    1.0000    1.0000   0.9467
CNN         0.9983    1.0000    0.9268   0.9933
CNN+BOF     0.9818    1.0000    0.9000   0.9933

Table 5 Computational time of ML methods on the benchmark data sets

Method      Time [s]
BOF         39,384
SURF        1075
HOG         14,107
CNN         8524
CNN+BOF     8994


The parameters of bagOfFeatures define feature extraction from a grid of points of interest with horizontal and vertical steps of 8 pixels by an upright SURF extractor with patch sizes of 32, 64, 96, and 128 pixels in scale space. The function keeps 80% of the strongest features from each category and clusters them with the k-means algorithm into a 500-dimensional feature space; an 80/20% training-to-test partitioning was used. The parameters of the trainImageCategoryClassifier function define a linear SVM as the classification method [21, 49]. Coloured images from the Caltech 101 Faces and BEDS data sets were converted into the grey-scale space accepted by the bagOfFeatures implementation using the rgb2gray function.

The BOF algorithm using the grid-extracted upright SURF features showed its lowest accuracy on the Caltech 101 Faces data set: 96.6% on the training and 61.2% on the test data. An accuracy of 99.0% was shown on the Gatech training and 96.0% on the test subsets. The experiments on the Extended Yale B data set showed accuracies of 99.1% and 98.8% on the training and test data, respectively. On the AT&T data set, the BOF algorithm showed 100% accuracy on the training and 98.8% on the test data. The Essex Faces96 experiments achieved 99.9% training and 99.8% test accuracies. On the BEDS students, Essex Grimace, and JAFFE data sets, the algorithm showed 100% performance.

For the SURF experiments, the same functions were used with different configuration parameters, specifying the use of the SURF detector and an orientation-aware SURF extractor. The BOF algorithm using SURF for detecting and extracting the oriented features showed its lowest accuracies on the Caltech 101 Faces data set: 97.5% on the training and 53.4% on the test data. On the AT&T data set, the SURF features provided 100% accuracy on the training and only 86.3% on the test data. The Essex Grimace experiments demonstrated 93.1% performance, the Faces96 experiments a better 97.9%, and the Gatech data set 99.3%. The training and test experiments on the Extended Yale B data set scored 99.9% and 100%, respectively. On the BEDS students and JAFFE data sets, the accuracy was 100% on both subsets.

For the experiments with the HOG algorithm, the same functions were set up with an additional custom HOG extractor wrapper, using an 8 × 8 pixel grid extraction matrix similar to the BOF experiments, shown in Listing 1.1. Specifically, the HOG parameters include a 2 × 2-cell block with the default 8 × 8-pixel cell configuration. The lowest accuracy was again observed when the HOG algorithm was tested on the Caltech 101 Faces data set, with a performance similar to the BOF technique: 97.8% on the training and 54.6% on the test data. On the AT&T data set, the HOG algorithm showed 100% performance on the training and 92.5% on the test subsets. On the Gatech data set the algorithm showed a lower performance of 94.7%. The training and test accuracies for Essex Faces96 were 99.9% and 98.0%, respectively. The experiments on the Extended Yale B data set scored 99.2% on both training and test data. On the BEDS students, Essex Grimace, and JAFFE data sets, the algorithm showed 100% accuracy on both subsets.
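A hedged MATLAB sketch of this BOF configuration (grid-based upright SURF, 500 visual words, 80% strongest features); the option values mirror the description above, while the data-set variables are assumed to have been prepared as in the loading sketch of Sect. 5.2:

  bag = bagOfFeatures(trainSet, ...
      'VocabularySize', 500, ...            % 500-dimensional visual-word space
      'StrongestFeatures', 0.8, ...         % keep 80% of the strongest features per category
      'PointSelection', 'Grid', ...
      'GridStep', [8 8], ...                % 8-pixel horizontal and vertical steps
      'BlockWidth', [32 64 96 128], ...     % SURF patch sizes in scale space
      'Upright', true);                     % upright (orientation-unaware) SURF
  classifier = trainImageCategoryClassifier(trainSet, bag);   % linear SVM by default
  confMat = evaluate(classifier, testSet);                    % per-category confusion matrix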


Listing 1.1 Custom feature extractor for building HOG features

function [features, featureMetrics, validpoints] = ...
    extractHOGFeaturesWrap(I)
  %% Create 8x8 grid
  [xl, yl] = size(I);
  gridStep = 8;
  gridX = 1:gridStep:xl;
  gridY = 1:gridStep:yl;
  [x, y] = meshgrid(gridX, gridY);
  points = [x(:), y(:)];
  %% Call standard HOG extractor
  [features, validpoints] = extractHOGFeatures(I, points);
  %% Set up dummy strength metrics
  [m, ~] = size(validpoints);
  featureMetrics = ones(m, 1);
return

The CNN experiments employ a pretrained network in a Transfer Learning (TL) setting. The AlexNet CNN has been trained on objects which are not included in the given data, and its settings can be modified for training on a different data set. Grey-scale images of the AT&T and Extended Yale B data sets were converted to the 3-channel colour space required by AlexNet using the gray2ind and ind2rgb functions, in order to create a pseudo-coloured representation of the grey-scale images.

Initially, the default options InitialLearningRate α = 0.01, Momentum γ = 0.9, and MiniBatchSize = 128 of the trainNetwork function of MATLAB's Deep Learning Toolbox were used. However, the performance decreased on the BEDS data set, so the InitialLearningRate was changed to α = 0.001; this, in turn, decreased the accuracy on the AT&T data set. As this data set has a significantly smaller number of images per subject, the MiniBatchSize was decreased to 64, which significantly improved the performance. Surprisingly, the CNN was unable to achieve a training accuracy higher than 69.1% on AT&T.

The CNN showed 95.8% performance on the Caltech 101 Faces data set, which was problematic for the BOF algorithm. However, the CNN performed worse than the BOF algorithm on the AT&T data, showing only 61.3% test performance, as could be expected after observing the 69.1% accuracy on the training subset. The CNN algorithm showed a performance of 92.7% on the JAFFE data set, for which the BOF-based algorithms had shown 100%. The BEDS and Essex Grimace benchmarks demonstrated 100% performance, and on the Extended Yale B, Essex Faces96, and Gatech data sets the accuracies were 99.9%, 99.8%, and 99.3%, respectively.
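A hedged sketch of this transfer-learning setup: AlexNet's final classification layers are replaced and the network is retrained with SGDM. The layer indices, options, and variable names are assumptions for illustration, not necessarily the exact configuration used here:

  net = alexnet;                                         % pretrained AlexNet
  numClasses = numel(categories(trainSet.Labels));
  layersTL = [net.Layers(1:end-3)                        % keep convolutional and fc6/fc7 layers
              fullyConnectedLayer(numClasses)            % new class-specific layer
              softmaxLayer
              classificationLayer];
  augTrain = augmentedImageDatastore([227 227 3], trainSet, ...
      'ColorPreprocessing', 'gray2rgb');                 % resize and force 3 channels
  opts = trainingOptions('sgdm', ...
      'InitialLearnRate', 0.001, ...
      'Momentum', 0.9, ...
      'MiniBatchSize', 64);
  tlNet = trainNetwork(augTrain, layersTL, opts);
  predicted = classify(tlNet, augmentedImageDatastore([227 227 3], testSet, ...
      'ColorPreprocessing', 'gray2rgb'));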


For the combined CNN+BOF experiments, the CNN models based on AlexNet and retrained separately for each of the benchmark data sets, as described above, were used to extract the features. Activation levels of neighbouring neurons of the fully connected layer fc7 were grouped into low-dimensional features and then fed into the standard MATLAB bagOfFeatures via a custom feature extractor function, see Listing 1.2. Experiments with different feature dimensions were run. An 8-dimensional CNN feature descriptor demonstrated evenly high accuracy rates across all data sets, including AT&T, which was difficult for the CNN, providing an accuracy of 92.5%. The Caltech 101 Faces data set scored 94.2%, BEDS 100%, Extended Yale B 99.9%, Essex Faces96 98.2%, Essex Grimace 100%, and Gatech 99.3%. The lowest performance was demonstrated on the JAFFE data set, at 90.0%. The computational time for the combined CNN and BOF technique was, as expected, slightly higher than for the individual CNN and for BOF with the SURF detector; however, it was significantly lower than for BOF with the HOG or dense-grid SURF features.

Listing 1.2 Custom feature extractor for building CNN features

function [features, featureMetrics, validpoints] = ...
        extractCNNFeaturesWrap(I)
    global gBofNet
    %% Extract activation levels out of AlexNet's fc7 layer
    featureLayer = 'fc7';
    features = activations(gBofNet, I, featureLayer, ...
        'ExecutionEnvironment', 'auto', ...
        'MiniBatchSize', 64, ...
        'OutputAs', 'columns');
    %% Build 8-dimensional feature descriptors out of
    %  the fc7 layer's 4096 neuron activations
    f_dim = 8;
    [m, ~] = size(features);
    features = reshape(features, m/f_dim, f_dim);
    %% Create dummy feature locations and strengths
    [m, ~] = size(features);
    validpoints = zeros(m, 2);
    featureMetrics = ones(m, 1);
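For context, a wrapper such as the one above (or the HOG wrapper of Listing 1.1) is plugged into bagOfFeatures through its CustomExtractor option. The following short sketch is an assumption about the surrounding script rather than the authors' exact code; imdsTrain, imdsTest, and netTL are hypothetical variables.

% Hedged usage sketch: wiring a custom extractor into bagOfFeatures
global gBofNet
gBofNet = netTL;                                   % retrained AlexNet from the previous experiment (assumed)
bag = bagOfFeatures(imdsTrain, 'CustomExtractor', @extractCNNFeaturesWrap);
classifier = trainImageCategoryClassifier(imdsTrain, bag);
confMat = evaluate(classifier, imdsTest);          % per-class confusion matrix on the test set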

As the MATLAB Deep Learning Toolbox is limited in the network architectures it supports, the combined pairwise learning and voter layer, see Listings 1.3 and 1.4, was added after the last fully connected layer of the retrained AlexNet which was used in the previous CNN experiment. The mandatory softmax layer, followed by a sum-of-squared-errors output layer, was also added. The default options InitialLearningRate α = 0.01, Momentum γ = 0.9, and MiniBatchSize = 128 of the trainNetwork function of the Deep Learning Toolbox were used.
As the learning was slow on the BEDS data set, the minimal number of epochs was increased to 600. To capture the variability in the facial images, a comparative experiment was run with the same parameters for the CNN and pairwise methods on the Caltech 101 Faces, BEDS, AT&T, Essex Faces96, JAFFE, and Gatech benchmarks. Both the CNN and the pairwise method have achieved 100% performance on the Essex Faces96 and JAFFE data sets, 94.3% on the Caltech 101 Faces, 98.8% on the AT&T, and 99.3% on the Gatech data sets. On the BEDS data set the pairwise CNN was successfully trained with the default parameters, achieving 100% performance (Table 6).

Listing 1.3 Custom pairwise and voter learning layer, forward propagation

function Y = predict(layer, X)
    [h, w, c, n] = size(X);
    layer.W = layer.M .* layer.W;
    layer.NW = layer.N * layer.W;
    if class(X) == "gpuArray"
        YR = gpuArray(layer.NW) * squeeze(X);
    else
        YR = layer.NW * squeeze(X);
    end
    Y = reshape(YR, [h, w, c, n]);
end

Listing 1.4 Custom pairwise and voter learning layer, gradients back-propagation

function [dLdX, dLdW] = ...
        backward(layer, X, Z, dLdZ, memory)
    [h, w, c, n] = size(X);
    dLdZR = reshape(dLdZ, [], n);   % squeeze(dLdZ);
    layer.W = layer.M .* layer.W;
    layer.NW = layer.N * layer.W;
    % chain dLdY = J(dz/dy)' * dLdZ
    if class(X) == "gpuArray"
        dLdX = reshape(gpuArray(layer.NW') * dLdZR, ...
            [h, w, c, n]);
    else
        dLdX = reshape(layer.NW' * dLdZR, ...
            [h, w, c, n]);
    end
    % chain dLdW = J(dz/dW)' * dLdZ
    [cw, nw] = size(layer.W);
    XR = squeeze(X);
    if class(X) == "gpuArray"
        dLdW = gpuArray(layer.W) .* (sum(XR, 2)/n .* ...
            ones([cw, nw], 'like', X)')' .* ...
            (layer.N' * sum(dLdZR, 2)/n);
    else
        dLdW = layer.W .* (sum(XR, 2)/n .* ...
            ones([cw, nw], 'like', X)')' .* ...
            (layer.N' * sum(dLdZR, 2)/n);
    end
end

Table 6 Median performance of multiclass and pairwise CNN on the benchmarks

Method   Caltech  BEDS    AT&T    Faces96  JAFFE   Gatech
CNN      0.9425   0.0000  0.9875  1.0000   1.0000  0.9933
PW CNN   0.9425   1.0000  0.9875  1.0000   1.0000  0.9933
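Listings 1.3 and 1.4 show only the predict and backward methods; in the MATLAB custom layer API they would sit inside a class derived from nnet.layer.Layer. The skeleton below is a hedged reconstruction: the roles of the properties (M as a fixed pairwise connection mask, N as the voter matrix that aggregates pairwise outputs into class scores, W as the learnable pairwise weights, so that the forward pass effectively computes N*(M.*W)*X) are inferred from the listings, and the constructor arguments are assumptions.

% Hedged skeleton of the enclosing custom layer class (not the authors' exact code)
classdef PairwiseVoterLayer < nnet.layer.Layer
    properties
        M    % fixed mask selecting the pairwise connections (inferred role)
        N    % fixed voter matrix aggregating pairwise outputs into class scores
        NW   % cached product N*(M.*W), recomputed in predict/backward
    end
    properties (Learnable)
        W    % pairwise weights, updated via dLdW from Listing 1.4
    end
    methods
        function layer = PairwiseVoterLayer(M, N, W0, name)
            layer.Name = name;
            layer.M = M;
            layer.N = N;
            layer.W = W0;
        end
        % predict(layer, X) as in Listing 1.3
        % backward(layer, X, Z, dLdZ, memory) as in Listing 1.4
    end
end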

6 Discussion

In our experiments the Caltech 101 Faces data set was challenging for all BOF techniques, regardless of the type and density of the interest point detection and feature extraction. Similarly to the AT&T data set, the Caltech 101 data set has shown a weak correlation between the number of features and the recognition accuracy (Tables 3, 4, 7 and 8). For these benchmarks the performance has been improved by ca 10% at a significant computational cost. However, the BEDS and Extended Yale B data sets, which have a high number of images per subject, have shown no accuracy improvement with the number of feature descriptors.

Table 7 Number of the selected features on the benchmarks (Part a)

Method  Caltech     BEDS        AT&T     Yale B
BOF     2,652,150   61,363,200  172,040  48,998,400
SURF    25,725      490,780     2680     664,972
HOG     426,325     11,125,275  28,160   8,883,512

Table 8 Number of the selected features on the benchmarks (Part b)

Method  Faces96    Grimace   JAFFE    GT
BOF     4,530,000  529,920   524,290  9,216,000
SURF    83,956     396       9620     170,750
HOG     958,548    106,452   123,010  1,670,900

Fig. 1 Examples of Caltech 101 face images mismatched by BOF method using SURF features

Table 9 Median performance of SURF method with feature vocabulary and 80% strongest features on the benchmarks

Vocabulary  Caltech  BEDS    AT&T    Yale B
1000        0.6680   1.0000  0.9000  1.0000
2000        0.7013   1.0000  0.9250  1.0000
4000        0.5967   1.0000  0.9375  1.0000

Having analysed the mismatched images, it became clear that a number of them had a rich and highly variable background, while only two facial features were persistently matched, which apparently is not enough to correctly identify a person, as shown in Fig. 1. The majority of the mismatched images show subjects with visually similar hairstyles and face types, who are likely relatives. To improve the recognition accuracy with the BOF techniques, we could increase the size and quality of the BOF visual vocabulary. To test this hypothesis, new computational experiments with a variable vocabulary size (Table 9) and percentage of retained strongest features (Table 10) were run for a BOF technique employing the SURF detectors and descriptors. The experiments have shown that optimal values of the vocabulary size and of the percentage of retained strongest features exist which differ from the default MATLAB bagOfFeatures parameters and are individual for each data set. Finding the optimal configuration parameters can increase the 'quality' of the feature vocabulary by ca 20% without a significant increase of the computational time on the 'difficult' Caltech 101 data set.
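A sketch of the vocabulary-size sweep behind Table 9 might look as follows; the datastore names and the evaluation loop are assumptions, and the StrongestFeatures value mirrors the 80% setting quoted in the caption of Table 9.

% Hedged sketch of the vocabulary-size experiment (Table 9); imdsTrain/imdsTest are assumed
vocabSizes = [1000 2000 4000];
acc = zeros(size(vocabSizes));
for k = 1:numel(vocabSizes)
    bag = bagOfFeatures(imdsTrain, ...
        'VocabularySize', vocabSizes(k), ...
        'StrongestFeatures', 0.8);                 % keep 80% of the strongest features
    clf = trainImageCategoryClassifier(imdsTrain, bag);
    confMat = evaluate(clf, imdsTest);             % per-class confusion matrix
    acc(k) = mean(diag(confMat));                  % average per-class accuracy
end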

For the AlexNet CNN the Caltech 101 Faces data set did not cause a problem. However, the AT&T benchmark, with its smaller number of images per subject and absent background, is more difficult to recognise. Nevertheless, tuning the InitialLearningRate and MiniBatchSize parameters of the learning algorithm has allowed us to improve the recognition accuracy. The improvement appears when the benchmark images differ significantly in terms of the numbers of subjects and images per subject. The optimal parameters of the CNN technique were different for each such data set, and so must be individually tailored for each data set (Tables 11 and 12). When a specific feature set for a benchmark is not feasible, combining the CNN and 'shallow' BOF-style methods can be useful, especially as the computational time is comparable with that of the pretrained CNN and of the BOF employing the SURF feature detectors and descriptors. In our experiments, an ensemble of the AlexNet CNN and the BOF employing the SURF techniques was capable of correcting each other's poor performance on the difficult Caltech 101 Faces and AT&T benchmarks (Tables 13 and 14). The high performance of the CNN achieved in our experiments has inspired us to run new experiments with CNN feature extractors feeding the BOF model. The performance of this BOF model, similar to that of the CNN, shows that a non-optimal choice of the feature extractors can affect the performance on the Caltech 101 Faces benchmark. The experiments with the CNN feature dimensions have demonstrated that there exists an optimal dimensionality which is unique for each tested data set. It has also been observed that some combinations of the training parameters and network structures, such as pairwise networks, have improved the performance.

Table 10 Median performance of SURF method with strongest features selected and 3000 vocabulary size on the benchmarks

Strongest [%]  Caltech  BEDS    AT&T    Yale B
30             0.5847   1.0000  0.8875  1.0000
50             0.7113   1.0000  0.8625  1.0000
70             0.6247   1.0000  0.9625  1.0000
90             0.6387   1.0000  0.9250  1.0000

Table 11 Mean performance of AlexNet CNN for different MiniBatch sizes and InitialLearningRate α = 0.01 parameters on the BEDS and AT&T benchmarks

MiniBatch  BEDS, α = 0.01  AT&T, α = 0.01
8          0               0.7875
16         0               1.0000
32         0.2000          0.9875
64         0.2000          0.9750
128        0               0.7125

Table 12 Mean performance of AlexNet CNN for different MiniBatch sizes and InitialLearningRate α = 0.001 parameters on the BEDS and AT&T benchmarks

MiniBatch  BEDS, α = 0.001  AT&T, α = 0.001
8          1.0000           0.9875
16         1.0000           0.9875
32         1.0000           0.9250
64         1.0000           0.5500
128        1.0000           0.1500

Table 13 Performance of BOF method with CNN extracted features of various dimensions on the benchmarks (Part a)

Dimensions  Caltech  BEDS    AT&T    Yale B
2           0.1780   0.9830  0.7750  0.9839
4           0.6213   0.9980  0.8875  0.9980
8           0.9420   1.0000  0.9250  0.9996
16          0.8620   1.0000  0.7000  0.9996
32          0.9720   1.0000  0.6125  0.9988
64          0.9507   1.0000  0.5500  0.9973

Table 14 Performance of BOF method with CNN extracted features of various dimensions on the benchmarks (Part b)

Dimensions  Faces96  Grimace  JAFFE   GT
2           0.6175   0.9722   0.9500  0.7267
4           0.9354   1.0000   0.9800  0.9533
8           0.9818   1.0000   0.9000  0.9933
16          0.9917   1.0000   0.8750  1.0000
32          0.9967   1.0000   0.8500  0.9933
64          0.9884   1.0000   0.9500  0.9733

7 Conclusions and Further Work

Face recognition algorithms still have problems when lighting conditions, head rotation, view angles, and emotions vary. The large number of subjects which are required to be recognised often makes the class boundaries difficult to learn. Deep Neural Networks have provided efficient solutions, although their computational efficiency has to be improved because of the massive computations per iteration of the error gradient algorithms. In practice, the parameters of the gradient algorithms and the neural network structures have to be explored in multiple experiments in order to maximise the performance. The pairwise network structures explored in our experiments have improved the performance because such networks require only a small set of parameters to be optimised in experiments. The main findings of our work are presented as a tutorial developed on popular facial benchmark data sets. These findings are as follows. The use of dense grids for extracting a high number of features for BOF algorithms marginally increases the accuracy of face recognition. The greatest improvement has been achieved on the benchmark with a low number of images per subject.


Optimisation of the parameters of the feature vocabularies has allowed us to generate a high-dimensional vocabulary and achieve efficient filtering of features. The optimised parameters found in our experiments have improved the recognition accuracy without significant computational expenses. Experiments with the AlexNet CNN have demonstrated that the face recognition accuracy on different benchmarks depends on the training parameters, whose optimal values are unique for each benchmark. The 'difficult' benchmarks for the BOF techniques and for the AlexNet CNNs were different. We attempted to overcome these problems and have considered the use of the BOF technique with CNN extracted features. This has allowed us to improve the performance while maintaining comparable computational expenses. The new findings allow us to conclude that when shallow ML methods are combined with suitable feature extraction techniques, they are competitive with the DL solutions while not requiring massive computations. The conducted study inspires us to further explore the advantages of combining conventional ML and the explosively developing DL methods for face recognition. We believe that the results presented in this study could be further improved by optimising the network configurations and learning parameters. An obvious way to achieve this is to combine higher-level and conventional feature extraction methods. To improve robustness, face recognition algorithms can be explored on benchmarks with new challenges, which can include face occlusions, disruptions, camouflage, etc.


Deep Learning Models for Face Recognition: A Comparative Analysis

Arindam Chaudhuri

1 Introduction

Face recognition has been one of the most challenging problems in biometrics [1–6]. It identifies and verifies human faces based on non-intrusive, natural, and unique features, and has been actively studied by computer vision and pattern recognition (CVPR) researchers. It finds intensive applications in military, finance, security, anti-terrorism, and other domains. Face recognition does not require cooperation from the user; the acquisition process needs no contact and offers good concealment. It is a multi-dimensional detection and recognition task that requires the ability to recognize and identify faces under the many appearance variations a face might exhibit. Generally, the face is considered a 3D object coupled with various sources of light [7] along with other background data, and its appearance varies significantly when it is projected onto a 2D image [8]. Thus, the task is always to develop robust face recognition techniques which are able to perform non-contrived recognition with respect to these inherent variations. The recognition process should also tolerate the variance present in faces. The problem becomes more challenging when deformation effects arise across the innumerable faces available, considering identity, race, or genetics as well as other intra-personal variations. The recognition process should also address image acquisition issues. Finally, the recognition system output is expected to reach a fair level of accuracy while remaining efficient in execution time and space. The conventional pipeline in face recognition consists of four stages, as shown in Fig. 1 [3].


Fig. 1 The conventional pipeline in face recognition

Technically, face recognition approaches are categorized into three groups [7–9]. The first group is 2D face recognition. 2D techniques form the initial version of face recognition, dating back to the late 1990s and early 2000s. Here algorithms were considered with respect to geometric features, subspaces, elastic matching, and neural networks, and the face was identified from the brightness of the image. Illumination strongly influences such algorithms and hindered further progress of 2D face recognition. 3D face recognition forms the second group and has gained considerable attention in the recent past; some of the significant works can be found in [9]. The advantage which 3D has over 2D is that 3D preserves the original geometric information of the human face. 2D faces are influenced by lighting, posture, and occlusion factors, which decrease the recognition ability of a face recognition system, whereas 3D face recognition algorithms are not affected by these issues to the same extent; they handle variable illumination and posture in order to address these aspects. The 3D image captures the surface geometric features of the human face, and no information is lost because of posture changes. However, the 3D image acquisition process also has to be considered; there is no information regarding brightness in 3D images, and it is the acquisition that is affected by illumination changes. These considerations have led to the growth of face recognition on 3D images; other significant works are available in [9]. 3D recognition techniques have also been applied in other fields. The third face recognition group comprises 2D + 3D dual-mode techniques, which have reached high performance levels. Here 2D image recognition technology is combined with 3D face attributes to reach good recognition accuracy. 2D + 3D face recognition has grown considerably during the past few years [9]. There has been a tremendous growth of deep learning with the emergence of large face datasets [1]. The powerful data learning capability of deep learning has advanced 2D face recognition considerably. Deep learning has been an active machine learning research area; it builds deep neural networks to simulate the human brain and then interprets and analyzes multi-modal data [3]. The effectiveness of traditional machine learning is dependent on the performance of the feature representation, and the machine learning stage optimizes the learning weights and produces optimal learning outputs, whereas deep techniques complete the data representation and feature extraction work automatically. The most important deep learning network, which has been used in almost all face recognition applications with good performance, is the convolutional neural network (CNN) [10]. Its deep architecture allows extracting discriminating feature representations at several abstraction levels, and a CNN learns rich features from images easily. When a deep CNN is trained, the learning process involves a huge number of network parameters and a good number of labeled datasets. Generally, a CNN is not trained from scratch: pre-training


of the CNN is performed on a large dataset, and the model weights are used either for initialization or as a feature extractor for the task concerned. Then the test dataset is used to tune the network. A fine-tuned CNN is often unable to reach good performance when the training and test datasets differ in illumination, expression, viewpoints, etc., because the fine-tuning process is determined by the pre-trained CNN. When there are notable variations between the target and source applications, the layers of the pre-trained model need to be further fine-tuned, but too much fine tuning may result in overfitting. The initial data distribution of the pre-trained CNN cannot be changed when significant variations exist between the training and test datasets, so the CNN struggles to adapt. Training time is another problem: any change in a layer may change the distributions in other layers, so the network needs to adapt itself continuously. Hence, careful parameter setting with small learning rates as well as normalization is required to achieve good results from the network.

The initial work in face recognition is attributed to the historical eigenface approach [3] in the 1990s. Some of the significant paradigms of face recognition over the years are highlighted in [1]. Four major technical streams can be distinguished, viz. holistic learning, local handcrafted features, shallow learning, and deep learning. In the holistic approach a low-dimensional representation is derived from assumptions such as a linear subspace, a manifold, or a sparse representation. These methods, however, failed to address uncontrolled changes in faces which deviate from their a priori assumptions. This problem later gave birth to local feature-based face recognition, which constituted the local handcraft approach; techniques worth mentioning here include Gabor features and their multilevel and high-dimensional extensions. Owing to the invariant properties of local filtering, they achieved considerable performance levels, but they lacked distinctiveness and compactness. At the beginning of this decade, face recognition researchers started using learning-based local descriptors, which was the starting point of shallow learning methods for the face recognition problem. Here the distinctiveness problem was addressed through learned local filters, and an encoding codebook was used for better compactness. However, the robustness problem persisted under complex non-linear variations in facial appearance. These methods addressed face recognition using filtering responses or histograms of feature codes, and the accuracy was gradually improved through preprocessing, local descriptors, feature transformations, etc. The results achieved could improve the accuracy on the LFW benchmark only to around 95%. The shallow learning methods were unable to extract identity features that were stable against unconstrained facial variations. As a result, face recognition systems of that time provided unsatisfactory performance with innumerable false alarms in real-world applications.

The year 2012 was the game-changing year in face recognition research, when AlexNet won the ImageNet competition by an appreciable margin through deep learning [11]. The CNN used there provided feature extraction and transformation capabilities through a cascade of multiple processing layers. The learning was performed through several representation levels which correspond to various abstractions. These abstractions constitute a concept hierarchy with good invariance towards face pose, lighting, expression changes, etc. The initial layer of the network learns features similar to Gabor features; the next layers learn more complex texture features, then simple facial parts such as a high-bridged nose or big eyes, and then increasingly complex structures. In the final layer the network's output can present certain facial attributes and respond strongly to clear abstract concepts like a smile, a roar, or the color of the eyes. In other words, the initial layers of a CNN learn low-level features of the Gabor or SIFT kind, while the next layers learn higher-level abstractions, and the combination of higher-level abstractions represents the identity of the face with a fair amount of stability. In 2014 DeepFace [12] achieved an accuracy on the LFW benchmark dataset (http://vis-www.cs.umass.edu/lfw/part_labels/) approaching human performance under unconstrained conditions. As a consequence, the focus of research shifted towards deep learning-based methods, and the accuracy went up to nearly 100% in a short time span. The face recognition research landscape was revamped by deep learning techniques in all aspects, viz. algorithms, datasets, evaluation protocols, etc.

This chapter presents a comparative investigation of deep learning face recognition models [3] considering both algorithm and data aspects. The discussion starts with a multi-stage strategy for collecting large face datasets containing thousands of sample face images for thousands of unique identities. These datasets are adopted from knowledge sources available on the web and are used for model training and evaluation in this research. Then a categorization is made of the various face recognition methods, followed by an exploration of face identification and verification for several deep network architectures with respect to face alignment, metric learning, and loss functions. Several face recognition models have used variants of deep architectures; these are assessed considering the relevance of the modeling choices. The experimental results for the different deep learning face recognition methods are presented along with their comparative analysis against benchmark datasets. The potential strengths and weaknesses of deep face recognition methods are also highlighted. Next, several miscellaneous issues and open-ended research questions [3] are discussed. Finally, future research directions are given.

1.1 Motivation Factor

The prime objective of considering deep learning models for face recognition here is to investigate the practical and empirical aspects of the deep face recognition architectures which have evolved over the years. The study of face recognition has become a hot topic due to the continuous evolution of face recognition systems by different biometric research groups aiming at higher accuracy levels. Several important deep face recognition models which have made a mark in the past decade are investigated here. The comparative analysis has been done on face recognition datasets which have been specifically prepared, adhering to all operational constraints, alongside benchmark datasets.
This work revolves around the following questions: (a) How do deep face recognition architectures perform on unknown datasets? (b) What is the performance of deep face recognition algorithms with respect to benchmarks? (c) What are the open research questions in deep face recognition systems?

1.2 Research Agenda

This research work makes the following contributions: (a) a literature review of existing deep face recognition systems; (b) a presentation of the significant deep face recognition systems which have come up during the last decade, with a comparative performance analysis of these architectures on unknown datasets as well as benchmark datasets; and (c) a comprehensive discussion of several recognition issues and other open research questions on the topic. The chapter is structured in the following manner. The related work on face recognition is placed in Sect. 2. Section 3 provides a brief overview of face recognition. Section 4 highlights the face recognition datasets used in this research work. In Sect. 5 deep learning models for face recognition are presented. Section 6 discusses the experimental results. Section 7 contains further recognition and open-ended discussions. Finally, Sect. 8 gives the concluding remarks.

2 Face Recognition: Related Work

Here a literature review of research on face recognition using deep learning networks is presented. The review mainly contains research pointers from the past decade; interested readers may refer to [3] for a more elaborate list.

2.1 Current Literature

The usage of deep learning [13] techniques for face recognition is mainly attributable to the emergence of large-scale face datasets. Deep networks are blessed with superior data learning, which has made them very popular for the face recognition problem, and they take care of data representation and feature extraction [14, 15] automatically.
One of the most famous deep networks used in face recognition, with appreciable performance, is the CNN [16, 17]. Its deep architecture extracts discriminating feature representations at multiple abstraction levels, and it learns rich image features with considerable ease. Training a CNN involves learning a few million network parameters and requires a sufficient amount of labeled pre-training data [18]. A CNN is generally pre-trained on a huge dataset; the model weights are then used for feature extraction for the required task, and the model is fine-tuned through the test set [19, 20]. When there is a large difference in illumination, expression, viewpoints, etc. between the training and test datasets, even a finely tuned CNN struggles to reach appreciable performance, because the fine tuning depends on the pre-training of the CNN model, which hinders the recognition performance on the present task. Good fine tuning requires processing that copes with the appreciable variance across applications, and when fine tuning of a pre-trained CNN is done on limited data [21], over-fitting can occur. The classification and recognition performance of a CNN depends on the data distribution [22, 23]: when the difference between the training and test sets is significant, it is difficult to change the initial data distribution of the pre-trained CNN, which makes the model hard to adapt even when it is fine-tuned. The large number of network parameters also makes training time an issue, because a change of parameters in any layer changes the data distributions in all other layers, requiring the network to adapt itself continuously to the new distributions. This leads to careful tuning of parameters and training with a small learning rate. The non-linear saturation of the activation operation makes things more difficult as well [24, 25]. Data normalization, which enhances network convergence [26], was coined by LeCun and was later developed to resolve internal covariate shift in [27]. Another face recognition method, the colored two-dimensional principal component analysis convolutional neural network (C2D-CNN), was proposed by [28]; some other works are available in [29–31]. CNN networks use the RGB color channels for feature extraction and remove the inner correlations between color channels, and a color channel fusion method [32] effectively improves color face recognition performance. Face recognition datasets contain highly discriminative features, and deep learning networks are capable of learning such features: information is effectively propagated through their complex hierarchical structure, and features are learned automatically when mapping low-level towards high-level representations. On this basis the CNN is trained to extract a high-dimensional feature vector [6, 33, 34], followed by classification using joint Bayesian [33] or another metric learning method [35]. In [36] a CNN has been supervised through a novel center loss signal along with the softmax loss, achieving good accuracy on three important face recognition benchmarks. In [37] a patch strategy was used along with a CNN to learn face features in order to improve the face representation performance.


A CNN was used for feature extraction by Facebook with appreciable results [12], which led to the development of DeepFace. DeepFace is a significant deep learning network which comprises a nine-layer deep CNN model having two convolutional layers and over 120 million parameters, trained with four million facial images adopted from 4000+ identities. By aligning the images with respect to a 3D model and using a CNN ensemble it achieved 97.35% and 91.40% accuracy on the LFW and YTF datasets, respectively. Deep hidden IDentity features, commonly known as DeepID [34], is another significant deep learning network. It comprises a nine-layer network with four convolutional layers. It learns its weights on face identification; features are extracted from the outputs of the last hidden layer and then generalize to face verification. Here the faces are aligned through a similarity transformation with respect to the two eye centers and the mouth corners. The training is done using the CelebFaces (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) dataset, and a 97.45% accuracy is reached on the LFW dataset. Another significant deep network is FaceNet [38]. It is a deep CNN with its foundations in GoogLeNet (http://deeplearning.net/tag/googlenet/). The network is trained on a face dataset (http://vis-www.cs.umass.edu/lfw/part_labels/) comprising 100–200 million images of about eight million identities. The algorithm uses triplets: in order to measure the similarity of faces, it learns to map face images into a compact euclidean space. FaceNet used the inception model to learn the triplet embedding for face verification and reached 99.63% and 95.12% accuracy on the LFW and YTF datasets, respectively. Other significant deep learning architectures which have been widely used for face recognition are GoogleNet (https://leonardoaraujosantos.gitbooks.io/artificial-intelligence/content/googlenet.html) and ResNet (https://en.m.wikipedia.org/wiki/ResNet). In [39] the impact of several image quality covariates on the performance of CNN models was studied: the performance of four deep neural network models on an image classification task was evaluated from various aspects, and noise and blur were found to be the most hindering factors. The work in [40] gives another view on covariate analyses for deep models, comparing and evaluating various CNN architectures from a visual psychophysics perspective. For the object recognition task, 3D models of ImageNet object classes were rendered in order to obtain canonical views, and the networks' performance was determined when objects are viewed from varying angles and distances, as well as when images are subjected to deformations such as random linear occlusion, Gaussian blur, and brightness changes. In [41] CNN performance was evaluated under image transformations and deformations; various image transformations were combined, which allows CNNs trained through supervised learning to adopt more general image representations. In [42] the effects of image covariates such as rotation, translation, and scaling on the interpretation and internal representations produced by deep CNNs trained on ImageNet object classification were studied. The invariance of the CNN to these covariates increased appreciably with network depth, and the discriminative power of the network features for transfer learning also increased with depth. In [43] an evaluation was made of how equivariance, invariance, and equivalence are preserved under image transformations by deep CNNs; representations based on deep CNNs proved superior to other representations in terms of both invariance and equivariance, as well as transformations with respect to the training objectives.

3 Face Recognition: Overview

In this section we present a brief overview of face recognition in terms of basic concepts and the important components of face recognition. This section is mainly for first-time readers; those who are already familiar with the basic aspects of face recognition may skip it.

3.1 Basic Concepts and Terminologies

A face recognition system basically consists of three modules [44]: (a) face detection, (b) landmark detection, and (c) face recognition itself. The face detection module localizes the faces in images or videos. The landmark detection module aligns the faces to normalized canonical coordinates. The recognition module then operates on the aligned face images. As a rule of thumb, the face recognition process is divided into verification and identification of faces. Generally, a set of known subjects is initially enrolled in the system (the gallery), and a new subject (the probe) is presented during testing. Face verification computes a one-to-one similarity between the gallery and the probe in order to determine whether the two images belong to the same subject. Face identification computes one-to-many similarities in order to determine the specific identity of the probe face. Closed-set identification refers to probes which appear among the gallery identities, whereas open-set identification refers to probes which are not present in the gallery.

3.2 Face Recognition Components

When a face is input to a face recognition module, it is first necessary to know whether the face is live or spoofed; this face anti-spoofing step needs to be performed to avoid different kinds of attacks. It is followed by the face recognition process proper. The face recognition module comprises (a) face processing, (b) deep feature extraction, and (c) face matching. These three components are briefly discussed in this section. The face recognition process can be represented as follows:


face_match( feat_extrac( data_proc_i( face_image_i ) ), feat_extrac( data_proc_j( face_image_j ) ) )

In the above process, the two face images are denoted face_image_i and face_image_j. The data_proc stage handles several intra-personal variations such as poses, illuminations, expressions, and occlusions. The encoding of identity information is performed by feat_extrac, and face_match is the face matching algorithm which calculates the similarity scores.

3.2.1 Face Processing

Deep learning techniques have been used for face processing because of their powerful representation ability. Conditions such as poses, illuminations, expressions, and occlusions greatly impact the performance of face recognition methods. In automatic face recognition applications, pose variation is considered a serious challenge, and a fair amount of effort has been devoted in this direction; the other variations are tackled through similar methods. On this basis, face processing methods are generally grouped into one-to-many augmentation and many-to-one normalization. In one-to-many augmentation, many patches with pose variance are generated from a single image, so that the deep networks can learn pose-invariant representations. In many-to-one normalization, the canonical view of a face image is recovered from one or many images of non-frontal views, so that face recognition can be performed as under controlled conditions.

3.2.2 Deep Feature Extraction

Deep feature extraction involves two aspects, viz. the network architecture and the loss function. The network architectures are generally divided into backbone networks and multiple networks. Several CNN based architectures have evolved after the success of the ImageNet (http://www.image-net.org/) challenge; some worth mentioning are AlexNet (https://en.m.wikipedia.org/wiki/AlexNet), VGGNet (https://medium.com/coinmonks/paper-review-of-vggnet1st-runner-up-of-ilsvlc-2014-image-classification-d02355543a1), GoogleNet (https://leonardoaraujosantos.gitbooks.io/artificial-intelligence/content/googlenet.html), ResNet (https://en.m.wikipedia.org/wiki/ResNet), and SENet [45]. These architectures serve as baseline models in face recognition tasks, and some evolving face recognition architectures aim at improving their efficiency. While the basic blocks are formed by backbone networks, multiple networks are often trained with face recognition techniques considering multiple inputs or tasks, and there is a considerable performance uplift when the results from multiple networks are accumulated. The next aspect is the loss function. The softmax loss is the most widely used; it encourages feature separability in object recognition.
However, when intra-class variations are greater than inter-class differences, as in face recognition tasks, the softmax loss is not that effective. Researchers therefore generally look towards creating novel loss functions such that the features become more separable as well as more discriminative. Commonly used loss functions include euclidean distance-based losses, cosine margin-based losses, and variations of the softmax loss.
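For concreteness, with N training faces, deep features x_i, identity labels y_i, and C classes, the standard softmax loss and one widely used cosine-margin variant (a CosFace-style formulation, given here purely as an illustrative member of the family mentioned above) can be written in LaTeX notation as

L_{\text{softmax}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{\top}x_i+b_{y_i}}}{\sum_{j=1}^{C}e^{W_j^{\top}x_i+b_j}},
\qquad
L_{\text{cos}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}},

where \theta_j is the angle between the normalized feature x_i and the normalized class weight W_j, s is a scale factor, and m is the additive cosine margin.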

3.2.3 Face Matching

Once the deep networks are properly trained with an appropriate loss function, test images are passed through the networks to obtain a deep feature representation. When deep features have been extracted, the similarity between two features is calculated through the cosine distance, and the identification and verification tasks are then performed through nearest-neighbor search and threshold comparison, respectively. The deep features may additionally be post-processed, and finally face matching can be performed efficiently and accurately using metric learning, sparse representation-based classifiers, etc. The entire gamut of the different face recognition components with respect to training and testing is highlighted in Fig. 2. Important aspects such as the loss functions, face matching, and face processing are discussed in Sect. 5 for the deep learning models considered in this investigation; further details are available in [3].
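A minimal sketch of this matching step is given below, assuming deep feature vectors have already been extracted; the variable names and the threshold value are illustrative assumptions rather than part of any particular system.

% Hedged sketch of cosine-similarity face matching on deep features
% galleryF: d-by-G matrix of gallery features, probeF: d-by-1 probe feature (assumed)
galleryN = galleryF ./ vecnorm(galleryF);          % L2-normalize each gallery column
probeN   = probeF / norm(probeF);
scores   = galleryN' * probeN;                     % cosine similarities, G-by-1

% Verification: one-to-one comparison against a claimed gallery identity g (assumed index)
tau = 0.5;                                         % illustrative acceptance threshold
accepted = scores(g) >= tau;

% Closed-set identification: one-to-many nearest neighbor over the gallery
[bestScore, bestId] = max(scores);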

Fig. 2 Face recognition with its different components with respect to training and testing phases


4 Face Recognition Datasets

In this section the face recognition datasets used for the experiments with the deep learning networks are discussed. There is a wide array of face recognition datasets readily available (http://www.face-rec.org/databases/) for research and exploration. However, in this research work, building on the approach of [3], we created an appreciably large face dataset for the experiments with the deep learning models. The datasets were specifically created for these experiments by our team, with face data collected from various sources available across the data repositories. The experimental dataset contains over 400 million face images [3], adopted from different sources such as LFW, Facebook, Twitter, Google, CelebFaces, etc. The process consists of multiple steps towards the collection of a huge dataset with thousands of images for thousands of unique identities. The different steps of this process, with their statistics, are shown in Table 1 and briefly explained below; Fig. 3 shows sample images from the experimental datasets.

Table 1 The stage-wise statistics for the dataset preparation process

Step  Objective                          Processing type  Number of people  Number of images/people  Total number of images
1     Generation of list for candidates  Automatic        10,000            400                      4,000,000
2     Expansion of image set             Automatic        5244              4000                     20,976,000
3     Ranking of image sets              Automatic        5244              2000                     10,488,000
4     Removal of duplicates              Automatic        5244              1246                     6,534,024
5     Filtering manually                 Manual           5244              750                      3,933,000

Fig. 3 The sample images from experimental datasets

Step 1 The first step towards the construction of the face dataset requires obtaining a list of candidate identity names for which faces will be collected. The basic objective is to select eminent personalities for whom a good number of distinct images are available on the web. Based on a popularity ranking from the internet movie database (IMDB) celebrity list, a base list of public figures is obtained. This list contains film
artists. It is then intersected with the people present in the Freebase knowledge graph [3], which contains information for 500 K different identities, resulting in ordered lists of 5 K males and 5 K females, and hence a popular candidate list of 10 K names. For this, attribute information such as ethnicity, age, and kinship is considered. The 10 K names keep the amount of annotation feasible for the team. Filtering is then done on the candidate list in order to replace identities for which distinct images are lacking, and to eliminate overlap with standard benchmark datasets. Next, 400 images for each of the 10 K names are taken from Google image search and given to annotators to determine whether the identity results in adequate image purity; the annotators are advised to keep an identity when its 400-image set has about 90% purity. Image scarcity can sometimes lead to a lack of purity, and the candidate list is reduced to 6500 identities through this filtering. Any name which appears in the LFW or YTF datasets is removed, so that it is possible to train on the new dataset and still perform a fair evaluation on the benchmarks. This leads to a final list of 5244 celebrity names. Step 2 The 5244 celebrity names are searched in Google image search. The activity is repeated with the keyword 'actor' appended to the names, which leads to eight queries per name with 1000 results each, from which 4000 images are obtained for each identity. Step 3 This step removes erroneous faces from each set. To achieve this, the top 100 images for every identity are taken as positive training samples and the top 100 images of the other identities as negative ones. A linear SVM (one-vs-rest) is trained for every identity using the Fisher vector faces descriptor [3]. For each identity the linear SVM then ranks the 4000 downloaded images with respect to that identity, and the top 2000 are retained; the threshold of 2000 was chosen to support high precision of the positive predictions. Step 4 Exact duplicate images, coming from the same image or from copies of the same image found through different searches, are removed, as are near-duplicate images. This is achieved by computing a VLAD descriptor [3] for every image, clustering the descriptors of the 2000 images of each identity with a strict threshold, and retaining a single element per cluster. Step 5 Up to this point there are 5244 identities and around 2000 images per identity. This step increases the data precision through human annotation. To limit the cost of annotation, the annotators are again helped by an automatic ordering. A multi-way CNN based on the AlexNet architecture is trained to distinguish the 5244 face identities, and the softmax scores are used to rank the images within each identity set by decreasing likelihood of being an inlier. The ordered images within every identity are arranged into blocks of 400, and the annotators validate entire blocks: if the approximate purity of a block is greater than 95%, it is considered good. Finally, we have 1,965,606 good images, with about 95% frontal and 5% profile views.
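As an illustration of the one-vs-rest ranking idea in Step 3, a hedged MATLAB sketch is given below; the variables Xall, Xpos, and Xneg (Fisher vector descriptors of, respectively, all downloads of one identity, its top-ranked positives, and negatives drawn from other identities) are placeholders rather than the authors' pipeline.

% Hedged sketch of Step 3: rank one identity's 4000 downloaded images with a linear SVM
mdl = fitclinear([Xpos; Xneg], ...
    [ones(size(Xpos,1),1); -ones(size(Xneg,1),1)], ...
    'Learner', 'svm');
[~, score] = predict(mdl, Xall);                   % decision scores for all 4000 images
[~, order] = sort(score(:,2), 'descend');          % column 2: score of the positive class
keep = order(1:2000);                              % retain the 2000 highest-ranked images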


We now have an accurate large-scale face dataset in place with labeled identities. Additional images may be taken from Wikimedia Commons, IMDB, Baidu, Yandex, etc.; the major concern there is duplicate removal, which should be done very carefully. The filtering in Step 1 of the image collection could be further automated by considering the pairwise distance distribution of the downloaded images, since any image class of higher purity generally exhibits a unimodal distribution.

5 Face Recognition with Deep Learning

In this section important deep architectures for face recognition systems which have evolved over the years are highlighted. The deep face recognition architectures are coupled with discriminative loss functions and face matching through deep features. Since there is a very large number of possible human faces, the face recognition process can be considered a fine-grained object classification task. However, for many applications it is not possible to have all candidate faces available during the training stage, which makes face recognition a zero-shot learning activity. As almost all human faces share a similar shape and texture, the representation learned from a small proportion of faces can generalize well, but this still calls for including many identities in the training set. Various organizations have revealed that their deep face recognition systems are trained on around 10^6–10^7 identities [3], whereas the currently available public training databases contain only 10^3–10^5 identities. This issue is addressed by designing effective loss functions and deeper architectures in order to give the deep features a higher discrimination capability from small training datasets. Here we present significant deep networks for face recognition which have evolved over the years: DeepID, DeepID2, DeepID2+, DeepID3, DeepFace, Face++, FaceNet, and Baidu.

5.1 Deep Learning Architectures

Building on the face recognition overview of Sect. 3, the traditional face recognition problem can be represented as the sequence of activities shown in Fig. 4. Face detection has been tackled with deformable parts models [46], and one of the most celebrated works on face detection is the CNN cascade of [47], shown in Fig. 5; the corresponding test pipeline of the detector is shown in Fig. 6.

Fig. 4 The sequence of activities in traditional face recognition problem


Fig. 5 The CNN cascade for face detection

Fig. 6 The test pipeline for detector

Fig. 7 The 3-level CNN for estimating facial key point positions

Another significant work worth mentioning here is the facial point detection through a CNN cascade by [48]. The facial key point positions are estimated with a three-level CNN, as shown in Fig. 7. The CNN already makes accurate predictions at the first level from the face data presented in Fig. 8, which effectively avoids the local minimum problem faced by other approaches. Figure 9 shows the structure of the deep CNN. At each level, the outputs of multiple networks are fused for accurate and robust estimation. At the initial stage, global high-level features are extracted from the entire face region, which allows highly accurate localization of the key points: each key point is located using texture context information from the whole face, and the geometric constraints among the key points are implicitly encoded. The training of the networks


Fig. 8 The CNN predictions at first level

Fig. 9 The deep CNN structure

is performed so that all important points are predicted at the same time. The method avoids the local minima caused by ambiguity and data corruption in difficult samples, which arise from occlusions, large pose variations, and extreme lighting. At the subsequent levels, the networks are trained to refine the initial predictions locally, with their inputs restricted to small regions around those predictions. Various network structures were investigated for accurate and robust facial point detection. Such face alignment algorithms have been used extensively by the DeepID models [34]. DeepID learns a set of high-level feature representations for face verification. It effectively learns the difficult multi-class face identification task, which generalizes to verification and to identities not present in the training set. DeepID's generalization capability improves as the number


Fig. 10 The high-level feature extraction process

Fig. 11 The structure of DeepID

of face classes predicted during training increases. Figure 10 [34] shows the high-level feature extraction process, and DeepID's structure is shown in Fig. 11. The features are taken from the activations of the last hidden layer of the deep CNN. Classifier-based learning is performed to recognize around 10 K face identities in the training set, and the network is configured so that the number of neurons decreases along the feature extraction hierarchy. Figure 12 shows, for two particular face regions, the ten face regions and three scales used. With a small number of hidden neurons, the deep CNN forms compact identity-related features at the higher layers, and the features extracted from the various face regions form complementary and overcomplete representations. DeepID uses comparatively little data and achieved 97.45% verification accuracy on the LFW dataset with only weakly aligned faces. The model built on the CNN cascade of [48] is also referred to as DeepID1. DeepID1 uses five landmarks, namely the two eye centers, the nose tip, and the two mouth corners, and the faces are globally aligned with similarity transformations. Ten regions at three scales in RGB or gray-scale give 60 patches, and each of the 60 CNNs extracts two 160-dimensional vectors, one from its specific patch and one from the horizontally flipped counterpart. The flipped counterpart of the patch centered at the


Fig. 12 The two particular face regions for ten face regions and three scales

left eye is obtained by flipping the patch centered at the right eye. The total descriptor length for this network is therefore 19,200 (160 × 2 × 60), which is reduced to 150 dimensions through principal component analysis for verification (a short sketch of this assembly is given below). Another significant work, by [49], led to the development of DeepID2. It used the supervised descent method for face alignment and paved the way for DeepID2+ and DeepID3. The underlying second-order descent methods for non-linear optimization offer robustness, fast convergence, and reliability when optimizing a general smooth function, but the function may not be analytically differentiable, numerical approximations can be impractical, and the Hessian may be large and not positive definite. The supervised descent method addresses these deficiencies by minimizing a non-linear least squares function: during training it learns a sequence of descent directions that minimizes the mean of non-linear least squares functions sampled at various points, and during testing it minimizes the non-linear least squares objective using the learned descent directions without computing the Jacobian or the Hessian. DeepID2 identifies landmarks with the CMU IntraFace landmark detector [50], as shown in Fig. 13. Once the landmarks are detected, alignment is achieved with a simple similarity transformation, a strategy used by many models including the DeepIDs. The CNN structure for DeepID2 feature extraction [50] is given in Fig. 14, and the corresponding patches for feature selection [50] are highlighted in Fig. 15. DeepID2 uses 200 patches during training, initially cropped at different positions, scales, and color channels. Every patch and its horizontal flip are fed into a CNN, which extracts two 160-dimensional feature vectors, one from the patch and one from its mirror flip. Figure 15 shows the 25 best patches selected greedily.
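A minimal sketch of how such a patch-based descriptor is assembled is given below (Python; patch_nets and patches are hypothetical stand-ins for the 60 trained patch CNNs and the cropped patches, and the dimensions simply follow the numbers quoted above rather than the original implementation).

import numpy as np
from sklearn.decomposition import PCA

def deepid_style_descriptor(patch_nets, patches):
    # 60 patch CNNs, each yielding a 160-dimensional feature for a patch and for
    # its horizontal flip: 160 x 2 x 60 = 19,200 dimensions in total.
    feats = []
    for net, patch in zip(patch_nets, patches):
        feats.append(net(patch))
        feats.append(net(np.fliplr(patch)))
    return np.concatenate(feats)

# For verification the long descriptor is compressed, e.g. to 150 dimensions:
# pca = PCA(n_components=150).fit(train_descriptors)
# compact = pca.transform(descriptors)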


Fig. 13 The landmarks identified by DeepID2 with the CMU IntraFace landmark detector

Fig. 14 The CNN structure for DeepID2 feature extraction

Fig. 15 The 25 best patches greedily selected

DeepID1 and DeepID2 both use joint Bayesian models for face verification: each uses about 8000 identities for CNN training, while about 2000 identities are held out for joint Bayesian training. There are, however, subtle differences between them. DeepID2 uses a better landmark detector and more landmarks and patches than DeepID1, and it adopts greedy patch selection; to achieve the best performance this selection is repeated seven times while an ensemble model is trained. DeepID2 also combines identification and verification losses, and performs best when the verification signals are included. The next network in this series was DeepID2+ [50]. Figure 16 [51] shows the architecture of DeepID2+. DeepID2+ uses more data, drawing on private datasets such as CelebFaces and WDRef, which


Fig. 16 The architecture of DeepID2+

together constitute about 12 K identities and 290 K images. It uses a larger network with supervision at every layer. Figure 17 [51] shows a schematic representation of the data processed with DeepID2+. It extends DeepID2 [3] by increasing the dimension of the hidden representations and adding supervision to the early convolutional layers. In DeepID2+ individual units become tuned to identities such as Bill Clinton and to attributes such as male, female, white, or black, and binary features are used for faster search and testing. DeepID2+ also shows notable occlusion tolerance; Fig. 18 shows some of the tested occluded images. Like its predecessors, DeepID2+ uses joint Bayesian models for face verification. DeepID3 [51] followed DeepID2+ as a deeper version of its predecessor, with 10-15 non-linear feature extraction layers compared with five for DeepID2+. DeepID3 inherits several attributes from DeepID2+: it keeps unshared neural weights in the last few feature extraction layers and adds supervisory signals to the initial layers in the same way. DeepID3 was rebuilt from stacked convolutional layers and inception layers in the spirit of VGG and GoogLeNet [3] to achieve more robust face recognition, and two versions were proposed. Figures 19 and 20 show the two architectures, DeepID3 net1 and DeepID3 net2. DeepID3's depth comes from stacking multiple convolutional layers before each pooling layer; the continuous convolutions with larger receptive fields help feature formation and also provide more


Fig. 17 The schematic representation of data processed with DeepID2+

Fig. 18 The tested occluded images

complex non-linearity with a restricted number of parameters [3]. DeepID3 net1 uses two consecutive convolutional layers before each pooling layer, and additional supervisory signals are added to fully connected layers branching from the intermediate layers, which helps learn better mid-level features and simplifies optimization. The top two convolutional layers are replaced with locally connected layers; with unshared parameters, these top layers form more expressive features with a reduced feature dimension, and the final locally connected layer extracts the terminal features without any additional fully connected layer. DeepID3 net2 starts with two consecutive convolutional layers before a single pooling layer, and in the later feature extraction stages, built from inception layers [3], three and two consecutive inception layers precede the third and fourth pooling layers, respectively. After each pooling layer, joint identification and verification supervisory signals are added to the fully connected layers. The rectified linear non-linearity [3] is applied to all but the pooling layers in both architectures, and dropout learning [3] is added for the final feature extraction layer. Despite their significant depth, the DeepID3 networks are comparatively smaller than VGG or GoogLeNet because the number of feature maps in each layer is restricted. Identical to

Fig. 19 The architecture of DeepID3 net1: stacked convolutional layers interleaved with pooling, topped by two locally connected layers (Local-connection 9 and 10), with supervisory signals attached through fully connected branches (Full-connection 1-4)

DeepID2+, DeepID3 is trained on the same 25 face regions [3]. Each network takes a particular part of the face as input; these regions are chosen by feature selection and differ in position, scale, and color channel, so that the different networks can easily learn complementary information. Joint face identification and verification supervisory signals are attached to the intermediate and final feature extraction layers during training. Once training is complete, features are extracted from these networks for their respective face regions, and for face verification or identification, learning

Fig. 20 The architecture of DeepID3 net2: initial convolution and pooling stacks followed by consecutive inception layers (Inception 5-9) with pooling, with supervisory signals attached through fully connected branches

is performed with an additional joint Bayesian model [3]. All DeepID3 networks and joint Bayesian models are learned on the same approximately 300 K training samples used for DeepID2+. For face verification, DeepID3 net1 and DeepID3 net2 reduce the error rate by 0.81% and 0.26%, respectively, relative to the DeepID2+ net with horizontal flipping. The ensemble of the two DeepID3 architectures achieved 99.53% LFW face verification accuracy and 96.0% rank-1 face identification accuracy. DeepID2+ and DeepID3 each use about 8000 identities for CNN training and


about 2000 identities are held out for joint Bayesian training. The face identification performance of the different DeepID models is reported in [3]. The next significant deep face recognition model was DeepFace [12]. Developed at Facebook, it aims to close the gap to the benchmarks in unconstrained face recognition and achieves an appreciable level of face verification performance. DeepFace concentrates on the alignment step of the face recognition pipeline, which is somewhat complex: it employs explicit 3D face modeling so that a piecewise affine transformation can be applied, and then derives the face representation from a nine-layer network. The 3D alignment step increases robustness to variations in facial appearance. Figure 21 shows the alignment pipeline for a training image. The DeepFace network contains over 120 million parameters, with many locally connected layers and no weight sharing (a sketch of such a layer appears after this paragraph). Figure 22 presents the DeepFace architecture. Its success rests on the alignment process: because the location of every face region is fixed at the pixel level, learning can proceed directly on the raw RGB pixel values without stacking several convolutional layers [3]. Training was performed on a considerably large facial dataset, labeled with identities and consisting of four million face images of more than 4000 identities, and with slight adaptation the model outperformed existing systems. An extremely compact facial representation is produced; with the large facial database, the learned representations couple well with the accurate model-based alignment, giving good generalization in unconstrained environments. DeepFace achieved 97.35% accuracy on the LFW dataset (detailed results are available in [3]), reducing the error of earlier systems by more than 27% and approaching human-level performance, and it decreased the error rate by over 50% on YTF [3]. Some have argued that DeepFace is nevertheless not as good as DeepID2, DeepID3, Face++, and FaceNet [3].
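The sketch below (PyTorch) illustrates a locally connected layer of the kind described above; it is a simplified stand-in rather than the actual DeepFace implementation. Every output location keeps its own unshared filter, which is meaningful once faces are precisely aligned.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConnected2d(nn.Module):
    # Convolution-like layer with unshared weights: each spatial output location
    # has its own filter, so nothing is shared across the aligned face regions.
    def __init__(self, in_ch, out_ch, in_h, in_w, kernel, stride=1):
        super().__init__()
        self.kernel, self.stride = kernel, stride
        self.out_h = (in_h - kernel) // stride + 1
        self.out_w = (in_w - kernel) // stride + 1
        n_loc = self.out_h * self.out_w
        self.weight = nn.Parameter(torch.randn(n_loc, out_ch, in_ch * kernel * kernel) * 0.01)
        self.bias = nn.Parameter(torch.zeros(n_loc, out_ch))

    def forward(self, x):
        patches = F.unfold(x, self.kernel, stride=self.stride)    # [B, C*k*k, L]
        out = torch.einsum('loc,bcl->bol', self.weight, patches)  # per-location filters
        out = out + self.bias.t().unsqueeze(0)                    # [B, out_ch, L]
        return out.reshape(x.size(0), -1, self.out_h, self.out_w)

For example, LocallyConnected2d(16, 16, 55, 55, kernel=3) would mimic one unshared layer applied to an aligned 55 x 55 feature map; the sizes are illustrative, not those of the original network.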

Fig. 21 The alignment pipeline of a training image


Fig. 22 The DeepFace architecture

Fig. 23 The Megvii face recognition system: in the training phase a deep CNN is trained with softmax for multi-class classification; in the testing phase naive CNNs extract face representations from cropped patches of the raw image, followed by PCA and an L2 distance comparison

Face++ is the next significant deep face recognition model [52]. It is a naive CNN trained on a large dataset, and its recognition performance has improved with the growing number of observations. It uses the Megvii face classification (MFC) database, which consists of five million faces labeled for nearly 20 K persons; the distribution of the MFC database is available in [52]. Figure 23 shows the Megvii face recognition system. The CNN architecture in the Face++ system consists of a simple ten layers. Face++ achieved 99.50% accuracy on the LFW benchmark, and its continued performance improvement and the corresponding long-tail effect are reported in [52]. Face++ did not, however, produce good results on the Chinese ID (CHID) benchmark, which reflects a real security-certificate recognition environment; Fig. 24 shows some failed cases on the CHID benchmark. Face++ requires neither joint identification-verification signals with a joint Bayesian model nor 3D alignment, but it needs a larger dataset than DeepID to complete its training. We now consider another important deep face recognition model, developed by Google, known as FaceNet. FaceNet provides a unified embedding for face recognition as well as clustering and was trained on a very large dataset of about 260 million face images. It is a very deep model in which faces are cropped


Fig. 24 Some failed cases with respect to the CHID benchmark

Fig. 25 The model structure of FaceNet

Fig. 26 The harmonic triplet loss in FaceNet

closely, with no alignment applied beyond the crop. It directly learns a mapping from face images to a compact Euclidean space in which distances directly correspond to a measure of face similarity. Once this space is in place, face recognition, verification, and clustering are implemented with standard techniques operating on the FaceNet embeddings as feature vectors. The deep CNN is trained to optimize the embedding itself rather than an intermediate bottleneck layer. FaceNet was realized with two deep CNN architectures, based on Zeiler and Fergus and on GoogLeNet, described in [38]. The model structure of FaceNet is shown in Fig. 25. FaceNet is trained on triplets of roughly aligned matching and non-matching face patches generated by online triplet mining; the triplet loss separates positive from negative pairs by a distance margin. This approach benefits from great representational efficiency, achieving very good recognition performance with only 128 bytes per face. FaceNet reaches 99.63% accuracy on the LFW dataset and 95.12% on the YouTube Faces database; detailed performance figures are given in [38]. FaceNet also uses a harmonic triplet loss, shown in Fig. 26, describing


different face embedding versions obtained from various networks; the embeddings are compatible with each other and allow direct comparisons. Triplet selection can be tricky. In the spirit of curriculum learning, it is generally advisable to select hard positive or negative examples within a mini-batch; large mini-batches of a few thousand examples are used, with the argmin and argmax computed inside each mini-batch, and additional strategies were devised for selecting semi-hard negative examples. To further improve clustering accuracy, hard-positive mining is used, which encourages spherical clusters for the embeddings of a single person (a sketch of this mining strategy is given below). Figure 27 shows typical face clustering examples. Finally, we present the Baidu [3] face recognition model. Baidu works with multiple patches and aims for the highest accuracy. It is a two-stage approach that integrates a multi-patch deep CNN with deep metric learning, which helps extract low-dimensional, highly discriminative features for face recognition and verification. Figure 28 shows the deep CNN structure on multi-patches. The training data comprises 1.2 M face images of 18 K people. Baidu achieved 99.77% pairwise verification accuracy on the LFW dataset.

Fig. 27 Typical face clustering examples
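A minimal sketch of online semi-hard negative mining within a mini-batch, in the spirit of the strategy described above but not the original FaceNet code, is given below (PyTorch; the embeddings are assumed to be L2-normalised and the margin value is illustrative).

import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(embeddings, labels, margin=0.2):
    # Pairwise Euclidean distances between all embeddings in the mini-batch.
    dist = torch.cdist(embeddings, embeddings)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    losses = []
    for a in range(embeddings.size(0)):
        neg = dist[a][~same[a]]
        if neg.numel() == 0:
            continue
        for p in torch.nonzero(same[a]).flatten():
            if p == a:
                continue
            d_ap = dist[a, p]
            # Semi-hard negatives: farther than the positive but still inside the margin.
            semi = neg[(neg > d_ap) & (neg < d_ap + margin)]
            d_an = semi.min() if semi.numel() > 0 else neg.min()
            losses.append(F.relu(d_ap - d_an + margin))
    return torch.stack(losses).mean() if losses else embeddings.sum() * 0.0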

Fig. 28 The deep CNN structure on multi-patches: nine convolutional layers (Conv1-Conv9) followed by a fully connected layer and a softmax layer

It is significantly better than other techniques. The data size and the number of patches play a vital role: the number of faces and of training identities is critical to overall performance, and multi-patch-based feature and metric learning with a triplet loss keeps improving results even as the size of the data grows. The pairwise error rates for different amounts of training data and different numbers of patches, as well as a comparison between various methods on several evaluation tasks, are available in [3]. Baidu has shown a promising path towards real-world high-performance face recognition systems.

5.2 Discriminative Loss Functions

Carrying over from object classification, the cross-entropy based softmax loss was adopted for feature learning by DeepFace and DeepID. It was soon discovered, however, that the softmax loss on its own is not capable of learning features with a large margin, which led to the exploration of discriminative loss functions with superior generalization ability. Two families worth mentioning are the Euclidean distance based and cosine margin-based losses; in addition, feature and weight normalization became popular. We briefly discuss the Euclidean distance, cosine margin, and softmax loss functions for the different deep face recognition architectures.


5.2.1 Euclidean Distance Loss

The Euclidean distance loss is a form of metric learning [3]: it embeds images in a Euclidean space while decreasing the intra-class variance and increasing the inter-class variance. The contrastive and triplet losses are the functions most commonly used. The contrastive loss operates on pairs of face images, pulling positive pairs together and pushing negative pairs apart:

LOSS = b_{ij}\,\max\left(0,\ \lVert h(a_i) - h(a_j)\rVert_2 - \varepsilon^{+}\right) + (1 - b_{ij})\,\max\left(0,\ \varepsilon^{-} - \lVert h(a_i) - h(a_j)\rVert_2\right)    (1)

In Eq. (1), b_{ij} = 1 indicates that a_i and a_j are matching samples and b_{ij} = -1 indicates non-matching samples; h(·) is the feature embedding, and ε^+ and ε^- control the margins for matching and non-matching pairs, respectively. DeepID2 combines softmax-based face identification and contrastive-loss-based verification supervisory signals to learn a discriminative representation, with a joint Bayesian model applied to reach a robust embedding space. DeepID2+ enlarged the dimension of the hidden representations and added supervision on the initial convolutional layers, and DeepID3 further drew on VGGNet and GoogLeNet. The margin parameters of the contrastive loss are, however, difficult to choose. The triplet loss instead considers the relative difference between the distances of matching and non-matching pairs. Google introduced FaceNet and the triplet loss for face recognition tasks: for a face triplet, it minimizes the distance between an anchor and a positive sample of the same identity while maximizing the distance between the anchor and a negative sample of a different identity. Using hard triplet face samples, FaceNet requires

\lVert h(a_i^{x}) - h(a_i^{s})\rVert_2^{2} + \beta < \lVert h(a_i^{x}) - h(a_i^{n})\rVert_2^{2}    (2)

In Eq. (2), a_i^{x}, a_i^{s}, and a_i^{n} denote the anchor, positive, and negative samples, respectively, β is the margin, and h(·) is the non-linear embedding from image to feature space. TPE and TSE instead learn a linear projection P to construct the triplet loss, represented by the following equations:

(a_i^{x})^{T} P^{T} P\, a_i^{s} + \beta < (a_i^{x})^{T} P^{T} P\, a_i^{n}    (3)

(a_i^{x} - a_i^{s})^{T} P^{T} P\, (a_i^{x} - a_i^{s}) + \beta < (a_i^{x} - a_i^{n})^{T} P^{T} P\, (a_i^{x} - a_i^{n})    (4)
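A minimal sketch of the contrastive verification loss of Eq. (1) is given below (PyTorch). Here the pair label is taken as 1 for matching and 0 for non-matching pairs, and the margin values are illustrative rather than those used by any of the cited models.

import torch.nn.functional as F

def contrastive_loss(f_i, f_j, match, eps_pos=0.5, eps_neg=1.5):
    # match: 1 for pairs of the same identity, 0 otherwise.
    d = F.pairwise_distance(f_i, f_j)           # ||h(a_i) - h(a_j)||_2
    pos = match * F.relu(d - eps_pos)           # pull matching pairs inside eps_pos
    neg = (1 - match) * F.relu(eps_neg - d)     # push non-matching pairs beyond eps_neg
    return (pos + neg).mean()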

Other techniques combine the triplet and softmax losses [3]: the network is first trained with softmax and then fine-tuned with the triplet loss. Both the contrastive and triplet losses can be unstable during training because effective training samples must be selected. The center loss and its variants instead compress the intra-class variance directly. The center loss is represented as

LOSS_{Centre} = \frac{1}{2} \sum_{i=1}^{m} \lVert a_i - center_{b_i}\rVert_2^{2}    (5)

In Eq. (5), a_i is the ith deep feature belonging to class b_i and center_{b_i} is the center of the deep features of class b_i. To handle long-tailed data, the range loss minimizes the harmonic mean of the greatest ranges within one class and maximizes the shortest inter-class distance within one batch. Some other variations were used by (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html). The center loss and its variants still suffer from heavy memory consumption on the classification layer, and they require balanced and adequate training data for each identity.
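A minimal sketch of the center loss of Eq. (5) is given below (PyTorch); the weighting against the softmax loss in the usage comment is an illustrative value, not one taken from the original work.

import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    # One learnable center per class; features are penalised by the squared
    # distance to the center of their own class, as in Eq. (5).
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        diff = features - self.centers[labels]      # a_i - center_{b_i}
        return 0.5 * (diff ** 2).sum(dim=1).sum()

# Typical usage: total = softmax_ce + lambda_c * CenterLoss(...)(features, labels),
# with a small weight lambda_c (e.g. 0.003, illustrative).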

5.2.2 Cosine Margin Loss

The cosine margin-based loss functions [3] separate the learned features by a large cosine distance. The original softmax loss is reformulated into the large-margin softmax loss by requiring

\lVert P_1\rVert\,\lVert a\rVert \cos(p\varphi_1) > \lVert P_2\rVert\,\lVert a\rVert \cos(\varphi_2)    (6)

In Eq. (6), p is a positive integer controlling the angular margin, P is the weight of the last fully connected layer, a is the deep feature, and ϕ is the angle between them. A piecewise function is applied to the large-margin softmax loss to enforce monotonicity. The loss function is

LOSS_i = -\log\left( \frac{ e^{\lVert P_{b_i}\rVert\,\lVert a_i\rVert\, \theta(\varphi_{b_i})} }{ e^{\lVert P_{b_i}\rVert\,\lVert a_i\rVert\, \theta(\varphi_{b_i})} + \sum_{j \neq b_i} e^{\lVert P_j\rVert\,\lVert a_i\rVert \cos(\varphi_j)} } \right)    (7)

with

\theta(\varphi) = (-1)^{k} \cos(p\varphi) - 2k, \qquad \varphi \in \left[ \frac{k\pi}{p}, \frac{(k+1)\pi}{p} \right]    (8)

The large-margin softmax loss converges slowly, so the standard softmax loss is integrated to aid convergence, with its weight controlled by a dynamic hyperparameter λ. Considering the additional softmax loss, the function changes into

h_{b_i} = \frac{ \lambda\, \lVert P_{b_i}\rVert\,\lVert a_i\rVert \cos(\varphi_{b_i}) + \lVert P_{b_i}\rVert\,\lVert a_i\rVert\, \theta(\varphi_{b_i}) }{ 1 + \lambda }    (9)


Building on the large-margin softmax loss, the angular softmax loss further normalizes the weight P by its L2 norm so that ‖P‖ = 1; the normalized vectors lie on a hypersphere, and the discriminative face features are then learned on this hypersphere manifold with an angular margin. A deep hyper-spherical convolution network (SphereNet) was also proposed, which takes hyper-spherical convolution as its basic operator and is supervised with an angular margin-based loss. Whereas the large-margin softmax and angular softmax losses impose the angular margin in a multiplicative manner, ArcFace, CosineFace, and the AM-Softmax loss place an additive margin, using cos(ϕ + p) and cos(ϕ) - p. These are considerably more straightforward, need no hyperparameter λ, and converge without additional softmax supervision. The cosine margin-based losses add discriminative constraints explicitly on a hypersphere, which intrinsically matches the prior that human faces lie on a manifold.
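A minimal sketch of an additive cosine margin layer in the spirit of CosFace / AM-Softmax is given below (PyTorch); the scale s and margin m are illustrative values, and this is not the exact formulation of any one of the cited losses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveMarginSoftmax(nn.Module):
    # Weights and features are L2-normalised so the logits are cosines; an additive
    # margin m is subtracted from the target-class cosine and the result is re-scaled
    # by s before the usual softmax cross-entropy.
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        cos = F.linear(F.normalize(features), F.normalize(self.weight))  # cos(phi_j)
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)  # margin on the target class only
        return F.cross_entropy(logits, labels)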

5.2.3 Softmax Loss with Its Variations

The softmax loss itself, combined with feature or weight normalization, is another function used extensively [3]. The normalization takes the form

\hat{P} = \frac{P}{\lVert P\rVert}, \qquad \hat{a} = \alpha\, \frac{a}{\lVert a\rVert}    (10)

In Eq. (10), α is a scaling parameter; scaling a to a fixed radius α is important because it has been shown that normalizing both weights and features to unit norm leaves the softmax loss trapped at a high value on the training data. Feature and weight normalization are good strategies to combine with the loss functions above: some losses normalize the weights only and train with a cosine margin to make the learned features discriminative, while feature normalization is adopted to overcome the bias of the softmax loss towards the sample distribution. Moreover, the L2 norm of features learned with the softmax loss is informative of face quality, and the L2-softmax constrains all features to a similar L2 norm through feature normalization, so that good-quality frontal faces and blurry faces in extreme poses receive similar attention. Some other significant works on the softmax loss are listed in [3].
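A minimal sketch of the normalisation of Eq. (10) is given below (PyTorch; the value of α is illustrative).

import torch.nn.functional as F

def l2_constrained_logits(features, weight, alpha=16.0):
    # P_hat = P / ||P||, a_hat = alpha * a / ||a||; the resulting logits are then
    # fed to the usual softmax cross-entropy.
    w_hat = F.normalize(weight)
    a_hat = alpha * F.normalize(features)
    return F.linear(a_hat, w_hat)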

5.3 Face Matching Through Deep Features

At test time, the cosine distance or the L2 distance is used to measure the similarity between two deep features a_1 and a_2. The important tasks here are verification and identification: verification decisions are made by comparing the similarity against a threshold, while identification relies on a nearest neighbor classifier.
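A minimal sketch of these two decision rules on deep features is given below (Python with NumPy; the threshold value is illustrative and dataset dependent).

import numpy as np

def cosine_similarity(a1, a2):
    return float(np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2)))

def verify(a1, a2, threshold=0.5):
    # Face verification: accept the pair if the similarity exceeds the threshold.
    return cosine_similarity(a1, a2) >= threshold

def identify(probe, gallery_features, gallery_ids):
    # Closed-set identification: nearest-neighbour search over the gallery features.
    sims = [cosine_similarity(probe, g) for g in gallery_features]
    return gallery_ids[int(np.argmax(sims))]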

5.3.1 Face Verification

Metric learning tries to identify a new metric that makes two classes more separable, and it can also be used for face matching on the extracted deep features. The joint Bayesian model [3] is an important metric learning method that has been found to improve performance significantly. In the joint Bayesian model a face feature a is modeled as a = μ + ε, with μ and ε being the identity and intra-personal variations, respectively. The similarity score is

sim(a_1, a_2) = \log \frac{\mathrm{Prob}(a_1, a_2 \mid H_{IS})}{\mathrm{Prob}(a_1, a_2 \mid H_{ID})}    (11)

In Eq. (11), Prob(a_1, a_2 | H_{IS}) and Prob(a_1, a_2 | H_{ID}) are the probabilities that the two faces belong to the same identity and to different identities, respectively.

5.3.2 Face Identification

After the cosine distance is computed, a heuristic voting strategy over the similarity scores of multiple CNN models [3] has been used for robustness. Local adaptive convolution features can also be extracted from the local regions of a face image and fed to an extended sparse representation classifier for face recognition with a single sample per person. Deep features have likewise been combined with an SVM classifier to recognize all classes. With deep features, product quantization has been used to retrieve the top-k most similar faces, which are then re-ranked by combining the deep-feature similarities with a COTS matcher. Softmax scores have also been used for face matching when the training and test set identities overlap. These face matching methods are effective when the training and testing data follow the same distribution; a distribution change or domain shift between the two degrades performance. Transfer learning has therefore been applied to deep face recognition, utilizing data from relevant source domains to perform face recognition in a target domain; combining CNN features with template-specific linear SVMs can also be viewed as transfer learning. Transfer learning can further be embedded in the deep models themselves to learn more transferable representations. Other significant works can be found in [3].

5.4 Face Processing Using Deep Features

The processing of faces with deep features [3] plays a significant role in the face recognition task. Broadly, this activity is classified into one-to-many augmentation and many-to-one normalization. One-to-many augmentation [3] addresses the data collection challenge effectively, since it can augment the training data as well as the test data


gallery. These methods fall into four classes: data augmentation, 3D methods, CNN-based methods, and generative adversarial networks (GAN). Data augmentation includes photometric transformations, geometric transformations, and mirroring and rotating images, and has been widely used with deep face recognition algorithms; a sketch of a typical augmentation pipeline is given below. 3D reconstruction of faces enriches the diversity of the training data, and deep methods have also been used for the 3D reconstruction itself. CNN models can generate 2D images directly: in the multi-view perceptron, identity features are learned by deterministic hidden neurons and view features are captured by random hidden neurons, so that sampling different random neurons yields face images in varying poses. GANs have been used to refine images by combining prior knowledge of the data distribution with knowledge of faces, and several GAN variants are available. Many-to-one normalization [3] produces frontal faces and reduces the appearance variability of the test data so that faces can be aligned and compared. These methods fall into three classes: stacked progressive auto-encoders (SAE), CNN-based methods, and GANs. SAE maps a non-frontal face to a frontal face with a stack of auto-encoders. CNNs have been used to extract face-identity-preserving features and reconstruct face images in a canonical view, using feature extraction and frontal-face reconstruction modules. A GAN-based approach combines four landmark-located patch networks with a global encoder-decoder network; through a combination of adversarial, symmetry, and identity-preserving losses it generates the frontal view while maintaining global structure and details from local areas.
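A minimal sketch of such a one-to-many augmentation pipeline is given below (torchvision; the transform parameters are illustrative, not taken from any of the cited works).

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # mirroring
    transforms.RandomRotation(degrees=10),                  # geometric: small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # photometric changes
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),    # geometric: scaled crops
    transforms.ToTensor(),
])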

6 Experimental Results

This section highlights the results of various experiments with the deep networks discussed in Sect. 5.1 on the simulated face recognition datasets given in Sect. 4. The experiments use the DeepID1, DeepID2, DeepID2+, DeepID3, DeepFace, Face++, FaceNet, and Baidu models, all trained on the simulated dataset. The deep models are then evaluated on the LFW and YTF datasets, which serve as benchmarks since they contain identities different from those in the simulated dataset of Sect. 4. The implementation uses Python 3 (https://www.python.org/download/releases/3.0/) with all code written from scratch on top of the NVIDIA cuDNN libraries (https://developer.nvidia.com/cudnn) to speed up training. All investigations were run on NVIDIA GTX Titan Black GPUs (https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-black/specifications) with 16 GB of onboard memory and 12 GPUs in total. This hardware configuration [3] was prepared specifically for this work and is warranted by the significant memory footprint and high complexity of the deep networks. The CNNs used here contain linear and non-linear class predictor and softmax layers, from which the descriptor vectors are obtained as output. Given a face image, four 256 × 256 pixel patches are cropped at the corners, the center crop is horizontally flipped, and the corresponding feature


vectors are averaged. To test multi-scale behavior, the face is scaled to three sizes of 384, 512, and 1024 pixels, and the cropping process is repeated for each size [3]; the resulting face descriptor is the average over all features (a sketch of this procedure is given below). Face detection follows the technique highlighted in [3]. For face alignment, landmarks are computed over the faces [3] and a 2D similarity transformation maps each face to a canonical pose. For the face videos considered, K face descriptors are selected per video by ordering the faces by their landmark confidence scores and taking the top K; 2D alignment is applied to frontal faces, no alignment is applied to profiles, and the average of the K face descriptors represents the video. Important aspects considered while training the deep learning models include dataset curation, image alignment, and variation of the network parameters. Data curation improves the models' performance considerably, alignment of the test images also boosts performance, and carefully varying key parameters such as the learning rate and momentum improves overall network performance. During training, face recognition models are evaluated under subject-dependent or subject-independent settings. Under the subject-dependent protocol, the testing identities are predefined in the training set, so classifying testing faces into the given identities is straightforward; subject-dependent face recognition is treated as a classification problem with separable features, but it only suits small-scale applications. Under the subject-independent protocol, the testing identities are disjoint from the training set, which makes face recognition far more challenging: because testing faces cannot be classified into the known training identities, a subject-independent representation is essential. Since human faces exhibit similar intra-subject variations, deep models display remarkable generalization ability when trained on a sufficiently large set of generic subjects; what matters is learning discriminative deep features with a large margin. All major face recognition benchmarks use the subject-independent protocol.
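A minimal sketch of the multi-crop, multi-scale descriptor computation described above is given below (Python with NumPy and scikit-image; net stands in for any of the trained feature extractors and is an assumption, not part of the original code).

import numpy as np
from skimage.transform import resize

def test_time_descriptor(net, face, scales=(384, 512, 1024), crop=256):
    # face: H x W x 3 array; net maps an image patch to a 1-D feature vector.
    feats = []
    for s in scales:
        img = resize(face, (s, s))                      # rescale (channels preserved)
        c = crop
        corners = [img[:c, :c], img[:c, -c:], img[-c:, :c], img[-c:, -c:]]
        h0 = (s - c) // 2
        centre = img[h0:h0 + c, h0:h0 + c]
        patches = corners + [np.fliplr(centre)]         # four corners + flipped centre
        feats.extend(net(p) for p in patches)
    return np.mean(feats, axis=0)                       # average of all feature vectors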

6.1 Evaluation Rules

To evaluate the performance of deep learning models on real-life face recognition problems, several testing datasets with different tasks and scenes have been developed [3]. Depending on the testing task, a recognition model's performance is evaluated under face verification, closed-set face identification, or open-set face identification settings [3], and every task has its own performance metrics. Face verification is relevant to access control systems, re-identification, and independent application scenarios. It is measured with the receiver operating characteristic and the estimated mean accuracy: the receiver operating characteristic analysis measures the true acceptance and false acceptance rates at a chosen threshold, while the estimated mean accuracy represents the correct classification percent-


age. Testing datasets demand an increasingly strict degree of security, reflecting the fact that customers care most about the true acceptance rate when the false acceptance rate is kept very low, as required in security certifications. Closed-set face identification evaluates user-driven searches and uses the rank-N and cumulative match characteristic metrics. Rank-N is the percentage of probe searches that return the probe's gallery mate within the top N ranks, and the cumulative match characteristic curve reports the percentage of probes recognized at a given rank; the rank-1 and rank-5 recognition rates receive particular attention. A challenge competition has systematically evaluated the rank-1 recognition rate as a function of an increasing number of gallery distractors. Instead of rank-N and the cumulative match characteristic, the precision-coverage curve has also been applied to measure identification performance under a variable threshold, where a probe is rejected when its confidence score falls below the threshold. Open-set face identification evaluates high-throughput face search systems in which the recognition system must reject unknown subjects, i.e., probes not present in the gallery at test time; at present very few databases cover the open-set face recognition task. One benchmark uses the decision error trade-off curve to characterize the false negative identification rate as a function of the false positive identification rate, where the false positive and false negative identification rates count scores above and below the thresholds for probe and non-mate gallery templates. A sketch of how the verification and closed-set identification metrics are computed is given below.
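The sketch below (Python with NumPy) computes the true acceptance rate at a fixed false acceptance rate and the rank-N identification rate, assuming the genuine/impostor scores and the probe-gallery similarity matrix have already been computed.

import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-3):
    # True acceptance rate at a fixed false acceptance rate, read off the ROC.
    threshold = np.quantile(impostor_scores, 1.0 - far)
    return float(np.mean(np.asarray(genuine_scores) >= threshold))

def rank_n(similarity, probe_ids, gallery_ids, n=1):
    # Closed-set rank-N identification rate from a probe x gallery similarity matrix.
    top = np.argsort(-similarity, axis=1)[:, :n]
    hits = [probe_ids[i] in np.asarray(gallery_ids)[top[i]] for i in range(len(probe_ids))]
    return float(np.mean(hits))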

6.2 Comparison of Experimental Results of Existing Facial Models

This subsection highlights the experimental results of the deep learning models on the simulated dataset. Table 2 reports the results of the deep face recognition models on the experimental dataset described in Sect. 4, together with the number of images considered for each technique. From Table 2 it is observed that DeepID1, DeepID2, DeepID2+, and DeepID3 achieve an accuracy of over 99% on the experimental

Table 2 The comparative analysis of deep face recognition models on experimental datasets

Deep face recognition technique   Images considered   Accuracy %   Training time (min)
DeepID1                           4,000,000           99.50        46
DeepID2                           6,000,000           99.53        48
DeepID2+                          8,000,000           99.53        50
DeepID3                           10,000,000          99.53        54
DeepFace                          12,000,000          97.86        58
Face++                            100,000,000         97.96        186
FaceNet                           200,000,000         98.96        246
Baidu                             400,000,000         98.99        406


Table 3 The comparative analysis of deep face recognition models on LFW and YTF datasets

Deep face recognition technique   Accuracy % (LFW)   Accuracy % (YTF)   Training time (min)
DeepID1                           99.27              99.24              46
DeepID2                           99.34              99.37              48
DeepID2+                          99.34              99.37              50
DeepID3                           99.34              99.39              54
DeepFace                          97.69              97.68              58
Face++                            97.70              97.69              186
FaceNet                           98.75              98.77              246
Baidu                             98.89              98.86              406

dataset. DeepFace and Face++ reach an accuracy of over 97%, while FaceNet and Baidu achieve close to 99% accuracy. Image datasets of varying size were used to obtain the best possible results, and the accuracy of these techniques could be improved further with other image dataset variants. The corresponding training times are also listed.

6.3 Comparative Analysis of Deep Learning Models with Benchmark Datasets

This subsection presents the comparative analysis of the deep learning models on the LFW and YTF benchmark datasets. Table 3 reports the results of the deep face recognition models on LFW and YTF, which serve as benchmarks for the results presented above. From Table 3 it is observed that DeepID1, DeepID2, DeepID2+, and DeepID3 achieve an accuracy of over 99% on both LFW and YTF, DeepFace and Face++ reach over 97%, and FaceNet and Baidu achieve over 98% on both datasets. The corresponding training times are also listed. Comparing these results with those in Sect. 6.2, the experimental datasets yield better recognition accuracies than LFW and YTF. It should be noted, however, that the image dataset sizes were deliberately made non-uniform in order to obtain the best accuracy results. The accuracy of these techniques could be improved further with other benchmark image datasets.

7 Further Discussions

In this section we present subtle face recognition issues as well as some open-ended research questions [3] that set the direction for future research. These are aspects that we are currently exploring.


7.1 Other Recognition Issues

The goal is always a stage at which face recognition algorithms perform well on any test dataset; reaching it requires both larger training datasets and superior algorithms. Owing to privacy and security concerns, the publicly available datasets are generally built from celebrity pictures, so they do not reflect the diversity of images taken in everyday life. Despite the high accuracy levels reached on LFW and other benchmark datasets, present face recognition systems rarely meet real-life requirements. It is commonly assumed that the performance of deep models can be enhanced simply by collecting large datasets of target images, but this is only partially true, and serious efforts have been made to address target images with superior algorithms and minimal data. We briefly discuss these issues around cross-factor face recognition, heterogeneous face recognition, single- or multiple-media face recognition, and industry-oriented face recognition. Cross-factor face recognition covers cross-pose, cross-age, and makeup face recognition. Cross-pose face recognition is difficult for extremely challenging image scenes; approaches mainly include one-to-many augmentation, many-to-one normalization, multi-input networks, and multi-task learning. Frontalization is often performed in the deep feature space, where a deep residual equivariant mapping block is dynamically added to the residuals so that the input representation transforms a profile face into a frontal image; other notable works can be found in [3]. Cross-age face recognition faces the challenges arising from facial appearance changes due to aging; here the input image is synthesized towards a target age. A generative probabilistic approach models the facial aging process in short stages, while identity-preserving conditional generative adversarial networks generate a face in which an identity-preserving module maintains the identity information and an age classifier drives the generated face towards the target age; significant works are available in [3]. Makeup face recognition is challenged by significant facial appearance changes and relies on matching makeup and non-makeup faces; important works can be found in [3]. Heterogeneous face recognition covers near-infrared versus visible-light face recognition, low-resolution face recognition, and photo-sketch face recognition. Near-infrared images are widely used in surveillance systems because of their superior performance, and since enrolled databases contain visible-light images, near-infrared face recognition against a visible-light gallery has been actively investigated; notable works are available in [3]. Low-resolution face recognition relies on deep networks because of their robustness to the degree of resolution loss; significant works can be tracked in [3]. Photo-sketch face recognition helps law enforcement identify suspects quickly. Its methods fall into two classes: one utilizes transfer


learning to directly match photos with sketches: deep networks are first trained on a large photo face database and then fine-tuned on a small sketch database. The other class uses image-to-image translation, transforming the photo into a sketch or vice versa so that face recognition can be done in a single domain; the important works are listed in [3]. Single- or multiple-media face recognition covers low-shot face recognition, template-based face recognition, and video face recognition. Low-shot face recognition has several applications in surveillance and security, where the system must recognize a person from very limited training samples; low-shot learning approaches are divided into enlarging the training data and learning more powerful features, with significant works available in [3]. Template-based face recognition assumes that both the probe and the gallery are represented by sets of media. After face representations are learned from each medium individually, two strategies are adopted for recognition between sets: in one, the similarities between the media of the two sets are pooled into a single final score; in the other, the face representations are aggregated by max pooling into a single representation per set, followed by a comparison between the two sets; notable works can be found in [3]. Video face recognition addresses two important aspects: integrating information across different frames to build a video face representation, and handling video frames with severe blur, pose variations, and occlusions; important works are available in [3]. Industry-oriented face recognition includes 3D face recognition, partial face recognition, face anti-spoofing, and mobile-device-based face recognition. 3D face recognition holds advantages over 2D methods but is not yet well developed owing to the lack of large annotated 3D data; one-to-many augmentation is mostly used to synthesize 3D faces and enlarge 3D training datasets, while effective methods for extracting deep features from 3D faces still need to be explored; significant works are available in [3]. Partial face recognition deals with arbitrary-size face patches and has become an emerging problem with the increasing need for identification from CCTV cameras and from embedded vision systems in mobile devices, robots, and smart-home facilities; notable works are given in [3]. Face anti-spoofing addresses print attacks, replay attacks, and 3D mask attacks; it is a critical step for recognizing whether a face is live or spoofed and whether it carries a true or false identity; works worth mentioning are given in [3]. Mobile-device-based face recognition has become prominent with the emergence of mobile phones, tablets, and augmented reality; because of computational limitations, recognition on these devices must be carried out in a lightweight but timely fashion; significant works can be found in [3]. In almost all deep face recognition systems, CNNs are widely used to achieve good performance, and an appreciable amount of effort has gone into enhancing CNNs through superior architectures and learning methods. Closely related is the study of how various covariates affect the performance


of CNNs [3]. The covariates cover image quality factors such as blur, JPEG compression, occlusion, noise, brightness, contrast, and missing pixels, as well as model characteristics such as architecture, color information, and the descriptor computation. The face verification performance impact has been analyzed for AlexNet, VGG-Face, GoogLeNet, and SqueezeNet. Thorough investigation shows that high noise levels, blur, missing pixels, and brightness have a significant effect on the verification performance of these models, whereas changes in contrast and compression artifacts matter less, and the descriptor computation strategy and color information do not influence performance significantly.

7.2 Open Research Questions

Some of the open-ended research questions that set the tone for further research and exploration [3] are highlighted here.

(a) How accurately and reliably can deep face recognition systems identify and analyze faces based on age, gender, and emotion despite appearance variations? The human face changes considerably with age and never remains the same, and people in different age groups show different appearance variations. The problem becomes more complicated when age is combined with gender, and adding emotion increases the complexity further. The question to be addressed is the accuracy and reliability of such systems; a fair amount of work has been done in this direction, but it remains an active research problem in deep face recognition.

(b) How well do deep face recognition systems perform when confronted with real-life data? Real-life data is full of anomalies and uncertainties, is essentially of unknown nature, and is riddled with unwanted facets, particularly noise, that further distort it. Any deep face recognition system is put to the test when analyzing such data and needs to be self-adaptive to address the issues it presents. Several efficient deep face recognition systems exist, but how well they perform on different real-life datasets is an active area of exploration.

(c) How fast do deep face recognition algorithms achieve generalization when trained across different training datasets? Generalization is the most sought-after property of most deep face recognition algorithms, and the algorithms are usually trained on larger datasets to achieve it. When an algorithm is trained across different datasets, convergence to the best generalization values remains an open question, because the nature of the different


training datasets is never identical, so the algorithm parameters need to be tuned so that better generalization is reached. This is a growing research area that will evolve as better deep face recognition systems appear.

(d) What is the computational complexity involved in executing deep face recognition algorithms on some of the fastest computers? The computational complexity of a deep face recognition algorithm follows from the number of parameters that make it up. With the tremendous growth in processing power of computational devices over the years, the problem is tractable; however, as the number of parameters grows to address all user requirements, the processing power of the devices must grow at the same rate, otherwise newer algorithms with larger numbers of parameters will not achieve the required computational speed.

(e) What level of perfection, in terms of human-level facial perception, have deep face recognition systems achieved? Human facial expression changes constantly over time, so a deep face recognition system should be able to capture all the possible perceptions available on the human face. This makes the problem difficult and challenges the algorithms to reach appreciable levels of perfection; it is also a growing research area that will evolve with the deep face recognition systems themselves.

These questions can readily be taken up as active research topics in deep face recognition.

8 Conclusion

We have presented a comparative analysis of the important deep face recognition models that have evolved over the past decade. The models considered in this investigation are the DeepIDs, DeepFace, Face++, FaceNet, and Baidu. A multi-stage strategy is used to collect large face datasets and thereby develop new datasets for the experiments, on which model training and evaluation are performed. The experimental results for the different deep face recognition methods are presented: DeepID1, DeepID2, DeepID2+, and DeepID3 achieve an accuracy exceeding 99% on the experimental dataset, DeepFace and Face++ exceed 97%, and FaceNet and Baidu reach close to 99% accuracy. The image datasets are of different sizes so as to reach the best possible results. These results are supplemented with a comparative analysis on benchmark datasets, where it is observed that the experimental datasets with their different image dataset sizes reach better recognition accuracies than the LFW and YTF datasets. For the LFW and YTF benchmarks, DeepID1, DeepID2,


DeepID2+, and DeepID3 reach an accuracy exceeding 99%, DeepFace and Face++ exceed 97%, and FaceNet and Baidu exceed 98% on both datasets. Several deep face recognition issues and open-ended questions are identified, which form potential directions for future research.

References

1. M. Wang, W. Deng, Deep face recognition: a survey, arXiv:1804.06655v7 (2018)
2. K. Grm, V. Štruc, A. Artiges, M. Caron, H.K. Ekenel, Strengths and weaknesses of deep learning models for face recognition against image degradations, arXiv:1710.01494v1 (2017)
3. A. Chaudhuri, Some Investigations on Deep Face Recognition Using Artificially Created Datasets, Technical Report TH-1696 (Samsung R & D Institute Delhi, Noida, 2016)
4. O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in Proceedings of British Machine Vision Conference, ed. by X. Xie, M.W. Jones, G.K.L. Tam (BMVA Press, Surrey, 2015), pp. 41.1–41.12
5. K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: delving deep into convolutional nets. arXiv:1405.3531 (2014)
6. G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S.Z. Li, T. Hospedales, When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition, arXiv:1504.02351v1 (2015)
7. S. Zhou, S. Xiao, 3D face recognition: a survey. Hum. Centric Comput. Inf. Sci. 8(35), 19–45 (2018)
8. M. Chihaoui, A. Elkefi, W. Bellil, C.B. Amar, A survey of 2D face recognition techniques. Computers 5(4), 21 (2016)
9. A. Chaudhuri, 2D and 3D Face Recognition Revisited Again, Technical Report TH-1486 (Samsung R & D Institute Delhi, Noida, 2014)
10. A. Chaudhuri, Studying Face Recognition Using Convolutional Neural Networks, Technical Report TH-1669 (Samsung R & D Institute Delhi, Noida, 2016)
11. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Proceedings of Neural Information Processing Systems (MIT Press, Cambridge, 2012), pp. 1097–1105
12. Y. Taigman, M. Yang, M.A. Ranzato, L. Wolf, DeepFace: closing the gap to human level performance in face verification, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE Computer Society Press, Los Alamitos, 2014), pp. 1701–1708
13. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015)
14. S. Chaib, H. Yao, Y. Gu, M. Amrani, Deep feature extraction and combination for remote sensing image classification based on pre-trained CNN models, in Proceedings of 9th International Conference on Digital Image Processing (IEEE, Piscataway, 2017)
15. Y. Liu, Y. Li, X. Ma, R. Song, Facial expression recognition with fusion features extracted from salient facial areas. Sensors (Basel) 17(4), 712 (2017)
16. J. Lu, G. Wang, J. Zhou, Simultaneous feature and dictionary learning for image set based face recognition. IEEE Trans. Image Process. 26(8), 4042–4054 (2017)
17. G. Hu, X. Peng, Y. Yang, T.M. Hospedales, J. Verbeek, Frankenstein: learning deep face representations using small data. IEEE Trans. Image Process. 27(1), 293–303 (2018)
18. M. Oquab, L. Bottou, I. Laptev, Learning and transferring mid-level image representations using convolutional neural networks, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE Service Center, Piscataway, 2014), pp. 1717–1724
19. V. Campos, B. Jou, X. Giro-i-Nieto, From pixels to sentiment: fine-tuning CNNs for visual sentiment prediction. Image Vis. Comput. 65, 15–22 (2017)

20. V. Nagori, Fine tuning the parameters of back propagation algorithm for optimum learning performance, in Proceedings of 2nd International Conference on Contemporary Computing and Informatics, (IEEE, Piscataway, 2016), pp. 7–12 21. M. Tzelepi, A. Tefas, Exploiting supervised learning for finetuning deep CNNs in contentbased image retrieval, in Proceedings of 23rd International Conference on Pattern Recognition, (IEEE, Piscataway, 2016), pp. 2918–2923 22. Y. Li, W. Xie, H. Li, Hyperspectral image reconstruction by deep convolutional neural network for classification. Pattern Recogn. 63, 371–383 (2016) 23. W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29(9), 2352–2449 (2017) 24. X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 9, 249–256 (2010) 25. W. Shang, K. Sohn, D. Almeida, H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in Proceedings of International Conference on Machine Learning, (IEEE, Piscataway, 2016), pp. 2217–2225 26. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 27. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in Proceedings of International Conference on Machine Learning, (IEEE Service Center, Piscataway, 2015), pp. 448–456 28. J. Li, T. Qiu, C. Wen, K. Xie, F.Q. Wen, Robust face recognition using the deep C2D-CNN model based on decision level fusion. Sensors (Basel) 18(7), 2080 (2018) 29. J.Y. Choi, Y.M. Ro, K.N. Plataniotis, Color local texture features for color face recognition. IEEE Trans. Image Process. 21(3), 1366–1380 (2012) 30. Z. Lu, X. Jiang, A. Kot, A novel LBP-based color descriptor for face recognition, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, (IEEE Service Center, Piscataway, 2017), pp. 1857–1861 31. Z. Lu, X. Jiang, A. Kot, An effective color space for face recognition, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, (IEEE Service Center, Piscataway, 2016), pp. 849–856 32. Z. Lu, X. Jiang, A. Kot, A color channel fusion approach for face recognition. IEEE Signal Process. Lett. 22(11), 1839–1843 (2015) 33. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Service Center, Piscataway, 2016), pp. 770–778 34. Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Service Center, Piscataway, 2014), pp. 1891–1898 35. O. Tadmor, Y. Wexler, T. Rosenwein, S. Shalevshwartz, A. Shashua, Learning a metric embedding for face recognition using the multi batch method, in Proceedings of Neural Information Processing Systems, (MIT Press, Cambridge, 2016), pp. 1388–1389 36. Y. Wen, K. Zhang, Z. Li, Y. Qiao, A discriminative feature learning approach for deep face recognition, in Proceedings of European Conference on Computer Vision, (Springer Nature, Cham, 2016), pp. 499–515 37. Y. Zhang, K. Shang, J. Wang, N. Lia, M.M.Y. Zhang, Patch strategy for deep face recognition. IET Image Process. 12(5), 819–825 (2018) 38. F. Schroff, D. Kalenichenko, J. 
Philbin, FaceNet: a unified embedding for face recognition and clustering, arXiv:1503.03832v3 (2015) 39. S. Dodge, L. Karam, Understanding how image quality affects deep neural networks, in Proceedings of 8th IEEE International Conference on Quality of Multimedia Experience, (IEEE Service Center, Piscataway, 2016), pp. 1–6 40. B.R. Webster, S.E. Anthony, W.J. Scheirer, PsyPhy: a psychophysics driven evaluation framework for visual recognition, arXiv:1611.06448 (2016)

41. A. Dosovitskiy, J.T. Springenberg, M. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, in Proceedings of Neural Information Processing Systems, (MIT Press, Cambridge, 2014), pp. 766–774 42. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in Proceedings of European Conference on Computer Vision, (Springer International Publishing, Cham, 2014), pp. 818–833 43. K. Lenc, A. Vedaldi, Understanding image representations by measuring their equivariance and equivalence, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, Los Alamitos, 2015), pp. 991–999 44. R. Ranjan, S. Sankaranarayanan, A. Bansal, N. Bodla, J.C. Chen, V.M. Patel, C.D. Castillo, R. Chellappa, Deep learning for understanding faces: machines may be just as good, or better than humans. IEEE Signal Process. Mag. 35(1), 66–83 (2018) 45. J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, arXiv:1709.01507 (2017) 46. A. Chaudhuri, Face Detection Using Deformable Parts Models, Technical Report, TH-1679 (Samsung R & D Institute Delhi, Noida, 2016) 47. H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face detection, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, Los Alamitos, 2015), pp. 5325–5334 48. Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point detection, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, Los Alamitos, 2013), pp. 3476–3483 49. X. Xiong, F.D.L. Torre, Supervised descent method and its applications to face alignment, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE Computer Society, Los Alamitos, 2013), pp. 532–539 50. Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identificationverification, in Proceedings of Neural Information Processing Systems, (MIT Press, London, 2014), pp. 1988–1996 51. Y. Sun, D. Liang, X. Wang, X. Tang, DeepID3: face recognition with very deep neural networks. arXiv:1502.00873v1 (2015) 52. E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: touching the limit of LFW benchmark or not? arXiv:1501.04690v1 (2015)

Developing Cloud-Based Intelligent Touch Behavioral Authentication on Mobile Phones

Zhi Lin, Weizhi Meng, Wenjuan Li, and Duncan S. Wong

1 Introduction

Mobile devices, especially smartphones, have become a necessity in people's daily lives. A report from Deloitte [8] indicates that smartphone penetration rose steeply to 80% in 2017, compared with 81% for laptops. They further predict that the smartphone is one of the most used devices and that its usage is likely to become even more intensive over the coming years. The International Data Corporation also predicts that the overall smartphone market can reach 1.646 billion units shipped in 2022 [19]. The capability of current smartphones has extended from making phone calls to serving as music players, book readers, gaming controllers, and mobile payment tools. In other words, the smartphone can be treated as a personal assistant to each phone user. In this case, users are likely to store private and sensitive data on their phones, e.g., credit card numbers, credentials, and photos [10, 20, 47], and to make transactions via their phones, e.g., shopping and bank transfers [50, 67]. However, smartphones have already become a major target for cyber-criminals [56]. Hackers often compromise

Z. Lin: CyberTree, Shatin, Hong Kong SAR
W. Meng: DTU Compute, Technical University of Denmark, Lyngby, Denmark; e-mail: [email protected]
W. Li: Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR; DTU Compute, Technical University of Denmark, Lyngby, Denmark; e-mail: [email protected]
D. S. Wong: Hong Kong Applied Science and Technology Research Institute, Shatin, Hong Kong SAR
© Springer Nature Switzerland AG 2020. R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_7

users' phones by manipulating applications from the phone market [48], or launch advanced attacks like juice filming charging attacks to steal users' private information [38, 40]. In addition, smartphones are easily lost or stolen, which gives cyber-criminals many chances to exploit sensitive user information or make malicious use of the phone [3, 6, 43, 44]. Therefore, it is very important to develop proper user authentication mechanisms to ensure the security of smartphones.

Existing user authentication mechanisms on mobile devices can be classified into textual authentication and biometric authentication, the latter including physiological authentication, e.g., facial authentication, and behavioral authentication, e.g., keystroke dynamics and touch dynamics. The former includes personal identification numbers (PINs) [4] and graphical passwords (e.g., Android unlock patterns) [9, 42], but such passwords suffer from many security and usability issues in practice [24, 70]. As an example, users are very likely to create and use simple passwords because of long-term memory limitations [12, 21]. Similarly, graphical passwords suffer from many known security issues, e.g., authentication credentials are easily stolen via shoulder surfing, in which attackers use direct observation techniques to obtain users' information [54, 62]. Further, graphical passwords can be recovered by side-channel attacks, e.g., Android unlock patterns can be identified via the smudge attack, where attackers extract recently touched locations on the phone screen by retrieving smudges [1].

To enhance user authentication, research has moved to investigating the performance of biometric authentication. Generally, there are two types of biometric solutions: physiological authentication and behavioral authentication. The former often uses measurements from the human body to verify users, such as fingerprints [29], iris [55], hand [7, 15], voice [18], face [68], and so on. The latter mainly uses measurements from human actions to authenticate users, such as keystroke dynamics [2], mouse dynamics [49], etc. The major limitation of physiological authentication is one-time authentication, in which the system only verifies a user at the beginning of a session without re-authentication, leaving a hole for cyber-criminals. Additionally, it often requires deploying special hardware to support the collection of physiological features. Although some biometrics, like face and fingerprint, have been applied on smartphones, more accurate features like iris and retina are still very expensive for practical implementation.

Motivations. In comparison, behavioral authentication has two major advantages: (1) it can provide continuous authentication, which enables continuously authenticating users while they use their phones [10]; and (2) there is no need to implement additional hardware for collecting behavioral features, since most smartphones already have various sensors for recording touch behaviors, such as touch events, touch coordinates, touch pressure, and so on. As touchscreens have come to dominate the phone market, touch dynamics has become a more important biometric on smartphones than keystroke dynamics [10]. In the literature, many touch behavioral authentication schemes have been proposed and studied, like FAST [10]. Machine learning techniques are often applied to behavioral authentication; however, Frank et al. [13] found that the performance of a classifier fluctuates depending on the
specific data. Then they proposed Touchalytics and concluded that it cannot work as a standalone authentication mechanism for long-term authentication. This is caused by the unstable performance of a classifier [60]. Therefore, how to design a robust behavioral authentication scheme remains a challenge.

Contributions. Motivated by this challenge, we previously developed an intelligent approach for machine-learning-based biometric authentication, in which the selection of a classifier is updated periodically so as to maintain the authentication performance [30, 32, 41]. In this chapter, we extend our previous work and adopt the same cost-based intelligent mechanism. In a practical deployment, we found that the intelligent mechanism may pose an additional workload on the phone; we thus propose a cloud-based scheme to further reduce the workload, which is mainly caused during the process of classifier training. The major contributions can be summarized as follows:

• We adopt the same cost-based intelligent mechanism from our previous work [30, 32, 41], which aims to maintain the performance of classifiers by intelligently selecting a more accurate but less costly classifier for user authentication. Differently from that work, in this chapter we propose a cloud-based scheme to help further reduce the workload.
• For performance comparison, we employ the same touch-dynamics-based authentication scheme as in the prior work [41], which consists of 9 gesture-related features, such as the number of touch movements, the number of single-touch events, the average time duration of touch movements, the average speed of touch movement, touch size, and touch pressure.
• In the evaluation, we perform a user study with 30 participants and compare the performance of our cloud-based scheme against the original scheme designed in [41]. Our experimental results indicate that our proposed scheme can greatly reduce the workload on smartphones compared to the original scheme.

The chapter is organized as follows. Section 2 introduces related research studies on biometric authentication in the literature, and Sect. 3 describes the employed cost-based intelligent mechanism. Section 4 introduces our adopted touch-dynamics-based authentication scheme and touch features. Section 5 presents our developed cloud-based scheme and shows how to implement such authentication. Section 6 conducts a user study to evaluate our proposed mechanism. We discuss some open challenges in Sect. 7 and conclude our work in Sect. 8.

2 Related Work

As touchscreen mobile devices have come to dominate the market, research on touch behavioral authentication started around 2009. Numabe et al. [46] extracted the coordinates of taps on touch panels and showed that the way a touch panel is tapped could be used to identify different fingers. Kim et al. [22] introduced a study of
defeating shoulder surfing attacks for PIN input on a machine using multi-touch actions. Fiorella et al. [11] provided a study of applying multi-touch input for 3D object manipulation on mobile devices (e.g., rotation, translation, and scaling). Currently, machine learning has been extensively used for constructing behavioral authentication mechanisms. Feng et al. [10] developed FAST, a finger gesture-based authentication system for touchscreen devices, which offers both a passive and a continuous authentication mode based on users' touch gestures. It could achieve a FAR of 4.66% and a FRR of 0.13% by means of a random forest classifier. The major limitation is that they used a digital sensor glove that is usually unavailable on a mobile device. Meanwhile, Meng et al. [32] developed a touch gesture-based user authentication scheme on smartphones, which has a total of 21 touch related features. In a study with 20 Android phone users, their scheme could reach an average error rate of about 3% with a combined classifier of PSO-RBFN. Then, Frank et al. [13] introduced a behavioral authentication scheme called Touchalytics, including 30 behavioral touch features. Their scheme could achieve a median equal error rate of below 4% when the authentication test was carried out 1 week after the enrollment phase. They further concluded that their scheme could only be implemented as a means to extend screen-lock time and could not perform as a standalone mechanism. Afterwards, various touch behavioral authentication schemes have been proposed [34]. Zheng et al. [73] investigated the feasibility of verifying users based on their tapping behaviors on a passcode-enabled smartphone. They particularly used a one-class algorithm to compute the nearest neighbor distance for the training data. In a study with 80 participants, they showed an averaged equal error rate of nearly 3.65%. Sae-Bae et al. [52] proposed an algorithm for behavioral matching and introduced a scheme with 22 multi-touch gestures from hand and finger movement. Smith-Creasey and Rajarajan [59] used a stacked classifier approach, which could achieve an equal error rate of 3.77% for a single sample based on a dataset. Shahzad et al. [57] designed BEAT, an authentication scheme for touchscreen devices based on users' certain actions on the touchscreens, like inputting a gesture or a signature. Their system mainly considered features of how users provide input, such as velocity, device acceleration, and stroke time. Sharma and Enbody [58] studied whether a mobile application running on a touch-enabled device can continuously and unobtrusively authenticate its users based solely on their interactions with the application interface. They used an SVM-based ensemble classifier and achieved a mean equal error rate of 7% for user authentication and a median accuracy of 93% for user identification. In our previous work [33, 41], we proposed a cost-based intelligent mechanism to help choose a less costly algorithm for user authentication. The experimental results showed that it was effective in maintaining the authentication accuracy at a relatively high and stable level, as compared to the use of a single classifier. Other related work includes, but is not limited to, [5, 16, 35–37, 39, 45, 51, 61, 63, 65, 66, 69, 72].

3 Cost-Based Intelligent Mechanism

As the performance of a classifier typically fluctuates across different features and datasets, it is very difficult to select a single appropriate machine learning classifier to authenticate phone users over a long period of time [23]. Frank et al. [13] identified this issue in particular and concluded that a machine-learning-based behavioral authentication scheme cannot perform as a standalone mechanism. This is because phone users cannot ensure that they interact with their phone in a constant way. To reduce the impact of unstable user behavior, we advocate that an intelligent mechanism can be used to maintain the classifier performance [31, 35]. In this chapter, we adopted the same intelligent mechanism with a cost-based metric from our previous studies [30, 41]. This metric is designed to measure the performance of different classifiers from the view of cost. Next, we provide a set of definitions in relation to the cost-based measurement.

Definition 1 A Cost Ratio (C) is calculated as C = Cβ/Cα, where Cα represents the cost of identifying an imposter as a legitimate user and Cβ represents the cost of identifying a legitimate user as an imposter.

Definition 2 A cost-based decision tree can be built according to [14]. P1 denotes the probability that the detector reports a legitimate user, P2 the conditional probability of a legitimate user given that the detector identifies a legitimate user, and P3 the conditional probability of a legitimate user given that the detector identifies an imposter. More specifically, P1, P2, and P3 can be calculated via Bayes' theorem:

P1 = (1 − α)(1 − P) + βP    (1)
P2 = (1 − α)(1 − P)/[(1 − α)(1 − P) + βP]    (2)
P3 = α(1 − P)/[α(1 − P) + (1 − β)P],    (3)

where α represents the false positive rate (FP, P(I|L)), β represents the false negative rate (FN, P(L|I)), and P represents the prior probability of detecting an imposter. Note that α and β are two parameters of a classifier.

Definition 3 The Initial Expected Cost (Ciec) is defined as the sum of the products of the probabilities of the detector's outputs and the expected costs conditional on the outputs, which can be calculated as Ciec = min{CβP, (1 − α)(1 − P)} + min{C(1 − β)P, α(1 − P)}, based on [14].

The previous work [17] has shown that Ciec suffers from some limitations in real scenarios, i.e., it has nothing to do with α and β if CP < α/(1 − β) and (1 − α) ≈ (1 − P) ≈ 1. To mitigate this issue, we tune a measure called relative expected cost based on [30].

Definition 4 Relative Expected Cost (Crec) is defined as a relative sum tuned from the Initial Expected Cost, which can be calculated as follows:

Crec = CβP + α(1 − P).    (4)

From the view of cost, the metric of relative expected cost (Crec) can be used to evaluate the performance of different classifiers in identifying behavioral anomalies and to decide on a less costly classifier. In particular, a classifier with a lower relative expected cost indicates less information loss during the model training process. More details about the derivation of P1, P2, and P3 can be found in [14, 30]. It is worth noting that α, β, and P can be computed in advance as long as we have the training dataset.

The Adopted Mechanism. Figure 1 presents the design of the cost-based intelligent mechanism for user authentication, including four major phases: data collection, behavior modelling, classifier selection, and behavior matching. First of all, data collection aims to gather the predefined features required by the authentication scheme. Then, based on the data, the mechanism starts building a normal behavioral model according to particular classifiers. Next, classifier selection aims to choose the best classifier, i.e., the one with the lowest cost value (each classifier can be run as a system process). In practice, there will be a classifier pool that contains a set of classifiers. In the phase of behavior matching, the selected classifier is used to compare the current behavioral model with the normal model and decide whether there is a malicious event. Based on the decision, the smartphone can perform a series of operations to protect the device, e.g., locking the phone screen immediately. After a period of time, this mechanism should select the best classifier again after obtaining some new data items. The selection result can be a new classifier or the same classifier, depending on the cost-based metric.

Fig. 1 The cost-based intelligent mechanism for behavioral authentication on smartphones
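To make the cost-based selection concrete, the following minimal Python sketch (an illustration added here, not the implementation used in the reported experiments) computes the initial expected cost of Definition 3 and the relative expected cost of Eq. (4) for each candidate classifier and returns the one with the lowest Crec. The classifier names and error-rate values in the example are hypothetical.

```python
# Minimal sketch of the cost-based classifier selection, assuming the error
# rates alpha (false positive rate) and beta (false negative rate) of each
# classifier have already been estimated on the training data.

def initial_expected_cost(alpha, beta, prior_p, cost_ratio):
    """Ciec = min{C*beta*P, (1-alpha)(1-P)} + min{C*(1-beta)*P, alpha*(1-P)} (Definition 3)."""
    return (min(cost_ratio * beta * prior_p, (1 - alpha) * (1 - prior_p))
            + min(cost_ratio * (1 - beta) * prior_p, alpha * (1 - prior_p)))

def relative_expected_cost(alpha, beta, prior_p, cost_ratio):
    """Crec = C*beta*P + alpha*(1-P), following Eq. (4)."""
    return cost_ratio * beta * prior_p + alpha * (1 - prior_p)

def select_classifier(error_rates, prior_p, cost_ratio=10):
    """Return the classifier with the lowest relative expected cost.

    error_rates: dict mapping classifier name -> (alpha, beta).
    """
    costs = {name: relative_expected_cost(a, b, prior_p, cost_ratio)
             for name, (a, b) in error_rates.items()}
    return min(costs, key=costs.get), costs

if __name__ == "__main__":
    # Hypothetical error rates for a five-classifier pool.
    pool = {"J48": (0.08, 0.07), "NBayes": (0.11, 0.09), "RBFN": (0.05, 0.05),
            "BPNN": (0.06, 0.06), "SVM": (0.05, 0.04)}
    best, costs = select_classifier(pool, prior_p=0.5, cost_ratio=10)
    print("Selected classifier:", best)
```

In the adopted mechanism, this selection step would simply be re-run whenever a new batch of behavioral data becomes available.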

4 Touch-Dynamics-Based Authentication Scheme

Intuitively, the intelligent mechanism may cause additional workload on smartphones; thus, a lightweight authentication scheme is desirable. In this chapter, we adopted a 9-feature touch behavioral authentication scheme from our previous work [41]. Each feature is explained as follows.

1. The number of touch movements per session (denoted NTM)
2. The number of single-touch events per session (denoted NST)
3. The number of multi-touch events per session (denoted NMT)
4. The average time duration of touch movements per session (denoted ATTM)
5. The average time duration of single-touch per session (denoted ATST)
6. The average time duration of multi-touch per session (denoted ATMT)
7. Average speed of touch movement (denoted ASTM)
8. Average touch size (denoted ATS)
9. Average touch pressure (denoted ATP)

These features are selected based on their capability of distinguishing different users (i.e., the previous studies [32, 53] have validated that these features are good at characterizing the touch behavior of a user). In addition, features like the number of touch movements and the average time duration of touch movements are relatively easier to calculate than speed-related features. The speed-related features often need much more processing time in computing the speed of a touch event for a direction. Thus, we employ a simpler speed-related feature, called average speed of touch movement (denoted ASTM), which could require less computing resources, as below:

ASTM = [ Σ_{i=2..n} √((Xn − Xn−1)² + (Yn − Yn−1)²) / (Sn − Sn−1) ] / n,  (n ∈ N),    (5)

where (Xn , Yn ) and (Xn−1 , Yn−1 ) are supposed to be two points within a touch movement, Sn and Sn−1 are the relevant system time and n is the number of recorded points within a touch movement. The previous study [41] has shown that the average speed of touch movement can be used to characterize and distinguish different users. In this case, this scheme can authenticate users by means of an authentication signature (AutSig) including a total of 9 features AutSig = {NTM, NST, NMT, ATTM, ATST, ATMT, ASTM, ATS, ATP}. To make a decision, the scheme has to compare the current authentication signature with the normal authentication signature.
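As an illustration of how the nine-feature signature could be computed from one recorded session, the sketch below (a simplified reconstruction for illustration, not the code used in this chapter) derives NTM, NST, NMT, the three duration averages, ASTM following Eq. (5), ATS, and ATP from a list of touch gestures. The gesture representation and its fields are assumptions made for the example.

```python
from dataclasses import dataclass, field
from math import sqrt

@dataclass
class TouchGesture:
    # Simplified representation of one touch gesture (fields are assumptions).
    kind: str                    # "movement", "single" or "multi"
    duration: float              # time duration of the gesture in seconds
    size: float                  # reported touch size
    pressure: float              # reported touch pressure
    points: list = field(default_factory=list)   # [(x, y, t), ...] for movements

def _avg(values):
    return sum(values) / len(values) if values else 0.0

def astm(points):
    """Average speed of one touch movement, following Eq. (5):
    summed point-to-point speeds divided by the number of recorded points n."""
    if len(points) < 2:
        return 0.0
    total = 0.0
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = (t1 - t0) or 1e-6   # guard against identical timestamps
        total += sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2) / dt
    return total / len(points)

def auth_signature(session):
    """Build the 9-feature AutSig from one session of TouchGesture objects."""
    moves   = [g for g in session if g.kind == "movement"]
    singles = [g for g in session if g.kind == "single"]
    multis  = [g for g in session if g.kind == "multi"]
    return {
        "NTM":  len(moves),
        "NST":  len(singles),
        "NMT":  len(multis),
        "ATTM": _avg([g.duration for g in moves]),
        "ATST": _avg([g.duration for g in singles]),
        "ATMT": _avg([g.duration for g in multis]),
        "ASTM": _avg([astm(g.points) for g in moves]),
        "ATS":  _avg([g.size for g in session]),
        "ATP":  _avg([g.pressure for g in session]),
    }
```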

5 Our Proposed Approach and Implementation

In this section, we first propose a cloud-based scheme to help reduce the workload imposed on smartphones, and then show how to implement the authentication system in practice.

5.1 Cloud-Based Scheme

Intuitively, choosing a classifier in an intelligent way can cause additional workload on smartphones. The use of a lightweight authentication scheme is one solution, but this may decrease the scalability of our intelligent mechanism. There is a need to further reduce the workload of choosing a less costly classifier. In this chapter, we adopt a cloud-based scheme to enhance the original scheme by offloading the processes of classifier selection and behavioral modelling to a cloud environment. Figure 2 shows the high-level architecture of our cloud-based scheme.

• Cloud. Cloud computing can be considered as a shared pool of configurable computer resources and higher-level services that can be rapidly provisioned with minimal management effort over the Internet. To reduce the workload, our scheme offloads both classifier selection and behavioral modelling to the cloud environment. In practice, it is more secure if users can employ a private cloud, or apply some privacy-preserving techniques for a public cloud.
• Interactions among each phase. Firstly, in the phase of data collection, our scheme can upload the packed behavioral data to the phase of behavioral

Fig. 2 The proposed cloud-based scheme for intelligent touch behavioral authentication on smartphones (the figure shows the phases Data Collection, Behavioral Modelling, Classifier Selection, Behavior Matching, and the final Decision, with Behavioral Modelling and Classifier Selection running in the Cloud)

modelling in the cloud. Similarly, a normal behavioral model is built for each classifier in the classifier pool. The classifier selection phase then takes all classifiers into consideration and selects the best classifier, i.e., the one that reaches the lowest cost value. Finally, the selected classifier is forwarded to the phone side and used to decide whether there is an imposter. Users can also implement some security policies to react to anomalies.
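The chapter does not prescribe a concrete phone-to-cloud protocol, so the sketch below only illustrates one plausible realization of the interactions just described: the phone uploads packed session signatures, and later retrieves the identifier of the classifier that the cloud selected as least costly. The endpoint URL, paths, and payload fields are all hypothetical.

```python
import requests  # assuming a standard HTTPS client is available on the phone side

CLOUD_URL = "https://cloud.example.invalid/touch-auth"   # hypothetical service

def upload_behavioral_data(user_id, signatures):
    """Phone to cloud: send packed behavioral data for behavioral modelling."""
    resp = requests.post(f"{CLOUD_URL}/sessions",
                         json={"user_id": user_id, "signatures": signatures},
                         timeout=10)
    resp.raise_for_status()

def fetch_selected_classifier(user_id):
    """Cloud to phone: retrieve the classifier chosen with the lowest cost value."""
    resp = requests.get(f"{CLOUD_URL}/selected-classifier",
                        params={"user_id": user_id}, timeout=10)
    resp.raise_for_status()
    info = resp.json()   # e.g. {"classifier": "SVM", "model": "<serialized model>"}
    return info["classifier"], info.get("model")
```

In a real deployment, the uploaded data would additionally need to be protected, e.g., with the privacy-preserving techniques mentioned above.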

5.2 Data Collection

Similar to prior studies [33, 41], we used a Google/HTC Nexus One Android phone for data collection. It has a 1 GHz CPU and 512 MB of storage memory, with a multi-touch capacitive touchscreen (resolution 480 × 800 px). In particular, we updated the Android OS with a self-modified custom version based on CyanogenMod.¹ The changes were made in the application framework layer by inserting some log commands to help record raw input data from the touchscreen, such as the timing of touch inputs, the x and y coordinates, the type of the input (e.g., press down, press up), and the touch pressure. For the implementation, we inserted two system-log commands, Slog.v and Slog.i, into the newly compiled OS to output the captured behavioral data. By means of a log application, we can obtain two log items with different log titles (i.e., "V/Action Inputdevice" and "I/InputDevice"), but they actually present the same information; the two log commands are only used to examine the application, and for data analysis we treat them as the same. Later, we utilized a separate application to extract and process the recorded data.²
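The exact layout of the injected log messages is not given in the chapter, so the parser below is only a sketch under the assumption that each logged touch record carries the "I/InputDevice" title followed by simple key=value pairs; the field names are placeholders. Since both log titles carry the same information, parsing one of them suffices.

```python
import re

# Assumed line layout, purely for illustration, e.g.:
#   I/InputDevice: time=1334.5 x=201.0 y=455.5 type=down pressure=0.31
LOG_PATTERN = re.compile(
    r"I/InputDevice.*?"
    r"time=(?P<time>[\d.]+)\s+x=(?P<x>[\d.]+)\s+y=(?P<y>[\d.]+)\s+"
    r"type=(?P<type>\w+)\s+pressure=(?P<pressure>[\d.]+)")

def parse_touch_log(lines):
    """Extract raw touch records from captured log lines (assumed format above)."""
    records = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            records.append({
                "time": float(match.group("time")),
                "x": float(match.group("x")),
                "y": float(match.group("y")),
                "type": match.group("type"),
                "pressure": float(match.group("pressure")),
            })
    return records
```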

5.3 Session Identification

To judge whether the current user is legitimate or not, we have to extract an authentication signature from each session and compare it with the normal signature(s). For this purpose, there is a need to identify a session for generating a signature. Generally, there are two ways of identifying a session: the time-based method and the event-based method. The former identifies a session by fixing a time period, e.g., using a 10-min session [32]. The latter identifies a session by collecting a predefined number of events. In some cases, it is hard to ensure that the time-based method can collect enough touch gestures in a session, which may result in an inaccurate

¹ http://www.cyanogenmod.com/
² A beta version of our customized Android OS can be downloaded from SourceForge: https://sourceforge.net/projects/touchdynamicsauthentication/files/Android_OS/

behavioral model. Similar to previous work [33, 41], we adopted the event-based method, where each session is defined to contain a total of 120 touch gestures. The former studies have shown that such a number of events can be used to build a proper classifier model for user authentication. The beginning and the end of a session are decided as below.

• A session ends if the number of touch events has reached or exceeded 120.
• A new session starts when a new touch input is recorded and the last session has ended.

Based on the above rules, it is easy to identify the session-start and session-end events by counting the number of touch gestures in the raw data record.
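A minimal sketch of these event-based rules (an illustration, not the implementation used here): gestures are consumed in chronological order, a session is closed as soon as it contains 120 gestures, and the next gesture starts a new session. How a trailing partial session is handled is not specified in the chapter; here it is simply kept separate.

```python
SESSION_SIZE = 120   # number of touch gestures per session, as adopted above

def split_into_sessions(gestures, session_size=SESSION_SIZE):
    """Group a chronological stream of touch gestures into sessions,
    following the session-start/end rules above."""
    sessions, current = [], []
    for gesture in gestures:
        current.append(gesture)            # a new session starts implicitly here
        if len(current) >= session_size:   # session ends at 120 gestures
            sessions.append(current)
            current = []
    if current:                            # trailing, not-yet-complete session
        sessions.append(current)
    return sessions
```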

6 Evaluation

In this part, we present a user study with 30 users to evaluate the performance of our approach. Next, we introduce the study methodology, the evaluation metrics, and the experimental results.

6.1 Study Methodology

We recruited a total of 30 Android phone users (13 female and 17 male) in our experiments; most of them were students, and the other participants were engineers and researchers in the college. All the participants were regular Android phone users and ranged in age from 20 to 45 years. Participants' information is shown in Table 1. To reduce bias and control the environment, we provided the Android phone (Google/HTC Nexus One) with the customized OS to all the participants. The detailed procedure is similar to our previous work [33, 41]. First, we introduced the study goals to all participants and explained what kind of data might be collected. We highlighted that no private information like application usage would be collected, and sought approval from each participant. In the user study, we requested all participants to use the phones in the same way as they would use their own phones in their own places, such as browsing websites, accessing files, and interacting with any applications. Specifically, participants could do actual data

Table 1 Information of participants in the study

Occupation     Male   Female
Students       13     8
Engineers      2      2
Researchers    2      3

collection outside of our lab, allowing them to get familiar with the phone before the study. They could also decide when to start the collection process, to avoid rushing and recording unusual data. All participants were required to record up to 30 sessions within 4 days while they were using the phone. As a result, we collected a raw dataset including 900 sessions of 120 touch events each.

6.2 Evaluation Metrics

In this evaluation, we use two typical metrics to compare the performance of our cloud-based scheme and the original behavioral authentication scheme with the intelligent mechanism.

• False Acceptance Rate (FAR): the probability that an impostor is classified as a legitimate user.
• False Rejection Rate (FRR): the probability that a legitimate user is classified as an impostor.

In a practical deployment, a balance should be made between the false acceptance rate (security) and the false rejection rate (usability). Based on [32, 33, 41], a false rejection is less costly than a false acceptance, as a high false acceptance rate would definitely degrade the whole security level for user authentication. By contrast, a high false rejection rate would frustrate a legitimate user, which is still unfortunate but arguably less problematic than degrading the whole security level. A desirable authentication scheme is expected to have both a low FAR and a low FRR. We also used the average value of FAR and FRR in the evaluation.
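For completeness, the following sketch shows how FAR, FRR, and the average error rate used below can be computed from a set of labelled test decisions (an illustration only; the label names are arbitrary).

```python
def far_frr_aer(labels, decisions):
    """Compute FAR, FRR and their average (AER).

    labels/decisions: parallel sequences containing "legitimate" or "imposter";
    labels hold the ground truth, decisions hold the classifier output.
    """
    imposter_trials   = [d for l, d in zip(labels, decisions) if l == "imposter"]
    legitimate_trials = [d for l, d in zip(labels, decisions) if l == "legitimate"]
    far = (imposter_trials.count("legitimate") / len(imposter_trials)
           if imposter_trials else 0.0)
    frr = (legitimate_trials.count("imposter") / len(legitimate_trials)
           if legitimate_trials else 0.0)
    return far, frr, (far + frr) / 2
```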

6.3 Evaluation Results

In this evaluation, we employed five common and popular classifiers according to our prior work [33, 41]. Thus, the classifier pool includes a decision tree (J48), Naive Bayes (NBayes), a radial basis function network (RBFN), a back-propagation neural network (BPNN), and a support vector machine (SVM). All these classifiers were taken from the WEKA platform [64] to avoid any implementation bias.

Evaluation on Classifiers. Similar to [41], we used 20 sessions (about 67% of the total sessions) as training data to build a normal behavioral model for each classifier and used the remaining sessions for testing the classifier performance. The results for FAR, FRR, average error rate (AER), and standard deviation (SD) are described in Table 2. The performance evaluation was run in ten-fold mode using the WEKA platform. It is observed that among all classifiers, SVM could reach the best

Table 2 Performance of different classifiers

Measure      J48     NBayes   RBFN    BPNN    SVM
FAR (%)      8.78    11.21    5.12    6.22    4.89
FRR (%)      7.32    9.73     4.55    5.98    4.55
AER          8.05    10.47    4.84    6.10    4.72
SD in FAR    7.24    6.83     6.21    5.72    4.77
SD in FRR    8.22    6.53     5.19    6.05    4.53

Table 3 The results of classifier selection with cost values for five participants

User ID    15 sessions      20 sessions      25 sessions      30 sessions
User 2     SVM (1.123)      BPNN (1.2022)    SVM (1.2223)     SVM (1.1132)
User 5     SVM (1.2211)     J48 (1.2332)     J48 (1.1774)     RBFN (1.1922)
User 16    RBFN (1.3132)    BPNN (1.2132)    BPNN (1.2523)    BPNN (1.2532)
User 22    RBFN (1.2665)    SVM (1.2774)     BPNN (1.2214)    J48 (1.2343)
User 28    SVM (1.3442)     BPNN (1.3112)    J48 (1.2455)     SVM (1.2221)

performance with an AER of 4.72% (where FAR is 4.89% and FRR is 4.55%). The RBFN classifier could reach a very close performance with an AER of 4.84%. The remaining classifiers, J48, NBayes, and BPNN, could only achieve an AER above 5%. The results were similar to our previous work [41]. These results demonstrate that SVM could achieve a better accuracy than the other classifiers for the collected data.

Evaluation on Intelligent Mechanism. As users' touch actions are often not stable, it is very hard for a classifier to build an accurate profile. The intelligent mechanism is thus used to select a better classifier periodically. To test the mechanism performance, we employed the same classifiers of J48, NBayes, RBFN, BPNN, and SVM. The cost ratio C was set to 10 based on the simulation results in the previous work [30, 41]. In different network settings, the value of the cost ratio could be tuned by security administrators. We first used 15 sessions for training and added 5 sessions in each round (where all the previous sessions were used for training). Table 3 depicts the classifier selection with cost values (relative expected cost) for 5 participants. It is found that the best classifier was usually not the same in each round. Taking User 22 as an example, RBFN was selected first, then SVM, BPNN, and J48 were selected, respectively. These results validate that classifier performance fluctuated across different data items. Our results are in line with the observations in the previous work [30, 41].

Additional Caused Workload. As described in previous work [33, 41], such an intelligent mechanism can maintain the authentication performance at a relatively high and stable level, at the cost of

Fig. 3 The increased workload between the original scheme and our cloud-based scheme (bar chart of Additional Workload (%) over Classifier Selection, Behavioral Modelling, and Behavior Matching for the Original Scheme and the Cloud-based Scheme)

additional workload. To investigate this issue, we used a tool called CPU-Z (https://play.google.com/store/apps/details?id=com.cpuid.cpu_z) to record the workload of deploying the mechanism. This is a free application that can record detailed information about a mobile device, such as CPU load, CPU architecture, cores, clock speed, and so on. Figure 3 shows the average increased CPU workload for performing profile establishment, profile matching, and classifier selection. The classifier pool has five classifiers, and it is found that the average increased CPU workload is nearly 12% for classifier selection, 21% for behavioral modelling, and 6% for behavior matching. In comparison, our cloud-based scheme can greatly reduce the required workload.

• Our scheme does not cause any workload for classifier selection, because this process is offloaded to the cloud.
• Our scheme can reduce the workload for behavioral modelling to only 5%, which is caused by uploading the behavioral data.
• Our scheme requires about 2% more workload for behavior matching than the original scheme, which is caused by retrieving classifier information from the cloud environment.

Although our scheme adds a bit more workload for behavior matching, it can greatly reduce the workload caused by the other two processes (by 26%). Based
on the collected data, we consider that it is worth implementing the cloud-based scheme on smartphones. In addition, as most smartphone vendors provide cloud service, e.g., iCloud from Apple, it is easy to deploy our scheme in practice. It is worth emphasizing that our adaptive user authentication scheme is not intended to replace the existing authentication methods like PINs, but attempts to complement and work together with existing authentication mechanisms on mobile phones for better user authentication. For example, by working with a PIN-based authentication, the false rates of our touch behavioral user authentication scheme can be further decreased in real-world applications (e.g., FAR is less than 2% and FRR is close to zero [71]).

7 Discussion

Our work presents a cloud-based scheme to further decrease the workload caused by the previously developed cost-based intelligent mechanism on smartphones. There are still many challenges for future studies.

Classifier Pool. In this work, we focused on the performance of five common classifiers and conducted an evaluation accordingly. In the literature, there are plenty of algorithms available, like random forest and deep learning algorithms. One future direction is to consider a larger set of classifiers in the evaluation, e.g., various ensembles like PSO-RBFN, deep learning classifiers, and reinforcement learning. However, there is a need to balance the workload and the number of classifiers: generally, the bigger the classifier pool, the higher the workload that may be caused.

Workload Reduction. Some strategies can be used to help reduce the workload. For example, the mechanism can train classifiers and select a classifier when the phone is not frequently used (e.g., updating at night, or when the phone is charging). To meet different requirements, it is better for users to be able to set the updating time and frequency. In this work, we tested the workload of deploying the mechanism, while even larger experiments could be conducted to validate this issue.

Phone Differences and Potential Impact. Similar to [41], this work also adopted the Google/HTC Nexus One Android phone as the evaluation platform. It is worth noting that our scheme is still applicable to other phones as long as their sensors can capture the required touch features. Generally, modern phones have a stronger capability than our existing platform, i.e., current smartphones feature a larger screen and more sensors. This means that more accurate touch data items can be collected and used to build a behavioral profile. Evaluating our scheme's performance on other phone types is one of our future works.

Involved Users. The number of participants is an important factor in deciding whether a user study is convincing, but it is very hard to invite enough participants in practice. Participants with different backgrounds may result in distinct datasets, making it difficult to compare schemes across different studies. To address this issue, conducting a larger user study with even more users is always desirable. In addition, there is a need for a unified platform to check different authentication schemes.

8 Conclusion

In the literature, research has moved to touch dynamics and started investigating how to design an effective touch behavioral user authentication mechanism. Most schemes adopt machine learning techniques to help build a behavioral profile and decide whether the current user is legitimate. However, many studies have shown that the performance of a classifier may vary with particular datasets. To mitigate this issue, we previously designed a cost-based intelligent mechanism, which can intelligently select a less costly algorithm for user authentication. In practice, we found that additional workload would be caused by such a mechanism. In this chapter, we advocate the effectiveness of intelligent touch behavioral authentication and propose a cloud-based scheme to further reduce the workload by offloading both classifier selection and behavioral modelling to the cloud environment. Our experimental results with 30 Android phone users demonstrate that our scheme can greatly reduce the required workload as compared to the original scheme. There are many potential directions for the future. Future work could include combining the behavioral authentication with other authentication solutions like PINs, and conducting an even larger study to validate the obtained results. Future work could also include investigating how to apply privacy-preserving techniques to protect data privacy when offloading the behavioral data to a cloud environment [25–28].

Acknowledgement The authors would like to thank all participants for their work in the user study.

References 1. A.J. Aviv, K. Gibson, E. Mossop, M. Blaze, J.M. Smith, Smudge attacks on smartphone touch screens, in Proceedings of the 4th USENIX Conference on Offensive Technologies (WOOT) (USENIX Association, Berkeley, 2010), pp. 1–10 2. F. Bergadano, D. Gunetti, C. Picardi, User authentication through keystroke dynamics. ACM Trans. Inf. Syst. Secur. 5(4), 367–397 (2002) 3. N.D.W. Cahyani, B. Martini, K.K.R. Choo, A.M.N. Al-Azhar. Forensic data acquisition from cloud-of-things devices: windows Smartphones as a case study. Concurrency Comput.: Practice and Experience 29(14), e3855 (2017)

4. N.L. Clarke, S.M. Furnell, Telephones–a survey of attitudes and practices. Comput. Secur. 24(7), 519–527 (2005) 5. N.L. Clarke, S.M. Furnell, Authenticating mobile phone users using keystroke analysis. Int. J. Inf. Secur. 6(1), 1–14 (2007) 6. L. Chang, Smartphone usage soars in US as other devices’ popularity declines (2015). Available at: https://www.digitaltrends.com/mobile/us-smartphone-usage-soars/ 7. J. Dai, J. Zhou, Multifeature-based high-resolution palmprint recognition. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 945–957 (2011) 8. Deloitte, Global Mobile Consumer Survey (2017). https://www2.deloitte.com/content/dam/ Deloitte/be/Documents/technology-media-telecommunications/global-mobile-consumersurvey-2017_belgian-edition.pdf 9. P. Dunphy, A.P. Heiner, N. Asokan, A closer look at recognition-based graphical passwords on mobile devices, in Proceedings of the 6th Symposium on Usable Privacy and Security (SOUPS) (ACM, New York, 2010), pp. 1–12 10. T. Feng, Z. Liu, K.-A. Kwon, W. Shi, B. Carbunary, Y. Jiang, N. Nguyen, Continuous mobile authentication using touchscreen gestures, in Proceedings of the 2012 IEEE Conference on Technologies for Homeland Security (HST) (IEEE, Piscataway, 2012), pp. 451–456 11. D. Fiorella, A. Sanna, F. Lamberti, Multi-touch user interface evaluation for 3D object manipulation on mobile devices. J. Multimodal User Interfaces 4(1), 3–10 (2010) 12. D. Florencio, C. Herley, A large-scale study of web password habits, in Proceedings of the 16th International Conference on World Wide Web (WWW) (ACM, New York, 2007), pp. 657–666 13. M. Frank, R. Biedert, E. Ma, I. Martinovic, D. Song, Touchalytics: on the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE Trans. Inf. Forensics Secur. 8(1), 136–148 (2013) 14. J.E. Gaffney, J.W. Ulvila, Evaluation of intrusion detectors: a decision theory approach, in Proceedings of the 2001 IEEE Symposium on Security and Privacy (2001), pp. 50–61 15. M. Goel, J.O. Wobbrock, S.N. Patel, GripSense: using built-in sensors to detect hand posture and pressure on commodity mobile phones, in Proceedings of the 25th Annual ACM symposium on User Interface Software and Technology (UIST) (ACM, New York, 2012), pp. 545–554 16. N.Z. Gong, R. Moazzezi, M. Payer, M. Frank, Forgery-resistant touch-based authentication on mobile devices, in Proceedings of the 11th ACM Asia Conference on Computer and Communications Security (2016), pp. 499–510 17. G. Gu, P. Fogla, W. Lee, B. Skoric, Measuring intrusion detection capability: an informationtheoretic approach, in Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security (ASIACCS) (ACM, New York, 2006), pp. 90–101 18. N. Gunson, D. Marshall, F. McInnes, M. Jack, Usability evaluation of voiceprint authentication in automated telephone banking: sentences versus digits. Interacting Comput. 23(1), 57–69 (2011) 19. IDC. With Expectations of a Positive Second Half of 2018 and Beyond. https://www.idc.com/ getdoc.jsp?containerId=prUS44240118. 20. A.K. Karlson, A.B. Brush, S. Schechter, Can i borrow your phone?: understanding concerns when sharing mobile phones, in Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI) (ACM, New York, 2009), pp. 1647–1650 21. M. Keith, B. Shao, P. Steinbart, The usability of passphrases for authentication: an empirical field study. Int. J. Hum. Comput. Stud. 65(1), 17–28 (2007) 22. D. Kim, P. Dunphy, P. Briggs, J. Hook, J.W. Nicholson, J. Nicholson, P. 
Olivier, Multi-touch authentication on tabletops, in Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI) (ACM, New York, 2010), pp. 1093–1102 23. L. Kotthoff, I.P. Gent, I. Miguel, An evaluation of machine learning in algorithm selection for search problems. AI Commun. 25(3), 257–270 (2012) 24. R. Lemos, Passwords: the weakest link? hackers can crack most in less than a minute (2002) http://news.com./2009-1001-916719.html

25. J. Li, J. Li, X. Chen, C. Jia, W. Lou, Identity-based encryption with outsourced revocation in cloud computing. IEEE Trans. Commun. 64(2), 425–437 (2015) 26. J. Li, Z. Liu, X. Chen, F. Xhafa, X. Tan, D.S. Wong, L-EncDB: a lightweight framework for privacy-preserving data queries in cloud computing. Knowl.-Based Syst. 79, 18–26 (2015) 27. J. Li, H. Yan, Z. Liu, X. Chen, X. Huang, D.S. Wong, Location-sharing systems with enhanced privacy in mobile online social networks. IEEE Syst. J. 11(2), 439–448 (2017) 28. J. Li, Y. Zhang, X. Chen, Y. Xiang, Secure attribute-based data sharing for resource-limited users in cloud computing. Comput. Secur. 72, 1–12 (2018) 29. D. Maio, D. Maltoni, J.L. Wayman, A.K. Jain, Fvc2000: fingerprint verification competition. IEEE Trans. Pattern Anal. Mach. Intell. 24(3), 402–412 (2002) 30. Y. Meng, Measuring intelligent false alarm reduction using an ROC curve-based approach in network intrusion detection, in Proceedings of the 2012 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications (CIMSA) (2012), pp. 108–113 31. Y. Meng, L.F. Kwok, Adaptive false alarm filter using machine learning in intrusion detection, in Proceedings of the 6th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Advances in Intelligent and Soft Computing (Springer, Berlin, 2011), pp. 573–584 32. Y. Meng, D.S. Wong, R. Schlegel, L.F. Kwok, Touch gestures based biometric authentication scheme for touchscreen mobile phones, in Proceedings of the 8th China International Conference on Information Security and Cryptology (INSCRYPT). Lecture Notes in Computer Science (Springer, Heidelberg, 2012), pp. 331–350 33. Y. Meng, D.S. Wong, L.-F. Kwok, Design of touch dynamics based user authentication with an adaptive mechanism on mobile phones, in Proceedings of the ACM Symposium on Applied Computing (2014), pp. 1680–1687 34. W. Meng, D.S. Wong, S. Furnell, J. Zhou, Surveying the development of biometric user authentication on mobile phones. IEEE Commun. Surv. Tutorials 17(3), 1268–1293 (2015) 35. W. Meng, Evaluating the effect of multi-touch behaviours on Android unlock patterns. Int. J. Inf. Comput. Secur. 24(3), 277–287 (2016) 36. W. Meng, W. Li, D.S. Wong, J. Zhou, TMGuard: a touch movement-based security mechanism for screen unlock patterns on smartphones, in Proceedings of the 14th International Conference on Applied Cryptography and Network Security (ACNS) (2016), pp. 629–647 37. W. Meng, W. Li, L. Jiang, L. Meng, On multiple password interference of touch screen patterns and text passwords, in Proceedings of ACM Conference on Human Factors in Computing Systems (2016), pp. 4818–4822 38. W. Meng, W.H. Lee, S.R. Murali, S.P.T. Krishnan, JuiceCaster: towards automatic juice filming attacks on smartphones. J. Netw. Comput. Appl. 68, 201–212 (2016) 39. W. Meng, W. Li, L.-F. Kwok, K.-K.R. Choo, Towards enhancing click-draw based graphical passwords using multi-touch behaviours on smartphones. Comput. Secur. 65, 213–229 (2017) 40. W. Meng, L. Jiang, Y. Wang, J. Li, J. Zhang, Y. Xiang, JFCGuard: detecting juice filming charging attack via processor usage analysis on smartphones. Comput. Secur. 76, 252–264 (2018) 41. W. Meng, W. Li, D.S. Wong, Enhancing touch behavioral authentication via cost-based intelligent mechanism on smartphones. Multimed. Tools Appl. 77(23), 30167–30185 (2018) 42. W. Meng, Z. 
Liu, TMGMap: designing touch movement-based geographical password authentication on smartphones, in The 14th International Conference on Information Security Practice and Experience (ISPEC 2018) (2018), pp. 373–390 43. Millennial Media. Mobile mix: The mobile device index (2012). Available at: http://www. millennialmedia.com/research 44. Mobile and NCSA. Report on Consumer Behaviors and Perceptions of Mobile Security (2012). Available at: http://docs.nq.com/NQ_Mobile_Security_Survey_Jan2012.pdf 45. T.V. Nguyen, N. Sae-Bae, N. Memon, DRAW-A-PIN: authentication using finger-drawn PIN on touch devices. Comput. Secur. 66, 115–128 (2017)

46. Y. Numabe, H. Nonaka, T. Yoshikawa, Finger Identification for touch panel operation using tapping fluctuation, in Proceedings of the IEEE 13th International Symposium on Consumer Electronics (2009), pp. 899–902 47. S. Pokharel, K.K.R. Choo, J. Liu, Mobile cloud security: an adversary model for lightweight browser security. Comput. Stand. Interfaces 49, 71–78 (2017) 48. R. Potharaju, A. Newell, C. Nita-Rotaru, X. Zhang, Plagiarizing smartphone applications: attack strategies and defense techniques, in Proceedings of the 2012 International Symposium on Engineering Secure Software and Systems (ESSoS). Lecture Notes in Computer Science (Springer, Heidelberg, 2012), pp. 106–120 49. M. Pusara, C.E. Brodley, User Re-authentication via mouse movements, in Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security (VizSEC/DMSEC) (ACM, New York, USA, 2004), pp. 1–8 50. D. Quick, K.K.R. Choo, Pervasive social networking forensics: intelligence and evidence from mobile device extracts. J. Netw. Comput. Appl. 86, 24–33 (2017) 51. J. Ranjan, K. Whitehouse, Automatic authentication of smartphone touch interactions using smartwatch, in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (2016), pp. 361–364 52. N. Sae-Bae, N. Memon, K. Isbister, K. Ahmed, Multitouch gesture-based authentication. IEEE Trans. Inf. Forensics Secur. 9(4), 568–582 (2014) 53. H. Saevanee, P. Bhattarakosol, Authenticating user using keystroke dynamics and finger pressure, in Proceedings of the 6th IEEE Conference on Consumer Communications and Networking Conference (CCNC) (IEEE, Piscataway, 2009), pp. 1078–1079 54. F. Schaub, R. Deyhle, M. Weber, Password entry usability and shoulder surfing susceptibility on different smartphone platforms, in Proceedings of the 11th International Conference on Mobile and Ubiquitous Multimedia (MUM) (ACM, New York, 2012), pp. 1–10 55. N.A. Schmid, M.V. Ketkar, H. Singh, B. Cukic, Performance analysis of iris-based identification system at the matching score level. IEEE Trans. Inf. Forensics Secur. 1(2), 154–168 (2006) 56. A. Shabtai, Y. Fledel, U. Kanonov, Y. Elovici, S. Dolev, C. Glezer, Google Android: a comprehensive security assessment. IEEE Secur. Priv. 8(2), 35–44 (2010) 57. M. Shahzad, A.X. Liu, A. Samuel, Behavior based human authentication on touch screen devices using gestures and signatures. IEEE Trans. Mob. Comput. 16(10), 2726–2741 (2017) 58. V. Sharma, R. Enbody, User authentication and identification from user interface interactions on touch-enabled devices, in Proceedings of the 10th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec) (2017), pp. 1–11 59. M. Smith-Creasey, M. Rajarajan, A continuous user authentication scheme for mobile devices, in Proceedings of the 14th Annual Conference on Privacy, Security and Trust (PST) (2016), pp. 104–113 60. R. Sommer, V. Paxson, Outside the closed world: on using machine learning for network intrusion detection, in Proceedings of the 2010 IEEE symposium on Security and Privacy (2010), pp. 305–316 61. Y. Song, Z. Cai, Z.-L. Zhang, Multi-touch authentication using hand geometry and behavioral information, in Proceedings of IEEE Symposium on Security and Privacy (2017), pp. 357–372 62. F. Tari, A.A. Ozok, S.H. Holden, A comparison of perceived and real shoulder-surfing risks between alphanumeric and graphical passwords, in Proceedings of the 2nd Symposium on Usable Privacy and Security (SOUPS)(ACM, New York, 2006), pp. 56–66 63. M. 
Temper, S. Tjoa, M. Kaiser, Touch to authenticate—continuous biometric authentication on mobile devices, in Proceedings of the 2015 International Conference on Software Security and Assurance (ICSSA) (2015), pp. 30–35 64. The University of Waikato. WEKA-Waikato Environment for Knowledge Analysis. Available at: http://www.cs.waikato.ac.nz/ml/weka/

65. P.S. The, N. Zhang, A.B.J. Teoh, K. Chen, Recognizing your touch: towards strengthening mobile device authentication via touch dynamics integration, in Proceedings of the 13th International Conference on Advances in Mobile Computing and Multimedia (MoMM) (2015), pp. 108–116 66. S. Trewin, C. Swart, L. Koved, J. Martino, K. Singh, S. Ben-David, Biometric authentication on a mobile device: a study of user effort, error and task disruption, in Proceedings of the 28th Annual Computer Security Applications Conference (ACSAC) (2012), pp. 159–168 67. D. Van Thanh, Security issues in mobile eCommerce, in Proceedings of the 11th International Workshop on Database and Expert Systems Applications (DEXA) (IEEE, Piscataway, 2000), pp. 412–425 68. R. Wallace, M. McLaren, C. McCool, S. Marcel, Cross-pollination of normalisation techniques from speaker to face authentication using Gaussian mixture models. IEEE Trans. Inf. Forensics Secur. 7(2), 553–562 (2012) 69. D. Wang, H. Cheng, P. Wang, X. Huang, G. Jian, Zipf’s law in passwords. IEEE Trans. Inf. Forensics Secur. 12(11), 277–2791 (2017) 70. J. Yan, A. Blackwell, R. Anderson, A. Grant, Password memorability and security: empirical results. IEEE Secur. Priv. 2(5), 25–31 (2004) 71. S. Zahid, M. Shahzad, S.A. Khayam, M. Farooq, Keystroke-based user identification on smart phones, in Proceedings of Recent Advances in Intrusion Detection. Lecture Notes in Computer Science (Springer, Berlin, 2009), pp. 224–243 72. X. Zhao, T. Feng, W. Shi, I.A. Kakadiaris, Mobile user authentication using statistical touch dynamics images. IEEE Trans. Inf. Forensics Secur. 9(11), 1780–1789 (2014) 73. N. Zheng, K. Bai, H. Huang, H. Wang, You are how you touch: user verification on smartphones via tapping behaviors, in Proceedings of the 2014 International Conference on Network Protocols (ICNP) (2014), pp. 221–232

Constellation-Based Deep Ear Recognition Dejan Štepec, Žiga Emeršič, Peter Peer, and Vitomir Štruc

1 Introduction This chapter introduces COM-Ear, a deep constellation model for ear recognition. Different from competing solutions, COM-Ear encodes global as well as local characteristics of ear images and generates descriptive ear representations that ensure competitive recognition performance. The model is designed as a dual-path convolutional neural network (CNN), where one path processes the input in a holistic manner and the second captures local image characteristics from image patches sampled from the input image. A novel pooling operation, called patch-relevant-information pooling, is also proposed and integrated into the COM-Ear model. The pooling operation helps to select features from the input patches that

The authors “Dejan Štepec” and “Žiga Emeršič” contributed equally to this work. D. Štepec XLAB d.o.o., Ljubljana, Slovenia e-mail: [email protected] Ž. Emeršič () · P. Peer Computer Vision Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia e-mail: [email protected]; [email protected] V. Štruc Laboratory of Artificial Perception, Systems and Cybernetics, Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia e-mail: [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_8


are locally important and to focus the attention of the network on image regions that are descriptive and important for representation purposes. The model is trained in an end-to-end manner using a combined cross-entropy and center loss. Extensive experiments on the recently introduced Extended Annotated Web Ears (AWEx) dataset demonstrate the competitiveness of COM-Ear compared to existing ear recognition models.

Ear recognition refers to the task of recognizing people from ear images using computer vision techniques. Ears offer appealing characteristics when used in automated recognition systems, such as the ability to distinguish identical twins [49], the potential to supplement other biometric modalities (e.g., faces) [66, 73], or the ability to capture images from a distance and without explicit cooperation of the subjects one is trying to recognize. Person recognition based on ear images has seen a steady rise in popularity over recent years. Nevertheless, despite significant advancements in this area and the shift towards deep-learning-based models, nuisance factors such as ear occlusions and the presence of ear accessories still adversely affect the performance of existing recognition models. Moreover, while research on ear recognition has long been focused on recognition problems in controlled imaging conditions, the recent switch to unconstrained image acquisition conditions brought about new challenges related to extreme appearance variability caused by blur, illumination, and view-direction changes, which were thus far not considered problematic for ear recognition. These extreme conditions pose considerable challenges to existing ear recognition models and have so far not been addressed properly in the literature.

Existing approaches to address the challenges encountered in unconstrained imaging conditions focused mostly on fine-tuning existing deep learning models. In a recent competition, organized around the problem of unconstrained ear recognition [25], for example, most participants used established models, such as VGG-16 or Inception-ResNets pretrained on ImageNet data, as a baseline and then fine-tuned the models on the training data of the competition. Other recent deep learning solutions [5, 23, 34, 54, 79] in this area also followed a similar approach and used pre-existing models or employed transfer learning and domain adaptation techniques to adapt the models for ear recognition. A common aspect of these works is the fact that the models were not designed specifically for ear recognition and processed ear images holistically, ignoring the particularities and existing problems of ear recognition technology.

In this chapter, we take a step further and present a novel (deep) constellation model for ear recognition (COM-Ear) that addresses some of the problems seen with competing ear recognition approaches. As we show in the experimental section, the model ensures state-of-the-art recognition performance for unconstrained ear recognition and exhibits a significant increase in robustness to the presence of partial ear occlusions (typically caused by ear accessories) compared to other techniques in this area. The proposed COM-Ear model is designed around a Siamese architecture that takes an ear image and local image patches (sampled from a fixed grid) as input and generates an image representation that encodes both global and local ear characteristics at the output. To generate the output image representation


features from different patches are first combined using a newly proposed pooling operation, called patch-relevant-information pooling (or PRI-pooling), and then concatenated with the global image features. Our constellation model is not limited to specific model topologies and can be built around any recent deep-learning model. We, hence, evaluate and analyze different backbone models (i.e., ResNet-18, ResNet-50, and ResNet-152) for the implementation of COM-Ear. We train the proposed constellation model end-to-end using a combination of cross-entropy and center losses as our learning objectives. To the best of our knowledge, this is the first attempt at designing and training a deep constellation model in the field of ear recognition. We evaluate the model in rigorous experiments on the challenging Extended Annotated Web Ears (AWEx) dataset [25, 26]. To demonstrate the robustness of the COM-Ear model to ear accessories and occlusions, we perform additional experiments using an artificially generated dataset [27] of images where accessories are added to the ear images. The results of our experiments show that the proposed COM-Ear model is a viable solution for the problem of ear recognition that ensures state-of-the-art performance and exhibits a considerable level of robustness to various factors adversely affecting the recognition accuracy of competing approaches. To summarize, the main contributions of this chapter are the following:
• We present and describe COM-Ear, the first ever deep constellation model for the problem of ear recognition that ensures state-of-the-art performance on the most challenging dataset of ear images available.
• We introduce a novel pooling operation, called patch-relevant-information pooling (or PRI-pooling), that is able to select features from image patches that are locally important, and integrate it into the COM-Ear model.
• We make all code, models, and trained weights publicly available via http://ears.fri.uni-lj.si and provide a strong baseline for future research in the field of unconstrained ear recognition.
The rest of the chapter is structured as follows. In Sect. 2 we present the background and related work. Here, we discuss techniques for ear recognition in constrained settings as well as methods focusing on ear recognition in the wild. In Sect. 3 we describe the proposed constellation model in detail and elaborate on the idea behind the model, its architecture, and training procedure. In Sect. 4 we present a rigorous experimental evaluation of COM-Ear and discuss results. We also present a qualitative analysis to highlight the characteristics of our model. We conclude the chapter in Sect. 5 with some final remarks and directions for future work.

2 Related Work The literature on automated ear recognition is extensive, ranging from early geometry-based recognition techniques to more recent deep-learning models. The field has also recently seen a shift away from ear datasets captured in constrained


environments towards unconstrained settings, which more closely reflect real-world imaging conditions. In this section we briefly survey the most important work in the field to provide the necessary context for our contributions. For a more complete coverage of ear recognition technology, the reader is referred to some recent surveys [26, 56].

2.1 Ear Recognition in Constrained Conditions Until recently, most of the research on ear recognition was focused on controlled imaging conditions where the appearance variability of ear images was carefully controlled and typically limited to small head rotations and minute changes in the external lighting conditions [26]. Techniques for ear recognition (from 2D color/intensity images) proposed in the literature during this period can conveniently be grouped into the following categories [26, 56]: (1) geometric approaches, (2) global (holistic) approaches, (3) local (descriptor-based) approaches, and (4) hybrid approaches. Geometric approaches dominated the early days of ear recognition [2, 26] and were often aided by manual intervention. Techniques from this group rely on certain geometric properties of ears and exploit relationships between predefined parts of ears. The first fully automated ear recognition procedure exploiting geometric characteristics was presented by Moreno et al. [46] and made use of ear geometric description and a compression network. Some other examples of geometric ear recognition include [11, 15, 16, 47, 60]. One of the more recent publications using a geometric approach is the work of Chowdhury et al. [18]. Here, the authors present a complete recognition pipeline including an AdaBoost-based ear detection technique. The Canny edge detector is employed to extract edge features from images, and ear comparisons are done using similarity measurements. The second group of ear recognition approaches, global (also referred to as holistic) techniques, exploit the global appearance of the ear and encode the ear structure in a holistic manner. Even though techniques from this group seem to represent an obvious way to tackle ear recognition and improve upon the geometric methods, they are relatively sensitive to variations in illumination, pose, or the presence of occlusions. Some of the earliest examples of global approaches include the work from Hurley et al. [37] that relied on the force field transform to encode ear images, methods using principal component analysis [14, 68], and others [1, 3, 8, 48, 72, 75, 77]. The third group of approaches, local techniques, extract information from local image areas and use the extracted information for identity inference. As emphasized in the survey by Emeršič et al. [26], two groups of techniques can in general be considered local-based: techniques that first detect interest points in the image and then compute descriptors for the detected interest points, and techniques that compute descriptors densely over the entire image based on a sliding window approach. Examples of techniques from the first group include [7, 12, 59].


A common characteristic of these techniques is the description of the interest points independently of one another, which makes it possible to design matching techniques with robustness to partial occlusions of the ear area. Examples of techniques from the second group include [6, 9, 13, 17, 39, 42, 70]. These techniques also capture the global properties of the ear in addition to the local characteristics, which commonly results in higher recognition performance, but the dense descriptor-computation procedure comes at the expense of robustness to partial occlusions. Nonetheless, trends in ear recognition favored dense descriptor-based techniques, primarily due to their computational simplicity and high recognition performance. The last group of ear recognition approaches, hybrid methods, typically describe the input images using both global and local information. Techniques from this group first represent ear images using some local (hand-crafted) image descriptor (e.g., SIFT, BSIF, or HOG) that captures local image properties and then encode the extracted descriptor using a global (holistic) subspace projection technique, such as principal component analysis (PCA), linear discriminant analysis (LDA), or other related techniques [56]. Hybrid approaches and (powerful) local-descriptor-based methods represented the state-of-the-art in ear recognition for a considerable period of time and were only recently outperformed by deep-learning-based models [25, 29]. The model introduced in this chapter builds on these techniques and, similar to hybrid methods, also tries to capture global ear characteristics as well as local ear details. In this sense it is related to hybrid techniques from the literature, but offers significant performance improvements, as evidenced by the results presented in Sect. 4. In addition to the ear recognition approaches described above, some works focus on solving specific issues regarding ear recognition, such as occlusions or alignment [55, 61, 62, 69, 76, 78, 80]. Furthermore, existing research related to ear recognition also studies multi-modal biometric systems that incorporate ear images into the recognition procedure [4, 31, 53], different data-acquisition techniques, such as light-field cameras [63], and other ways of ear-based recognition that do not rely on visual information [45].

2.2 Ear Recognition in Unconstrained Conditions More recent work on ear recognition is increasingly looking at unconstrained image acquisition conditions, where the appearance variability of ear images is considerably greater than what is seen in constrained settings. Several ear datasets have been proposed for research in such unconstrained scenarios, starting with the Annotated Web Ears (AWE) dataset [26], the Unconstrained Ear Recognition Challenge (UERC) dataset [25], and others. Research on these datasets is dominated by deep learning models based primarily on convolutional neural networks (CNNs), while techniques using local, hand-crafted descriptors are few and far between.


Numerous deep learning models have been presented for ear recognition over the course of the last 2 years [23, 25, 30, 34, 54, 65, 67, 79], all significantly outperforming local-descriptor-based and hybrid methods in the most challenging scenarios [24, 25, 28, 67]. One of the earliest approaches to ear recognition using deep neural networks used aggressive data augmentation to satisfy the training needs [24]. Another example of one of the earliest uses of deep learning was presented by Galdámez et al. [32]. Here, the authors used a Haar-based detection procedure, but used a CNN for recognition. Another work using deep neural networks to facilitate ear recognition was presented by Tian and Mu [67]. However, the authors focused on datasets captured in constrained environments and did not evaluate the performance of their model on more recent datasets captured in unconstrained conditions. The work of Eyiokur et al. [30] introduced a new (constrained) dataset of ear images, which was used to train AlexNet, VGG, and LeNet for ear recognition. The models were later fine-tuned on the UERC data and tested in unconstrained conditions. Another work using AlexNet for ear recognition is [5]. Dodge et al. [23] developed a deep neural network for ear recognition and tested their model on the unconstrained AWE and CVLE datasets. In the work of Ying et al. [74] a shallow CNN architecture was presented, but the ear dataset used for testing was not described. In the work of Zhang et al. [79] the authors presented a new dataset of ear video sequences captured with a mobile phone. For recognition the authors used the fully convolutional SPPNet (spatial pyramid pooling network), capable of accepting variable-sized input images. In this chapter we build on the outlined body of work and introduce a novel deep-learning model for ear recognition in unconstrained conditions. Similar to existing work, our model relies on a CNN to learn a descriptive representation of ear images, but unlike competing solutions it captures not only global information but also local image cues that may be important for recognition purposes. Moreover, due to the design of the model, the amount of local information that is added to the learned descriptor adapts to the content of the input images.

3 COM-Ear: A Deep Constellation Model for Ear Recognition In this section we present the general structure and the idea behind our (deep) constellation model for ear recognition (COM-Ear). COM-Ear represents, to the best of our knowledge, the first part-based deep learning model for ear recognition. Compared to competing deep learning models, which process images in a holistic manner, COM-Ear also takes into account local image information and combines it with holistic features, which (as we show in the experimental section) improves robustness to occlusions and leads to state-of-the-art results on established benchmarks.


3.1 Motivation Deep learning models, and particularly convolutional neural networks (CNNs), which learn discriminative image representations from the whole input image, have recently achieved great results in all areas of biometrics, including ear recognition [24]. However, global methods that process input images as a whole are in general sensitive to illumination changes, occlusions, pose variations, and other factors typically present in unconstrained real-world environments. Local methods that extract discriminative information from local image regions, on the other hand, are typically more robust to occlusions and related nuisance factors and represent a viable alternative to global approaches. CNNs are by design global in nature, but with their hierarchical design and characteristics, such as convolutional kernels with local connectivity, high dimensionality, and non-linearity, they are also capable of encoding local discriminative information exceptionally well. However, because of the nature of operations in CNNs, this local information is aggregated and propagated along the model layers and the amount of local information that is preserved is limited to the most discriminative parts of the input. In unconstrained settings, different parts of the input might be occluded or differently illuminated and, especially in the case of ear recognition, different accessories might be present on different parts of the ear, which greatly affects the performance of such methods. With COM-Ear we try to address these issues and present a deep model with the following characteristics:
• Aggregation of global and local information: We design our model to follow the approach of hybrid techniques, which were popular before the era of deep learning and utilized both global and local information. Compared to traditional hybrid approaches, we design COM-Ear in a fully convolutional way such that both global and local information are captured by a single CNN model, resulting in a highly descriptive and discriminative image representation that can be used for identity inference.
• Selective attention to image parts: Local features are combined in a novel way which gives the model the capability to focus its attention on locally important discriminative parts. The proposed model also offers a straightforward way of exploring the importance of each image part, which contributes towards the high explainability of the proposed model.
• Robustness to occlusions: Our method is designed to specifically address issues known to be problematic for ear recognition. Specifically, it offers a natural way of decreasing sensitivity to partial occlusions of the ear that are typically caused by the presence of ear accessories.

3.2 Overview of COM-Ear The architecture of the COM-Ear model is presented in Fig. 1. The model is designed as a dual-path architecture, where the first path (marked (1) in Fig. 1)


Fig. 1 Overview of the proposed Deep Constellation Model for Ear Recognition (COM-Ear). The model is designed as a two-path architecture. First path, denoted by (1) represents the global processing path that encodes the input image at a holistic level using a backbone CNN-based feature extractor denoted by (1b). The second path, denoted by (2) represents the local patch-based processing path which extracts features from local image patches via the backbone CNN, denoted by (2b). Local features are then combined with the PRI-Pooling operator, denoted by (2c). Global and local features are concatenated in (3) and used in the fully connected layers (4) and (5) to predict outputs and to compute losses during training

encodes global information about the ear appearance and the second path (marked (2) in Fig. 1) captures local image cues to supplement the information extracted from the global processing path. For both paths a CNN-based feature extractor is used as the backbone model. Below we describe all parts of the model in detail.

3.2.1 The Global Processing Path

The input to the global processing path (1) is the whole ear image (1a) as shown in Fig. 1. From this input, a feature representation is computed using a backbone CNN (1b). While different models could be used for this task, we select different ResNet incarnations (ResNet-18, ResNet-50, ResNet-152) [35] because of their state-of-the-art performance in many vision tasks and the fact that open-source implementations are readily available. To make full use of the ResNet models, d-dimensional features are taken from the last pooling layer of the models and the pooling operation is replaced by adaptive pooling to make the model applicable to differently sized input images.
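To make this concrete, the following is a minimal PyTorch sketch of such a global feature extractor, assuming torchvision's ResNet-18 as the backbone; the class name and exact wiring are illustrative and not the authors' released implementation.

```python
import torch.nn as nn
from torchvision import models

class GlobalPath(nn.Module):
    """Holistic feature extractor: a ResNet backbone whose fixed-size average
    pooling is replaced by adaptive pooling, so that differently sized inputs
    all yield a fixed-length d-dimensional descriptor (d = 512 for ResNet-18)."""
    def __init__(self, pretrained=True):
        super().__init__()
        backbone = models.resnet18(pretrained=pretrained)  # ImageNet weights
        # keep all layers up to (but excluding) the original avg-pool and fc head
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                 # x: (B, 3, H, W)
        f = self.features(x)              # (B, 512, H/32, W/32)
        return self.pool(f).flatten(1)    # (B, 512)
```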

3.2.2 The Local Processing Path

For the second processing path (2), the input image is first split into N smaller patches (2a). The patches are then fed to the COM-Ear model for processing. The


local processing path is designed as a parallel Siamese architecture (2b) with shared model parameters and the same backbone feature extractor as used in the global processing path (1b). Feature representations are extracted from each of the patches and aggregated using max-pooling with a kernel size of 1 (i.e., a max operation along the patch dimension) (2c). We opted for a Siamese architecture to reduce the number of parameters that need to be learned during training and to decrease the possibility of over-fitting. With our scalable design we are able to exploit information from a variable number of input patches with no influence on the number of parameters and without changes to the network topology. To aggregate information from the local features we use a novel pooling procedure we refer to as patch-relevant-information pooling, or PRI-pooling for short. The idea behind PRI-pooling is to use only the most relevant information from the local image patches in the final image representation. For the proposed pooling procedure, every patch is first passed through the Siamese CNN ensemble to get a set of N corresponding d-dimensional feature representations—similar to TI-Pooling [43]. A max pooling operation is then applied on the N feature vectors along the patch dimension to generate the final aggregated d-dimensional feature representation for the local processing path. PRI-pooling is applied on the feature vectors in order to obtain features that are locally important. We argue that with patches as an input to the PRI-pooling topology, the learned features capture local information that is relevant and may supplement the global features produced by the global processing path of the COM-Ear model. With this approach the network can automatically infer which parts of the input image are important and, vice versa, these regions can be identified from the composition of the final aggregated feature vector produced by the local processing path. Thus, the PRI-pooling operation gives our COM-Ear model a level of explainability not available with purely holistic competitors. We provide a few qualitative examples of this explainability in the experimental section.
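A minimal sketch of the PRI-pooling operation described above (an element-wise maximum over the patch dimension); the tensor shapes and names are assumptions made for illustration.

```python
import torch

def pri_pooling(patch_features):
    """PRI-pooling: element-wise maximum over the patch dimension.

    patch_features: tensor of shape (B, N, d) holding the d-dimensional
    descriptors of the N patches of each image in the batch.
    Returns the (B, d) aggregated local descriptor and the index of the
    patch that supplied each feature value."""
    pooled, source_patch = patch_features.max(dim=1)
    return pooled, source_patch
```

The returned indices record which patch contributed each dimension of the aggregated descriptor, which is the basis for the qualitative patch-importance analysis in Sect. 4.7.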

3.2.3 Combining Information

Global and local features are combined in the final part of the COM-Ear model by simple concatenation. Given that the feature representation from each model path is d-dimensional, the combined feature representation comprises 2d elements. As we show in the experimental section, both types of features are important for the performance of the COM-Ear model, as with holistic features or local features alone we are not able to match the performance of the combined representation. In particular, the performance of the local features is observed to be limited when no holistic information is used. We believe the reason for this is that the global processing path of our model affects the learning of the local processing path due to the end-to-end learning procedure designed for the proposed architecture. The combined holistic and local features are passed through another series of fully


connected layers where the final layer is a softmax layer upon which a loss is defined during training. The softmax classification layer can also be used at run-time for closed-set recognition experiments.

3.3 Model Training To train the COM-Ear model, we design an end-to-end learning procedure and a combined training objective, defined as follows:

$L_{total} = L_S + \lambda L_C$,   (1)

where L_S denotes the cross-entropy loss defined on the softmax COM-Ear layer, L_C stands for the center loss defined on the features that represent inputs to the first fully connected layer, and λ represents a hyperparameter that balances the two losses. The motivation behind the center loss is to enhance the discriminative power of the learned features. The cross-entropy loss forces the deep features of different classes to be separable, while the center loss efficiently pulls the deep features closer to their corresponding centers (learned on mini-batches). In this way inter-class feature differences are enlarged and intra-class feature variations are reduced, effectively making the features more discriminative. Our preliminary experiments during the design phase suggested that the inclusion of the center loss is highly important, as the results without the center loss were significantly less convincing.
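The combined objective of Eq. (1) can be sketched in PyTorch as follows. The simplified CenterLoss module (with jointly learned centers) is only an illustrative stand-in for the original center-loss update rule, and the class count and feature dimension are placeholder values.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Simplified center loss: pulls each deep feature toward a learnable
    center of its class (a sketch of the idea, not the exact original update)."""
    def __init__(self, n_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))

    def forward(self, features, labels):
        # mean squared distance between features and their class centers
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

cross_entropy = nn.CrossEntropyLoss()
center_loss = CenterLoss(n_classes=336, feat_dim=1024)  # placeholder sizes (2d for ResNet-18)
lam = 0.003                                             # the λ of Eq. (1), see Table 1

def total_loss(logits, features, labels):
    # L_total = L_S + λ * L_C
    return cross_entropy(logits, labels) + lam * center_loss(features, labels)
```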

3.4 Implementation Details We implement our model using PyTorch and use built-in implementations of ResNet (ResNet-18, ResNet-50, ResNet-152) models as the backbone CNN-based feature extractor. All backbone models are used with weights pretrained on the ImageNet dataset. We modify the models in order to accept arbitrarily sized input images and replace all average pooling layers with adaptive average pooling operations. The outputs of the adaptive average pooling layers are used as features in both the global and local processing paths. As described above, local features from the patches are aggregated with the proposed PRI-pooling operation, which is implemented as an element-wise maximum over the patch dimension. Both the aggregated local features and the global holistic features are of the same dimension (512 for ResNet-18 and 2048 for ResNet-50 and ResNet-152) and are combined using a simple concatenation operation. The concatenated inputs are transformed via a fully connected layer whose output dimension matches the feature dimension of a single path (e.g., 512 for ResNet-18), and this represents the input to the final fully connected softmax layer.
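A rough sketch of how the two descriptors might be combined, following the description above (concatenation, a fully connected layer back to the single-path dimension, and a final classification layer); class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class COMEarHead(nn.Module):
    """Fuses the global and PRI-pooled local descriptors (each of size d)
    and produces class logits for the softmax / cross-entropy loss."""
    def __init__(self, d=512, n_classes=336):         # d = 512 for a ResNet-18 backbone
        super().__init__()
        self.fc_embed = nn.Linear(2 * d, d)            # 2d-dimensional concatenation -> d
        self.classifier = nn.Linear(d, n_classes)      # final fully connected softmax layer

    def forward(self, global_feat, local_feat):
        descriptor = torch.cat([global_feat, local_feat], dim=1)  # (B, 2d) ear descriptor
        hidden = self.fc_embed(descriptor)
        logits = self.classifier(hidden)               # fed to the cross-entropy loss
        return descriptor, logits
```

The 2d-dimensional concatenation is also the representation used as the ear descriptor in the open-set experiments of Sect. 4.4.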


4 Experiments and Results In this section we describe the experiments performed to highlight the main characteristics of the proposed COM-Ear model. Since our focus is the ability of the proposed model to perform well in unconstrained environments, we used the Extended Annotated Web Ears (AWEx) dataset for our experiments. However, the dataset contains a limited number of images. Based on our previous experience and findings [24] we used aggressive data augmentation to support training and to prevent overfitting. The AWEx dataset is then used to train and evaluate different variations of the model—reduced patch size and omission of center loss. We compare our model directly to some of the state-of-the-art approaches. Furthermore, to evaluate the performance of our proposed COM-Ear model as well as possible, we also present a comparison with the deep-learning approaches submitted to the 2017 Unconstrained Ear Recognition Challenge [25]—a recent group-benchmarking effort of ear recognition technology applied to data captured in unconstrained conditions. Additionally, we also evaluate the robustness of our model with regard to one of the most problematic aspects of ear recognition—occlusions. For this part of our analysis we generate a synthetic dataset with artificial ear accessories superimposed over ear images. Lastly, we present an in-depth qualitative analysis, where we first visualize the impact of the proposed patch-based processing by analyzing the performance of separate parts and then visually compare the ranking performance of the proposed model against the performance of the deep-learning approaches from the 2017 Unconstrained Ear Recognition Challenge [25].

4.1 Experimental Datasets We conduct experiments on the Extended Annotated Web Ears (AWEx) dataset, which represents one of the largest datasets of unconstrained ear images available. Images from the dataset were gathered from the web and therefore exhibit a significant level of variability due to differences in head rotations, illumination, age, gender, race, occlusion, and other factors. A few example images intended to highlight the difficulty of the dataset are presented in Fig. 2. The ear images in the dataset are tightly cropped and are not normalized in size. A total of 336 subjects and 4004 images are available in the AWEx dataset and used in our experiments. We use the dataset in identification experiments and follow various experimental protocols to be able to compare our model with published results from the literature. These protocols include two evaluation protocols from UERC 2017 [25], which allow for a direct comparison with approaches from the challenge.


Fig. 2 Sample images from the AWEx dataset. As can be seen, the images exhibit a wide range of appearance variability due to different resolution, ethnicity, varying levels of occlusion, presence/absence of accessories, head rotations and other similar factors

4.2 Performance Metrics As already indicated above, we perform identification experiments to evaluate the COM-Ear model and compare it to existing approaches. Identification aims at predicting the identity of the given ear image, as opposed to verification experiments where the prediction is binary—whether the observed ear image belongs to a given subject or not. To measure performance in our experiments, we report the following performance metrics, wherever possible:
• The rank one recognition rate (rank-1): the percentage of probe images for which an image of the correct identity was retrieved from the gallery as the top match. If there are multiple images per class available in the gallery, the most similar image is selected and used for the rank calculation.
• The rank five recognition rate (rank-5): the percentage of probe images for which an image of the correct identity was among the top five matches retrieved from the gallery. As for rank-1, if there are multiple images per class in the gallery, the most similar sample is considered. This procedure applies to all the rank calculations.
• The area under the CMC curve (AUC): the normalized area under the cumulative match score curve (CMC), which is similar to the standard AUC


measure typically computed for receiver operating characteristic (ROC) curves. This metric measures the overall performance of the tested recognition model and is commonly used in identification experiments [25]. These identification metrics are widely used in the literature and have therefore also been selected for this work. For all described performance metrics, a higher value means better performance. The rank values range from 0 to the number of classes present in the test set, whereas for the AUC score, values range between 0 and 1 and denote the fraction of the surface area under the CMC curve. For a more in-depth explanation of the metrics used in the experiments, we refer readers to [38].
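To illustrate how these metrics can be computed from a probe-gallery comparison (using, for example, cosine similarities between descriptors, as done later in the experiments), a small NumPy sketch is given below. All function and variable names are illustrative, and the AUC is approximated simply as the mean of the CMC curve.

```python
import numpy as np

def cosine_similarity(probes, gallery):
    """Cosine similarities between row-wise L2-normalized descriptors."""
    p = probes / np.linalg.norm(probes, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return p @ g.T                                   # (num_probes, num_gallery)

def identification_metrics(similarity, probe_ids, gallery_ids, ranks=(1, 5)):
    """Rank-k recognition rates and normalized area under the CMC curve."""
    order = np.argsort(-similarity, axis=1)          # best gallery match first
    ranked_ids = gallery_ids[order]                  # identities in ranked order
    correct = ranked_ids == probe_ids[:, None]       # (num_probes, num_gallery)
    # 1-based rank of the most similar gallery image with the correct identity
    first_hit = np.where(correct.any(axis=1), correct.argmax(axis=1) + 1, np.inf)
    num_gallery = similarity.shape[1]
    cmc = np.array([(first_hit <= k).mean() for k in range(1, num_gallery + 1)])
    out = {f"rank-{k}": cmc[k - 1] for k in ranks}
    out["AUCMC"] = cmc.mean()
    return out
```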

4.3 Training Details We train the COM-Ear model using images from the AWEx dataset. For the training procedure we use the training objective in (1) with a value of λ = 0.003, as used in the original paper [71], to balance the impact of the cross-entropy and center losses. We set the learning rate to 0.01 for the cross-entropy loss and to 0.5 for the center loss. We train the model for 100 epochs with stochastic gradient descent (SGD) and a step size of 50, decay rate of 0.1, and a batch size of 32 input images and their corresponding patches. We sample patches from the input images in a grid-like fashion with overlap.1 A summary of the hyperparameters used during training is given in Table 1. To avoid overfitting, we perform data augmentation to increase the variability of the data. The importance of data augmentation in the ear recognition domain was first mentioned in [24]. However, compared to [24] we used online augmentations (i.e., augmenting the data on the fly) so that the network almost never sees the exact same image multiple times, in order to improve generalization performance. We perform data augmentation with the Imgaug2 Python library and use the following image transformations:
• horizontal flipping,
• blurring with Gaussian filters with σ in the range (0, 0.5),
• scaling by a factor in the range (0.9, 1.2), and
• rotation in the range ±30°.

Table 1 Hyperparameters used during training

  Number of epochs:                 100
  Weight decay:                     0
  Learning rate for loss function:  0.01 (0.9 momentum)
  Learning rate for center loss:    0.5
  λ (as defined above):             0.003

1 Note that we study the influence of the size of the patches in the experimental section.
2 https://github.com/aleju/imgaug.
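Assuming a PyTorch training loop, the hyperparameters in Table 1 translate roughly into the following optimizer and schedule; the model and center-loss modules below are only placeholders standing in for the components described in Sect. 3.

```python
import torch
import torch.nn as nn

# placeholders standing in for the COM-Ear network and the center-loss centers
model = nn.Linear(1024, 336)
center_loss_params = [nn.Parameter(torch.randn(336, 1024))]

optimizer = torch.optim.SGD(
    [
        {"params": model.parameters(), "lr": 0.01},        # main (cross-entropy) loss
        {"params": center_loss_params, "lr": 0.5},         # center-loss centers
    ],
    lr=0.01, momentum=0.9, weight_decay=0.0,
)
# learning-rate step size of 50 epochs with a decay rate of 0.1 (Table 1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
```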


All listed operations are performed in random order and each operation is applied with a 50% chance. With this setting there is also a (very low) chance that no augmentation is applied at all. The images are also normalized with per-channel mean and standard deviation values from ImageNet [21], as is common practice. Image patches are cropped after performing augmentations on the image so that both the image and patches are transformed in the same way. With this we ensure that the holistic and local processing paths are looking at the same input.
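A sketch of such an online pipeline with the imgaug library, using the parameter ranges listed above; the batch variable is a dummy for illustration, and per-channel ImageNet normalization and patch cropping would follow this step.

```python
import numpy as np
import imgaug.augmenters as iaa

# Each operation is applied with a 50% chance and the operations are executed
# in random order, so occasionally an image passes through unaugmented.
augmenter = iaa.Sequential(
    [
        iaa.Sometimes(0.5, iaa.Fliplr(1.0)),                     # horizontal flipping
        iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0.0, 0.5))),  # Gaussian blur
        iaa.Sometimes(0.5, iaa.Affine(scale=(0.9, 1.2))),        # scaling
        iaa.Sometimes(0.5, iaa.Affine(rotate=(-30, 30))),        # rotation in degrees
    ],
    random_order=True,
)

image_batch = np.zeros((4, 224, 224, 3), dtype=np.uint8)  # dummy batch for illustration
augmented_images = augmenter(images=image_batch)
```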

4.4 Ablation Study In our first series of experiments we investigate the impact of some of our design choices when developing the COM-Ear model. For this ablation study we, therefore, focus on separate parts of the proposed COM-Ear model and observe how specific design choices affect the performance of our model. For this experiment we follow the experimental protocol from [29] and split the available data from the AWEx dataset into two subject-disjoint sets:
• A training set comprising 1804 images of 116 subjects. These images are used to learn the parameters of the COM-Ear model (and its variants) and monitor training progress via a validation set during the learning procedure.
• A testing set comprising 2200 images of 220 subjects intended for final performance evaluation. Images from the set are used to compute performance metrics and report results.
To allow for open-set identification experiments, we perform network surgery on the COM-Ear model and use the 2d-dimensional concatenated global and local features as the descriptor for the given input ear image. To measure similarity between ear descriptors we compute cosine similarities. For the experiments, we use an initial image size of 224 × 224 pixels and a patch size of 112 × 112 pixels. Patches are sampled with a 50% overlap, resulting in a total of 9 patches for the local processing path of COM-Ear. Using the above protocol, we first explore the performance of the backbone ResNet feature extractors and compare the performance of different ResNet variants, i.e., ResNet-18, ResNet-50, and ResNet-152. We train all models on our training data using the same loss as for the COM-Ear model (see Eq. (1)) and use features from the penultimate model layer with cosine similarity for recognition. The results in Table 2 show no significant difference in the performance of the backbone models. We, therefore, select ResNet-18 as the final backbone model for COM-Ear due to its light-weight architecture compared to the other two ResNet variants. Next, we report results for the COM-Ear model obtained with and without the use of the center loss. We observe that the performance of COM-Ear drops noticeably when no center loss is used, which points to the importance of


Table 2 Ablation study for the COM-Ear model

  Method                      Rank-1 [%]   Rank-5 [%]   AUCMC [%]
  ResNet-18                   26.1         52.2         92.7
  ResNet-50                   26.1         50.8         92.6
  ResNet-152                  26.1         49.9         92.4
  COM-Ear                     31.1         54.6         93.2
  COM-Ear [no center loss]    27.1         52.5         92.6
  COM-Ear [patch size/2]      29.4         52.1         91.6

The bold values denote best performing results

the combined loss during training. Additionally, when looking at the performance difference between the backbone ResNet-18 model and COM-Ear, we see that the addition of the local processing path significantly improves performance, as the rank-1 recognition improved from 26.1% (for ResNet-18) to 31.1% (for COM-Ear). Finally, we report results for COM-Ear using smaller patches of size 56 × 56 pixels—patches are still sampled from the input image with a 50% overlap. In comparison with the initial patch size of 112 × 112 pixels sampled with a 50% overlap results are worse. These results suggest that patches need to be of a sufficient size in order to carry enough context to be informative. Smaller patches can also carry background information which is not beneficial for recognition purposes and may also introduce ambiguities among different subjects. Examples of 112 × 112 pixel input patches with 50% overlap can be viewed in Fig. 1.

4.5 Performance Evaluation Against the State-of-the-Art In our next series of experiments we benchmark the COM-Ear model against state-of-the-art models from the literature. We conduct two types of experiments to match the experimental protocols most often used by other researchers.

4.5.1 Comparison with Competing Methods

In the first experiment of this series we use the same protocol as during the ablation study. This protocol is taken from [29] and we report results from this publication for comparison purposes. Specifically, we include results for dense-descriptor-based methods relying on local binary patterns (LBPs [10, 26, 33, 57, 58]), (rotation invariant) local phase quantization features (RILPQ and LPQ [51, 52]), binarized statistical image features (BSIF [26, 40, 57]), histograms of oriented gradients (HOG, [19, 20, 26, 57]), the dense scale invariant feature transform (DSIFT, [22, 26, 42]), and patterns of oriented edge magnitudes (POEM, [26, 70]). For deep-learning-based models, we report results for ResNet-18,


Table 3 Comparative evaluation of the COM-Ear model

  Method              Rank-1 [%]   Rank-5 [%]   AUCMC [%]
  ResNet-18 [35]      24.5         48.5         91.4
  ResNet-50 [35]      25.9         49.9         92.0
  ResNet-152 [35]     26.1         52.8         92.6
  MobileNet (1/4) [36] 17.1        36.1         88.0
  MobileNet (1/2) [36] 16.0        38.5         88.5
  MobileNet (1) [36]  26.9         50.0         91.8
  LBP [50]            17.8         32.2         79.6
  HOG [19]            23.1         41.6         87.9
  DSIFT [44]          15.2         29.9         77.5
  BSIF [40]           21.4         35.5         81.6
  LPQ [51]            18.8         34.1         81.0
  RILPQ [52]          17.9         31.4         79.8
  POEM [70]           19.8         35.6         81.5
  COM-Ear             31.5         55.9         93.3

ResNet-50, and ResNet-152 (taken from [29]). Additionally, we provide results for the MobileNet model, which represents a light-weight CNN architecture developed with mobile and embedded vision applications in mind. The architecture uses two main hyperparameters that efficiently trade off between latency and accuracy [36]. These hyperparameters make it possible to tweak the size of the model in accordance with the problem domain and use-case scenarios. In this work we evaluate three such versions with different width multipliers. The lower the value, the more light-weight the model; the higher the value (the highest being 1), the heavier the footprint. In Table 3 we report results for three levels of multipliers: 1/4, 1/2, and 1 [36]. The results of this experiment are presented in Table 3 and Fig. 3. We observe that COM-Ear achieves the best overall results, improving upon the state-of-the-art by a large margin. With a rank one recognition rate of 31.5% it significantly outperforms all traditional feature extraction methods as well as all tested deep-learning-based models. The closest competitor is MobileNet (1) with a rank one recognition rate of 26.9%. Descriptor-based methods are less convincing, with the best performing method from this group achieving a rank-1 recognition rate of 23.1%, 8% less (in absolute terms) than the proposed COM-Ear model.

4.5.2 Comparison with Results from the 2017 Unconstrained Ear Recognition Challenge (UERC)

In the next experiments we compare COM-Ear on the data and experimental protocol used in the 2017 Unconstrained Ear Recognition Challenge (UERC 2017). UERC 2017 was organized as a group benchmarking effort in the scope of the 2017 International Joint Conference on Biometrics (IJCB 2017) and focused on



Fig. 3 CMC curves of the comparative evaluation of the COM-Ear Model. The results are presented in logarithmic scale to better visualize the performance differences at the lower ranks, which are more important from an application point of view. The figure is best viewed in color

assessing ear recognition technology in unconstrained settings. The challenge was conducted in part on the AWEx dataset using a slightly different protocol than the one used in the previous section. The reader is referred to [25] for details on the protocol. Several groups participated in the challenge and submitted results. Here, we include results for all deep-learning-based methods from UERC 2017—briefly summarized in Table 4—and for the three ResNet variants also tested in the previous sections. The results of the comparison are presented in Table 5 and Fig. 4. Similarly to the results from Table 3, we observe that among the ResNet models, ResNet-18 performs the best (rank-1 = 28%). The overall top performer is again COM-Ear, which achieves state-of-the-art results, improving upon the best results included in the comparison (IAU) by more than 3% in terms of rank-1 and more than 4% in terms of rank-5 recognition rates. The best performing entry from Islamic Azad University (IAU) in UERC 2017 was built around a VGG-16 architecture [64] and transfer learning. The idea of the IAU approach was to leave part of the pretrained VGG-16 model as is (i.e., with frozen weights) while retraining other parts of the model that are relevant for transferring to the new domain, i.e., ear recognition. Specifically, the authors added two fully connected layers on top of the 7th layer of the pretrained


Table 4 Summary of deep-learning-based approaches from UERC 2017 included in the comparison

  Approach       Description                                               Descriptor type          Alignment   Flipping
  IAU            VGG network (trained on ImageNet) and transfer learning   Learned                  No          No
  ICL            Deformable model and Inception-ResNet                     Learned                  Yes         Yes
  IITK           VGG network (trained on the VGG face dataset)             Learned                  No          Yes
  ITU I          VGG network (trained on ImageNet) and transfer learning   Learned                  No          Yes
  ITU II         Ensemble method (LBP + VGG-network)                       Learned + Hand-crafted   No          Yes
  LBP-baseline   Descriptor-based (uniform LBPs)                           Hand-crafted             No          No
  VGG-baseline   VGG network trained solely on the UERC training data      Learned                  No          No

The table provides a short description of each approach, information on whether ear alignment and flipping was performed and the model size (if any). See [25] for details

Table 5 Comparison with results from the Unconstrained Ear Recognition Challenge (UERC) [25]

  Method              Rank-1 [%]   Rank-5 [%]   AUCMC [%]
  ICL [25]            5.3          14.8         71.17
  IAU [25]            38.5         63.2         94.0
  IITK [25]           22.7         43.6         86.1
  ITU-I [25]          24.0         46.0         89.0
  ITU-II [25]         27.3         48.3         87.7
  LBP-baseline [25]   14.3         28.6         75.9
  VGG-baseline [25]   18.8         37.5         86.6
  ResNet-18           28.0         57.1         93.0
  ResNet-50           22.1         46.8         90.5
  ResNet-152          15.9         40.7         88.8
  COM-Ear             41.8         67.7         94.7

The results were generated on the testing split of the AWEx dataset

VGG model. Only the newly added FC layers were trained on the UERC data. Learning only certain layers while leaving other layers untouched (e.g., learned only on ImageNet) is beneficial especially in the case of smaller datasets like the one used for UERC 2017, as it prevents overfitting and thus results in features that generalize better to the new task. In our case we used center loss to make the learned features more discriminative and learn a descriptive model using limited training data. In the next experiment, we evaluate how the proposed COM-Ear model scales with larger probe and gallery sets. For this experiment we use the scale experimental protocol employed for the scale experiments in UERC 2017. The results of this test are shown in the form of CMC curves in Fig. 5. The curves were generated using



Fig. 4 CMC curves of the direct comparison to the UERC 2017 [25] recognition results. The results were generated on the testing split of the AWEx dataset and are shown in linear scale

7442 probe images belonging to 1482 subjects and 9500 gallery images of 3540 subjects. The gallery also contained identities that were not in the probe set. These samples act as distractors for the recognition techniques [25, 41]. The numerical results in Table 6 show that the proposed model performs comparably to the ITU-II approach in terms of rank-1 and rank-5 recognition rates and is very competitive even when a large number of distractor samples are introduced to the experiments. It also needs to be noted that the ITU-II technique combined two complementary CNN models and hand-crafted features to achieve this performance; COM-Ear, on the other hand, is a coherent model that relies on the same feature representation but aggregates global and local information about the appearance of the ears for identity inference.

4.6 Robustness to Occlusions In our last experiment we evaluate the robustness of our model to occlusions of the ear. We use the same protocol as during the ablation study and run two types of experiments: with and without occlusions. The experiments without occlusions


Fig. 5 CMC curves of the comparison to the UERC 2017 [25] recognition results. The results were generated on the complete test dataset of UERC containing all 3540 subjects of the AWEx dataset and multiple distractor identities. The results are again shown in logarithmic scale to highlight the performance differences at the lower ranks. The figure is best viewed in color

Table 6 Comparison with results from the 2017 Unconstrained Ear Recognition Challenge (UERC) [25]

  Method           Rank-1 [%]   Rank-5 [%]   AUCMC [%]
  ICL [25]         0.9          2.8          73.8
  IAU [25]         16.2         28.3         90.5
  IITK [25]        6.7          11.8         77.5
  ITU-I [25]       14.6         28.1         93.6
  ITU-II [25]      17.0         29.4         91.9
  LBP-Base [25]    8.7          16.7         84.3
  VGG-Base [25]    9.7          19.3         88.3
  COM-Ear          16.8         30.0         90.2

The results were generated on the complete test dataset containing all 3540 subjects of the UERC dataset

are equivalent to the experiments already presented above. For the experiments with occlusions we simulate the presence of ear accessories and place accessory images at random locations over the cropped ear images. The added accessories are of different shapes and colors and simulate a broad spectrum of real-world


Fig. 6 Example of some of the inputs with added computer-generated accessories. The ear accessories were generated in different shapes, positions and with varying sizes

Table 7 Results with accessory-based presentation attack

  Experiment           Method      Rank-1 [%]
  Without occlusions   ResNet-18   24.5
                       COM-Ear     31.1
  With occlusions      ResNet-18   16.1
                       COM-Ear     22.3

The upper rows show baseline values without the attack, and the lower rows show the results when attacked with earring images

accessories. Some of the generated images are presented in Fig. 6. As we can see, the accessories mostly cover a small area of the image, but may be as big as 20% of the image area. Results for this series of experiments are presented in Table 7 for the COM-Ear model as well as the baseline ResNet-18 model. We see that both models deteriorate in performance, but the degradation is worse for ResNet-18. These results suggest that the local processing path that encodes local image details is indeed beneficial and contributes not only to state-of-the-art performance on unconstrained ear images, but also improves robustness when accessories are present in ear images.

4.7 Qualitative Evaluation In this section we show some qualitative results related to the COM-Ear model and also with regard to approaches from UERC 2017. As discussed earlier, COM-Ear aggregates local features with the proposed PRI-pooling operation, which takes a maximum over the patch dimension to produce the 512-dimensional feature vector


(in the case of a ResNet-18 backbone) from the set of local feature vectors extracted from the image patches. The COM-Ear model allows us to determine which patch is represented in what proportion in the aggregated (local) feature vector if we examine where each value of the aggregated feature vector came from (i.e., an argmax operation; a small code sketch of this computation is given below). We show some example ear images from the AWEx dataset and their corresponding image patches in Fig. 7. Here, the fraction of features each patch contributes to the aggregated feature vector is shown below the patches. The examples in Fig. 7a–c represent the same subject with images captured in different conditions and ears in slightly different positions. We can see that similar image features are considered important for all three examples and that the importance of all patches is very similar; especially important seems to be the bottom-left patch, which has a distinct ear shape. The images in Fig. 7d, e represent input samples with earrings. We can see that patches with earrings are not weighted as heavily as one might expect. This is because the training set contains data with and without earrings, so the model can learn that earrings are not necessarily important—this may also be one of the reasons why COM-Ear is much more robust to the presence of accessories than ResNet-18. The examples in Fig. 7f, g show images with large occlusions and presence of earrings, which makes it difficult to perform recognition, as many distinct features are not clearly visible. In the example in Fig. 7f we see that the top-left patch is the most important as there is almost no occlusion. Similarly, for Fig. 7g, where the upper sections of the ear are completely occluded, the patch that has no occlusions is chosen to be the most discriminative. Patches with occlusions such as hair are down-weighted, as hair is highly variable and the model learns this fact during training. Figure 7h has no occlusions or accessories, but the image is captured in low-light conditions and at a difficult angle. The distribution of the importance of the patches is, therefore, much more uniform, as there are relevant features present along the whole ear area and no single patch dominates. In order to compare the proposed COM-Ear to others qualitatively, we show what type of images the model and the deep learning approaches from UERC 2017 retrieve from the gallery as the first (rank-1) and as the second match (rank-2) for given probe images. We also show the first correct prediction (note that there are multiple images of the correct subject in the gallery) and provide the rank at which it was retrieved. The first correct prediction is considered to be the image that is closest in the ranking and has the same identity as the probe image. For this experiment we again use the entire UERC test data with 9500 images in the gallery set. The described qualitative analysis is shown in Fig. 8 for five randomly selected probe images—shown on the left. With the top performing approaches, the images retrieved at ranks 1 and 2 exhibit a high visual similarity to the probes, as expected. Thus, even when predictions fail, the closest matches visually resemble the probe image. However, with the second, fourth, and fifth probe images all five evaluated techniques fail. For the fifth probe image the low resolution of the probe is likely the reason for the failure. The fourth probe image contains high-contrast illumination that could be the cause of the error.
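A small sketch of the per-patch importance computation referred to above: given the indices returned by the PRI-pooling max operation, the fraction of feature dimensions contributed by each patch can be counted directly. Function and variable names are illustrative.

```python
import torch

def patch_importance(source_patch, num_patches):
    """Fraction of the aggregated feature dimensions contributed by each patch.

    source_patch: LongTensor of shape (d,) holding, for every feature dimension,
    the index of the patch that supplied the maximum value during PRI-pooling."""
    counts = torch.bincount(source_patch, minlength=num_patches)
    return counts.float() / source_patch.numel()

# example: importance of 9 patches for one 512-dimensional aggregated descriptor
fractions = patch_importance(torch.randint(0, 9, (512,)), num_patches=9)
```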


Fig. 7 Example of input patches and their importance in the aggregated feature vector of the local processing path of the COM-Ear model


Fig. 8 Qualitative analysis with selected probe images. The figure shows selected probe images (on the left) and the first and second match generated by the evaluated approaches. The first retrieved image with the correct identity is also shown together with the corresponding rank, at which it was retrieved


In the second example (the second probe), the image looks fairly easy to recognize, since illumination is good, the ear is well visible and there are no ear accessories. However, we assume that the cause for the bad performance in this case could be attributed to the fact that there are many images from other subjects in the dataset that look similar. The visual similarity of images found as the closest matches seems to confirm this observation.

5 Conclusion In this chapter we introduced the first deep constellation model for ear recognition, termed COM-Ear. We evaluated the model in extensive experiments on the Extended Annotated Web Ears (AWEx) dataset and improved upon state-of-the-art results by a large margin. We showed that with the COM-Ear constellation model we not only achieve state-of-the-art results, but also contribute towards more stable recognition performance in challenging settings when parts of the ears are occluded or ear accessories are present in the images. The design of the COM-Ear model and the novel pooling procedure proposed in this chapter allowed us to visualize certain aspects of the learned ear representations and resulted in a level of interpretability not seen with competing models. By reversing the aggregation operation (i.e., the proposed PRI-pooling) we were able to obtain patch-level importance, which presents an additional novelty of our proposed model. This has important implications for future research, as similar concepts could be integrated into other models and could provide a novel mechanism to better understand what such models learn. Acknowledgements This research was supported in part by the ARRS (Slovenian Research Agency) Research Program P2-0250 (B) Metrology and Biometric Systems and the ARRS Research Program P2-0214 (A) Computer Vision. The authors thank NVIDIA for donating the Titan Xp GPU that was used in the experiments. This work was also partially supported by the European Commission through the Horizon 2020 research and innovation program under grants 688201 (M2DC) and 690907 (IDENTITY).


Person Retrieval in Surveillance Videos Using Deep Soft Biometrics Hiren J. Galiyawala, Mehul S. Raval, and Anand Laddha

1 Introduction Biometrics provides a reliable solution for establishing an individual's identity. Physical (e.g., face, fingerprint, and iris) and behavioral (e.g., voice, gait, and signature) biometric characteristics are useful in security applications [21]. Traditional biometric attributes are successful in various systems, but they are restricted for the following reasons:

• They are difficult to capture for an uncooperative person.
• Attributes captured outside a constrained environment have poor quality.
• Retrieval fails for input data of poor quality.
• Biometric attribute (e.g., face and fingerprint) acquisition is difficult when there is a considerable distance between the camera and the person.

Traditional biometric-based surveillance systems fail to establish identity for the reasons mentioned above. The following scenarios elaborate on these points. Figure 1 shows samples of surveillance frames from the AVSS 2018 challenge II database [9] in which the person of interest is in the yellow bounding box. It is difficult to retrieve the person in Fig. 1a using a face recognition system due to low resolution and poor light. A scenario with a crowd is shown in Fig. 1b; here, retrieval fails as illumination is low. However, attributes like height, hair color, gender, ethnicity, cloth color, and cloth type are useful for person retrieval under such challenging conditions. Such attributes are known as soft biometrics [19, 20, 23].


Fig. 1 Surveillance frame samples [9]

Fig. 2 Person retrieval using semantic description

Conventionally, operators carry out person retrieval by manually searching through hours of video, which is very inefficient; hence, automatic person retrieval using soft biometrics is gaining the attention of researchers. Semantic description based person retrieval is illustrated in Fig. 2. The description uses height (tall), gender (male), cloth color, and cloth type (black t-shirt and pink short) as soft biometric attributes. Such semantic descriptions are easy to understand and describe the person naturally. Soft biometrics based person retrieval has the following advantages [1, 2]:

• Soft biometric attributes can generate a description that humans understand, bridging the semantic gap between the human description and the person retrieval system.
• Soft biometric attributes are extractable from low-quality videos, whereas this is difficult with traditional biometrics due to various constraints.
• Nonintrusive capturing of soft biometrics is possible, i.e., without the cooperation of the subject.
• Soft biometric attributes are detectable from a distance, e.g., cloth color.
• Multiple soft biometrics reduce the search space in the video.


A soft biometric trait does not establish a one-to-one mapping between the query and the match, unlike conventional biometrics; it produces many-to-many matchings between the query and the persons. Therefore, it is essential to use the most discriminative soft biometrics for person retrieval. A study identifies the 13 most discriminative soft biometrics [18]. Out of these, the chapter discusses the use of height, cloth color, and gender for person retrieval. The selection of these attributes has certain advantages, e.g., a person's height is invariant to view angle and distance. The use of color [6, 25] has the following advantages:

• Color provides information about the person's clothing.
• Color sensitivity is independent of direction, view angle, and dimension.
• Color also provides immunity against noise.

It is important to note that the use of multiple soft biometrics also helps to nullify the effects of the far and near field views of the person. Gender and cloth color are identifiable in both the far and near field views. Training with color patches of different hues mitigates the drop in recognition accuracy caused by shades of a color. A calibrated camera also helps to estimate height accurately in the near or far field. Descriptive queries have some fuzziness in them, e.g., the height of a person is "Short," "Medium," or "Tall"; thus, height is mapped over a range, e.g., 160–180 cm.

The motivation for the chapter is to bridge the semantic gap between the individual description and the machine. Note that an individual soft biometric may generate a nonunique mapping; however, the chapter showcases a unique mapping between the descriptive query and the person using multiple soft biometrics. The objectives of the chapter are as follows:

1. Present the person retrieval problem in the context of soft biometrics.
2. Study the discriminative power of various soft biometrics.
3. Review the state of the art on deep learning models for person retrieval using soft biometrics.
4. Showcase deep soft biometrics based person retrieval using height, cloth color, and gender, and their use as linear filters.
5. Discuss the performance metrics of the retrieval system.
6. Carry out qualitative and quantitative analysis of the person retrieval system.

The chapter focuses on person retrieval using height, cloth color, and gender. The person retrieval algorithm uses Mask R-CNN [10] for the detection and semantic segmentation of the person. Torso color and gender are classified using fine-tuned AlexNet [12], and the AVSS 2018 challenge II (task-2) dataset [9] is used to evaluate the algorithm. Section 2 covers a literature review of person retrieval using soft biometrics with handcrafted and deep features, along with the deep learning architectures used for detection, segmentation, and classification. The person retrieval algorithm using deep features is detailed in Sect. 3. The dataset, evaluation metrics, and test results are discussed in Sect. 4. Finally, Sect. 5 concludes the chapter and discusses future explorations.


Soft biometrics provides a solution in surveillance where traditional biometrics fails to retrieve the person, and researchers are developing algorithms based on computer vision and deep learning. The next section gives a comprehensive review of person retrieval using soft biometrics with handcrafted and deep features, as well as an overview of the deep learning architectures used for detection, segmentation, and classification in the chapter.

2 Literature Review Technological advancement has helped to improve retrieval using biometrics. Early work shows the possibility of utilizing various soft biometric attributes to improve the recognition accuracy of a traditional biometric system [11]. The initial aim of incorporating soft biometrics was to augment the recognition process by fusing multiple attributes that are sufficient to discriminate the population, rather than to identify individuals. Such experiments [11], conducted on a 263-user database, show that fingerprint-based system accuracy can be improved significantly by using gender, height, and ethnicity as soft biometrics. Traditional biometrics fails to establish identity in surveillance videos due to low quality, poor illumination, uncontrolled environments, and the considerable distance between the person and the camera. Thus, further research is necessary for person identification and re-identification using soft biometric attributes. The following subsections discuss handcrafted and deep feature based attribute recognition methods for person retrieval.

2.1 Attribute Recognition for Person Retrieval Using Handcrafted Features Initial work uses handcrafted features and trains a separate classifier for each attribute. Vaquero et al. [28] propose an attribute-based search to localize a person in surveillance video. Person search is done by parsing the human body and exploring part attributes. Visual attributes include facial hair (mustache, beard, without facial hair), the presence of eyewear (glasses, sunglasses, without glasses) and headwear (hair, hat, bald), and full body features such as the dominant colors of the torso and legs. Nine Viola-Jones detectors are trained to extract each facial attribute, and a normalized color histogram for the torso and legs is calculated in HSL space. Layne et al. [13] propose human attribute recognition by training Support Vector Machine (SVM) classifiers; the attribute recognition result is used to assist person re-identification. Zhu et al. [31] introduce the Attributed Pedestrians in Surveillance (APiS) database, which includes pedestrian semantic attribute annotations. An AdaBoost classifier is trained with texture and color features for binary attribute classification. A weighted


K-Nearest Neighbors (KNN) classifier with color features classifies the multi-class attributes. Halstead et al. [8] discuss person search in surveillance frames using a particle filter; the approach builds the target query from torso color, leg color, height, build, and leg clothing type. Denman et al. [5] create an avatar from a search query in the form of a channel representation; the search query uses height, dominant color (torso and leg), and clothing type (torso and leg).

2.2 Deep Convolutional Neural Network (DCNN) In the past few years, Convolutional Neural Networks (CNNs) have been an active area of research in the computer vision community and have achieved many successes in detection, segmentation, and recognition. CNNs are designed to work on data in the form of multiple arrays; for example, a grayscale image is a 2D array and a color image is a combination of three 2D arrays. CNNs are deep, feedforward networks that generalize much better on such data than fully connected neural networks. The driving architecture for the community was proposed by Krizhevsky et al. [12] and trained on the 1000-class ImageNet dataset [3] of 1.2 million images. They proposed a deep CNN for classification containing eight layers: 5 convolutional and 3 fully connected. It is popularly known as AlexNet and is shown in Fig. 3. It uses non-saturating Rectified Linear Units (ReLUs) instead of sigmoid and tanh activation functions, which accelerates training by about 6 times at the same accuracy, and it reduces overfitting by using dropout; its design choices reduce the top-1 and top-5 error rates by 0.4% and 0.3%, respectively. AlexNet achieved state-of-the-art performance in 2012 with top-1 and top-5 error rates of 37.5% and 17.0%, respectively. The success of AlexNet has revolutionized machine learning and computer vision, and deep CNNs are now a dominant approach in detection, segmentation, and recognition algorithms. One such state-of-the-art framework is Mask R-CNN [10], which is useful for object detection and semantic segmentation.

Fig. 3 AlexNet deep CNN architecture [12]


Mask R-CNN is a region-based Convolutional Neural Network (R-CNN) with two significant components: region proposal extraction and classification. The Region Proposal Network (RPN) generates proposals for regions likely to contain an object. The second stage predicts the object class, refines the bounding box, and generates a pixel-level mask for the object in each region proposal. Mask R-CNN is an extension of Faster R-CNN [22], created by adding a branch for the prediction of segmentation masks on each Region of Interest (RoI). The segmentation mask prediction works in parallel with the classification and bounding box regression tasks; each RoI is passed through a Fully Convolutional Network (FCN) to predict the segmentation mask in a pixel-to-pixel manner. The chapter showcases the application of such algorithms for person retrieval. The following subsection summarizes research that uses deep learning architectures for classification, detection, or segmentation tasks in surveillance frames.

2.3 Attribute Recognition for Person Retrieval Using Deep Features In recent years, deep learning based approaches have gained prominence for person attribute recognition due to their feature learning ability. Li et al. [14] identify the limitations of handcrafted features (e.g., color histograms) and explore the relationship between different attributes. They study independent attributes as well as the relations among them: each attribute is treated as an independent component and a deep learning based single attribute recognition model (DeepSAR) recognizes each attribute, while a multi-attribute recognition (DeepMAR) model analyses the relations among the attributes and jointly recognizes multiple attributes. Sudowe et al. [26] study possible dependencies between different attributes and exploit the advantages of the relationships between them. They propose the Attribute Convolutional Net (ACN) to jointly train a holistic CNN model; it allows weight sharing and effectively re-uses knowledge among attributes. They consider the effect of modeling the N/A class and introduce a novel benchmark, PARSE-27k. Some notable work also explores human body structure to support attribute recognition. Zhu et al. [32] explore the relation of various attributes with body parts, e.g., short hair is most relevant to the head region. The multi-label convolutional neural network (MLCNN) then predicts multiple attributes together in a single framework; the network splits the pedestrian full-body image into 15 overlapping blocks of size 32 × 32, filters each block independently, and combines them in the cost layer. Yao et al. [30] develop a robust recognition framework that adapts to actual surveillance scenarios. The authors label attributes as global or local based on their relationship to various human body parts. For example, gender is a global attribute that can be classified using the whole body, whereas hairstyle is a local attribute related to the face region. An adaptive region localization method obtains the relevant body regions.


The Joint Recurrent Learning (JRL) model [29] uses attribute context and correlation in order to improve attribute recognition when training data is small sized with poor quality. A unified framework models inter-person image context and learn intra-person attribute correlation. A novel Recurrent Neural Network (RNN) encoder–decoder architecture covers learning jointly for pedestrian attribute correlations in a pedestrian image and in particular their sequential ordering dependencies. Sarfraz et al. [24] utilize visual cues that hint at attribute localization and inference of person’s attributes (e.g., backpack, hair) which are dependent on the human view angle. View-sensitive attribute inference helps in better end-to-end learning of the View-sensitive Pedestrian Attribute approach (VeSPA) framework. The model jointly predicts the coarse view of the person and learns specialized view-specific multi-label attribute predictions. The approaches based on handcrafted features [5, 8, 13, 28, 31] and deep features [14, 24, 26, 29, 30, 32] extract or learn the features from the person in the image which is covered by the bounding box. It creates the following problems: 1. The bounding box contains background pixels along with the person. Figure 4a shows person detection with a bounding box containing background pixels. The cluttered background may impact the person retrieval accuracy specifically for low-resolution video. 2. The bounding box is also utilized to estimate the person’s height. Figure 4b and c shows the head and feet points (marked with red) with respect to the bounding box. Nevertheless, they are not precise concerning the person’s actual head and feet points. Moreover, different pose and view change the feet points’ location in the image. It leads to incorrect height estimation which is derived with respect to the bounding box.

Fig. 4 Problems with the bounding box approach. (a) bounding box with cluttered background, (b) and (c) head and feet points with respect to the bounding box


Fig. 5 (a) Semantic segmentation without cluttered background, (b) and (c) head and feet points with respect to the semantic boundary

The chapter focuses on an approach that helps to resolve the above-discussed problems. Possible solutions are as follows:

1. Semantic segmentation [10] within the bounding box provides a precise boundary around the person (Fig. 5). It has the following advantages:
   a. Removal of the unwanted background clutter outside the semantic boundary (Fig. 5a). Such background pixels do not contribute to the classification process, e.g., torso color classification.
   b. The semantic boundary helps to extract accurate head and feet points, as shown in Fig. 5b and c (see the sketch at the end of this section). This leads to better height estimation irrespective of pose and view.
   c. It also helps to extract a precise torso region, which gives better color classification.
2. Accurate height estimation also helps to distinguish between sitting and upright persons in the frame. It reduces the search space by considering height as a filter for upright persons.

This section detailed person retrieval based on handcrafted as well as deep features, discussed various problems in existing approaches, and outlined possible solutions. The next section describes the algorithm that incorporates these solutions to improve person retrieval.
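The head and feet point extraction mentioned in solution 1b can be sketched as follows; the mask layout and the centre-of-row rule are assumptions for illustration, not the chapter's exact point-selection procedure.

```python
import numpy as np

def head_feet_points(mask):
    """Extract approximate head and feet image points from a binary person
    mask of shape (H, W), where 1 marks person pixels."""
    ys, xs = np.nonzero(mask)
    top_y = ys.min()                         # highest person pixel row -> head
    bottom_y = ys.max()                      # lowest person pixel row -> feet
    head_x = int(xs[ys == top_y].mean())     # centre of the topmost row
    feet_x = int(xs[ys == bottom_y].mean())  # centre of the bottommost row
    return (head_x, top_y), (feet_x, bottom_y)
```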


3 Person Retrieval System Using Deep Soft Biometrics This section discusses a linear filtering approach to retrieve the person using height, cloth color, and gender. Height is estimated using a camera calibration approach, while color and gender classification is done using deep features. Each soft biometric attribute acts as a linear filter to reduce the search space in the video frame. Figure 6 depicts the proposed approach. The person(s) detection block uses Mask R-CNN [10] for detection with semantic segmentation. For all detected persons, precise head and feet points are extracted using the semantic boundary, and height is calculated from the camera calibration parameters with respect to these points. The module outputs the person(s) whose height matches the queried height class (e.g., 160–180 cm); height thus acts as the first filter to reduce the search space in the frame. The height estimation output may contain multiple persons of the same height class. The torso color and gender are classified using AlexNet [6, 12], trained on a local dataset of color patches and, for gender, pedestrian images. The search space is further reduced by torso color detection: the cloth color detection block extracts a background-free color patch from the torso using semantic segmentation, a fine-tuned AlexNet classifies the torso color, and the classified color is matched with the description color, which further reduces the number of persons in the frame. The final output is refined by gender classification, which is also done with a fine-tuned AlexNet. The following subsections discuss height estimation, color classification, and gender classification for person retrieval.
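The cascade described above can be summarised with the following sketch. The callables passed in are hypothetical stand-ins for the components detailed in the next subsections (Mask R-CNN detection, calibrated height estimation, and the two fine-tuned AlexNet classifiers); their exact signatures are assumptions.

```python
def retrieve_person(frame, query, detector, height_fn, color_fn, gender_fn):
    """Linear-filter retrieval following Fig. 6: detection, then height,
    torso colour, and gender filters applied in sequence."""
    candidates = detector(frame)                       # person boxes + masks

    # Filter 1: keep only persons whose estimated height falls in the
    # queried height class (e.g., 160-180 cm).
    lo, hi = query['height_range_cm']
    candidates = [c for c in candidates if lo <= height_fn(c) <= hi]

    # Filter 2: keep only persons whose torso colour matches the description.
    candidates = [c for c in candidates
                  if color_fn(frame, c) == query['torso_color']]

    # Filter 3: apply the gender filter only if several candidates remain.
    if len(candidates) > 1:
        candidates = [c for c in candidates
                      if gender_fn(frame, c) == query['gender']]
    return candidates
```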

Fig. 6 Proposed approach of person retrieval using soft biometrics


3.1 Height Estimation Using Camera Calibration Face recognition based person identification is not suitable for surveillance videos because of their poor quality and the long distance between the camera and the person. However, many human attributes are identifiable from a distance, e.g., a person's height, which is view and distance invariant. The approach uses height to discriminate between sitting and upright persons. Height estimation is done using the Tsai [27] camera calibration parameters. The AVSS 2018 challenge II database [9] (Sect. 4.1) provides Tsai camera calibration parameters for six surveillance cameras; Table 1 lists these parameters. The calibration parameters help to convert image coordinates into real-world coordinates for height estimation. Figure 6 shows the head and feet point extraction using the semantic boundary. The height estimation steps are as follows:

Table 1 Tsai camera calibration parameters [27]
Geometric parameters:
  (image size) – image width and height
  Ncx – number of sensor elements in the X-direction
  Nfx – number of pixels in a line as sampled by the computer
  dx – center-to-center distance between adjacent sensor elements in the X (scan line) direction
  dy – center-to-center distance between adjacent CCD sensor elements in the Y-direction
  dpx – effective X dimension of a pixel in the frame
  dpy – effective Y dimension of a pixel in the frame
Intrinsic parameters:
  f – focal length of the camera
  kappa1 (k1) – first-order radial lens distortion coefficient
  cx, cy – coordinates of the center of radial lens distortion
  sx – scale factor accounting for uncertainty due to imperfections in hardware timing for scanning and digitization
Extrinsic parameters:
  Rx, Ry, Rz – rotation angles for the transformation between the world and camera coordinates
  Tx, Ty, Tz – translation components for the transformation between the world and camera coordinates

1. Calculate the intrinsic parameter matrix (k), rotation matrix (R), and translation vector (t) from the calibration parameters:

k = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}    (1)

R = \begin{bmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{bmatrix}    (2)

where

R11 = cos(Ry) cos(Rz)
R12 = cos(Rz) sin(Rx) sin(Ry) - cos(Rx) sin(Rz)
R13 = sin(Rx) sin(Rz) + cos(Rx) cos(Rz) sin(Ry)
R21 = cos(Ry) sin(Rz)
R22 = sin(Rx) sin(Ry) sin(Rz) + cos(Rx) cos(Rz)
R23 = cos(Rx) sin(Ry) sin(Rz) - cos(Rz) sin(Rx)
R31 = -sin(Ry)
R32 = cos(Ry) sin(Rx)
R33 = cos(Rx) cos(Ry)    (3)

t = \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}    (4)

2. Calculate the perspective transformation matrix:

C = k [R | t] = \begin{bmatrix} C_{11} & C_{12} & C_{13} & C_{14} \\ C_{21} & C_{22} & C_{23} & C_{24} \\ C_{31} & C_{32} & C_{33} & C_{34} \end{bmatrix}    (5)

3. Undistort the head and feet image points using the lens distortion parameter k1:

dx' = dx (Ncx / Nfx)    (6)
xd = dx' (xf - cx) / sx,   yd = (yf - cy) dy    (7)
r = sqrt(xd^2 + yd^2)    (8)
xu = xd (1 + k1 r^2),   yu = yd (1 + k1 r^2)    (9)

where xf, yf are the image coordinates and xu, yu the undistorted image coordinates.

4. The camera model is given by

\begin{bmatrix} x_u \\ y_u \\ 1 \end{bmatrix} = C \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & C_{13} & C_{14} \\ C_{21} & C_{22} & C_{23} & C_{24} \\ C_{31} & C_{32} & C_{33} & C_{34} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}    (10)

where X, Y, Z are world coordinates. Assume the person stands on the X, Y plane with Z pointing in the vertical direction; the world coordinate Z for the feet is therefore Z = 0:

\begin{bmatrix} x_u \\ y_u \\ 1 \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & C_{14} \\ C_{21} & C_{22} & C_{24} \\ C_{31} & C_{32} & C_{34} \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}    (11)

Calculate the world coordinates X, Y by the inverse transformation of this reduced matrix for the undistorted feet image coordinates xu, yu.

5. Calculate the world coordinate Z for the head image coordinate yu, using the X, Y obtained in step 4:

yu = (C_{21} X + C_{22} Y + C_{23} Z + C_{24}) / (C_{31} X + C_{32} Y + C_{33} Z + C_{34})    (12)

This world coordinate Z for the head is the person's height. The output of the height estimation stage contains only the person(s) whose height matches the height class given in the description (e.g., 130–160 cm). Height alone is not distinctive enough to single out a person in a crowd, but it can be used to classify people into height classes and thus reduce the search space in the surveillance frame. All filtered persons are passed on to torso color classification. The height estimation model is developed using the annotated head and feet points provided with the training dataset: the average height (Havg) computed from the annotated points is taken as the person's estimated height. The average height computed from the head and feet points extracted via the semantic boundary is larger than Havg, which would lead to over-estimation; hence, this difference is subtracted from the estimated height during testing to normalize the error.
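A minimal NumPy sketch of steps 1–5 is given below. The parameter names follow the notation of Table 1, but the dictionary layout and function names are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def rotation_matrix(Rx, Ry, Rz):
    # World-to-camera rotation, composed as Rz * Ry * Rx (consistent with Eq. (3)).
    cx_, sx_ = np.cos(Rx), np.sin(Rx)
    cy_, sy_ = np.cos(Ry), np.sin(Ry)
    cz_, sz_ = np.cos(Rz), np.sin(Rz)
    Rx_m = np.array([[1, 0, 0], [0, cx_, -sx_], [0, sx_, cx_]])
    Ry_m = np.array([[cy_, 0, sy_], [0, 1, 0], [-sy_, 0, cy_]])
    Rz_m = np.array([[cz_, -sz_, 0], [sz_, cz_, 0], [0, 0, 1]])
    return Rz_m @ Ry_m @ Rx_m

def estimate_height(head_px, feet_px, params):
    """Estimate the world Z coordinate of the head point (the person's height)
    from Tsai calibration parameters given as a dict (assumed layout)."""
    f, k1 = params['f'], params['k1']
    cx, cy, sx = params['cx'], params['cy'], params['sx']
    dx1 = params['dx'] * params['Ncx'] / params['Nfx']   # Eq. (6)
    dy = params['dy']

    def undistort(xf, yf):
        # Eqs. (7)-(9): image pixel -> undistorted sensor coordinates.
        xd = dx1 * (xf - cx) / sx
        yd = (yf - cy) * dy
        r2 = xd ** 2 + yd ** 2
        return xd * (1 + k1 * r2), yd * (1 + k1 * r2)

    k = np.array([[f, 0, 0], [0, f, 0], [0, 0, 1]])
    R = rotation_matrix(params['Rx'], params['Ry'], params['Rz'])
    t = np.array([[params['Tx']], [params['Ty']], [params['Tz']]])
    C = k @ np.hstack([R, t])                            # Eq. (5)

    # Feet lie on the ground plane Z = 0: invert the reduced matrix of Eq. (11)
    # and normalize by the homogeneous component.
    xu_f, yu_f = undistort(*feet_px)
    A = C[:, [0, 1, 3]]                                  # columns C_i1, C_i2, C_i4
    w = np.linalg.solve(A, np.array([xu_f, yu_f, 1.0]))
    X, Y = w[0] / w[2], w[1] / w[2]

    # Head: solve Eq. (12) for Z with the same ground-plane (X, Y).
    _, yu_h = undistort(*head_px)
    num = C[1, 0] * X + C[1, 1] * Y + C[1, 3] - yu_h * (C[2, 0] * X + C[2, 1] * Y + C[2, 3])
    den = yu_h * C[2, 2] - C[1, 2]
    return num / den                                     # estimated person height
```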

3.2 Torso Color Detection The torso and leg regions are separated using the golden ratio for human height, measured from the top of the bounding box (Fig. 7): the upper 20–50% of the bounding box is considered the torso region and 50–100% the leg region. A precise torso patch is extracted


Fig. 7 Separation of torso and leg region from body

from the torso region using semantic segmentation. Figure 6 (cloth color detection block) shows the torso color patch extraction. The extracted color patch is classified using the fine-tuned AlexNet; details of AlexNet training for color classification are in Sect. 3.4.1.
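A sketch of the torso patch extraction, assuming a NumPy image, a binary person mask, and an (x1, y1, x2, y2) bounding box (names and layout are assumptions for illustration):

```python
import numpy as np

def extract_torso_patch(image, mask, box):
    """Crop a background-free torso colour patch using the golden-ratio split
    described above (upper 20-50% of the bounding box is the torso region)."""
    x1, y1, x2, y2 = map(int, box)
    h = y2 - y1
    top, bottom = int(y1 + 0.2 * h), int(y1 + 0.5 * h)   # torso rows
    torso = image[top:bottom, x1:x2].copy()
    torso_mask = mask[top:bottom, x1:x2]
    torso[torso_mask == 0] = 0      # zero out background pixels via the segmentation mask
    return torso                    # patch fed to the fine-tuned AlexNet colour classifier
```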

3.3 Gender Classification The linear filtering approach retrieves the person accurately using height and color. The gender filter is useful when the output of cloth color detection (Fig. 6) contains multiple persons, i.e., persons with the same height class and torso color. AlexNet is fine-tuned using full body images for gender classification; details of AlexNet training for gender classification are in Sect. 3.4.2.

3.4 Implementation Details for Deep Learning Models The Mask R-CNN [10] deep learning model is used for person detection and semantic segmentation, with weights pre-trained on the Microsoft COCO [16] dataset. AlexNet [12], pre-trained on the ImageNet [3] dataset, is fine-tuned separately for color and gender classification. AlexNet is trained on a machine with an Intel Xeon processor and an 8 GB NVIDIA Quadro K5200 GPU.
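A minimal sketch of person detection and segmentation with a COCO-pretrained Mask R-CNN is shown below using torchvision; the chapter does not state which implementation was used, so this is an illustrative substitute rather than the authors' exact setup.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# COCO-pretrained Mask R-CNN (newer torchvision versions use the weights= argument).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_persons(pil_image, score_thr=0.7):
    """Return bounding boxes and binary masks for detected persons."""
    with torch.no_grad():
        out = model([to_tensor(pil_image)])[0]
    persons = []
    for box, label, score, mask in zip(out['boxes'], out['labels'],
                                       out['scores'], out['masks']):
        if label.item() == 1 and score.item() >= score_thr:   # COCO class 1 = person
            persons.append({'box': box.int().tolist(),
                            'mask': (mask[0] > 0.5).cpu().numpy()})  # binary mask
    return persons
```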

3.4.1 AlexNet Training for Color Classification

Torso color is categorized into 12 culture colors in the dataset [9], which contains 1704 color patches; some additional patches are extracted manually from the torso


region of the person. The model training uses a total of 8362 color patches. Such a small training set may cause AlexNet to overfit; hence, the color patches are augmented by increasing brightness with a gamma factor of 1.5, which yields 16,724 color patches. 80% of these samples are used for training and the remaining 20% constitute the validation set. The color model fine-tunes the last four layers (Conv5, fc6, fc7, and fc8) of AlexNet. The model trains for 30 epochs with a batch size of 128; the learning rate and dropout are 0.001 and 0.50, respectively. The validation accuracy achieved for the color model is 71.2%.
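A sketch of the fine-tuning setup in PyTorch follows. The layer indices refer to the torchvision AlexNet layout, and the momentum value is an assumption not stated in the text.

```python
import torch
import torch.nn as nn
import torchvision

# Fine-tune Conv5, fc6, fc7 and a new 12-class fc8 for torso-colour classification.
model = torchvision.models.alexnet(pretrained=True)
model.classifier[6] = nn.Linear(4096, 12)          # new fc8 for the 12 culture colours

for p in model.parameters():                       # freeze everything first
    p.requires_grad = False
for layer in [model.features[10],                  # Conv5
              model.classifier[1],                 # fc6
              model.classifier[4],                 # fc7
              model.classifier[6]]:                # fc8
    for p in layer.parameters():
        p.requires_grad = True

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.001, momentum=0.9)                        # momentum is an assumption
criterion = nn.CrossEntropyLoss()

def train_epoch(loader):
    """One epoch over a DataLoader yielding (colour patch batch, label batch)."""
    model.train()
    for patches, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()
```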

3.4.2 AlexNet Training for Gender Classification

Full body images are collected to train AlexNet for gender classification: 56,802 full body images are gathered from AVSS challenge II (task-1 and task-2) [9], RAP [15], PETA [4], GRID [17], and VIPeR [7]. Data augmentation is applied to increase the number of training samples: each image is rotated by 10 angles {−5°, −4°, −3°, −2°, −1°, 1°, 2°, 3°, 4°, 5°}, flipped horizontally and vertically, and brightened with a gamma factor of 1.5. Data augmentation generates 681,624 images for gender training. The training and validation sets consist of 80% and 20% of the total training images, respectively. The gender model fine-tunes the last two layers (fc7 and fc8) of AlexNet. The model trains for 10 epochs with a batch size of 128; the learning rate and dropout are 0.01 and 0.40, respectively. The validation accuracy achieved for the gender model is 71.26%. This section described the person retrieval algorithm using height, cloth color, and gender; its evaluation on the dataset follows in the next section.
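The augmentation described above could look roughly as follows with torchvision's functional transforms; whether the quoted gamma factor of 1.5 or its inverse produces the brightening is a convention the text does not specify, so it is treated as an assumption here.

```python
import torchvision.transforms.functional as TF

def augment(img):
    """Produce the augmented variants of one PIL image: ten small rotations,
    horizontal and vertical flips, and a gamma-based brightness change."""
    variants = [TF.rotate(img, angle) for angle in
                (-5, -4, -3, -2, -1, 1, 2, 3, 4, 5)]
    variants.append(TF.hflip(img))
    variants.append(TF.vflip(img))
    # In torchvision, gamma > 1 darkens; the chapter's exact convention is assumed.
    variants.append(TF.adjust_gamma(img, 1.5))
    return variants
```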

4 Experimental Results and Discussions This section discusses the dataset used, the evaluation metrics, and the qualitative and quantitative analysis on the test dataset. It also discusses the challenging conditions under which the algorithm fails to retrieve the person correctly.

4.1 Dataset Overview The algorithm is evaluated on the AVSS 2018 challenge II database [9] for semantic person retrieval using soft biometrics; the algorithm described in Sect. 3 is implemented and evaluated on this dataset. In challenge II, the dataset has the following two tasks: Task-1 (person re-identification): identify a person using the semantic description from an image gallery.


Task-2 (surveillance imagery search): localize a person using the semantic description in a given surveillance video. The chapter uses the dataset of task-2, i.e., surveillance search. The videos have varying levels of crowd density and crowd flow and come from six surveillance cameras located within a university campus. The dataset also provides Tsai [27] calibration parameters for some of the videos. The training dataset contains video sequences of 110 persons with sequence lengths varying from 21 to 290 frames. A person of interest may not be present in every frame of a video sequence; hence, the first 30 frames are used for background modeling and also allow the person to enter the camera view completely. The testing dataset consists of video sequences of 41 persons, each containing a single person of interest to be retrieved using the given description. An XML file provides the full annotation for each person in the dataset. The annotation consists of 9 body markers to localize the person in the frame and 16 soft biometric attributes. The 9 body markers are: top of the head, left and right neck, left and right shoulder, left and right waist, and the approximate toe positions of the feet. The algorithm uses these markers for height estimation and torso patch extraction during training. The ground truth bounding box during testing is created using these markers: the y-coordinate of the head and the lowest y-coordinate of the feet give the height of the bounding box, and the width is constructed using the two most extreme markers from the neck, shoulder, waist, or feet. Table 2 shows the 16 soft biometric attributes annotated in the dataset for each person; the unknown label is "−1" in the XML file for each attribute. The algorithm uses height, torso color, and gender to retrieve the person from the surveillance video. Height has five classes: very short (130–160 cm), short (150–170 cm), average (160–180 cm), tall (170–190 cm), and very tall (180–210 cm).

Table 2 Soft biometric attributes
Color: Unknown, black, blue, brown, green, grey, orange, pink, purple, red, white, yellow, skin
Texture: Unknown, plain, check, diagonal stripe, vertical stripe, horizontal stripe, spots, pictures
Torso type: Unknown, long sleeve, short sleeve, no sleeve
Leg type: Unknown, long pants, dress, skirt, long shorts, short shorts
Age: Unknown, 0–20, 15–35, 25–45, 35–55, >50
Gender: Unknown, male, female
Height: Unknown, very short (130–160 cm), short (150–170 cm), average (160–180 cm), tall (170–190 cm), very tall (180–210 cm)
Build: Unknown, very slim, slim, average, large, very large
Skin: Unknown, light, medium, dark
Hair: Unknown, blonde, brown, dark, red, grey, other
Luggage: Unknown, yes, no


Fig. 8 Distribution of classes for height, torso color-1, and gender in the training dataset

Table 3 Table of abbreviations
Positive condition (P): the number of real positive cases in the sample data
Negative condition (N): the number of real negative cases in the sample data
True positive (TP): correctly identified
False positive (FP): incorrectly identified
True negative (TN): correctly rejected
False negative (FN): incorrectly rejected

Male and female classes represent gender. Twelve culture colors represent color; torso color is represented by Torso-1 and Torso-2, the two dominant colors on the torso of the person. The distribution of the attributes height, Torso-1, and gender is shown in Fig. 8 and illustrates the class imbalance for each attribute. Height roughly follows a Gaussian distribution, reflecting the natural human height distribution in which persons of average (160–180 cm) height outnumber the other classes. There is large variability in the number of patches per color, and the male and female gender classes also show a minor difference in the number of samples.

4.2 Evaluation Metric Various performance measures are used to evaluate an algorithm; the literature uses measures such as the True Positive (TP) rate, Intersection over Union (IoU), and F1 score. This section summarizes the evaluation metrics used for person retrieval in the literature. The abbreviations in Table 3 are used in the definitions below.

Accuracy (ACC) It is a measure of how well a classifier identifies or rejects samples. It gives the proportion of true results among the total number of samples in the test and represents the overall efficiency of the system:

ACC = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN)    (13)


Precision It is the fraction of relevant samples among the classified samples, i.e., it measures how accurate the classification of samples is. For example, suppose an algorithm retrieves a person in 8 frames out of 12 total frames, and 5 of the 8 are the person of interest (true positives) while 3 are false positives; the algorithm's precision is 5/8 = 0.625:

Precision = TP / (TP + FP)    (14)

Recall or True Positive Rate (TPR) or Sensitivity It is the fraction of relevant samples that have been retrieved over the total number of relevant samples, i.e., it measures how well the algorithm finds all positive samples. In the above example, the algorithm's recall is 5/12 = 0.4167:

Recall or TPR = TP / (TP + FN)    (15)

Specificity, Selectivity, or True Negative Rate (TNR) It measures the proportion of actual negative samples that are correctly classified as negative:

TNR = TN / (TN + FP)    (16)

Mean Average Precision (mAP) It is a metric used to measure the accuracy of an object detector. The mean average precision is the average of the maximum precisions at different recall values:

mAP = (1 / TotalSamples) Σ_{Recall_i} Precision(Recall_i)    (17)

F1-Score It is a measure of a test's accuracy computed from both precision and recall; its best value is 1, i.e., perfect precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)    (18)

Intersection-over-Union (IoU) It is an evaluation metric used to measure the object localization accuracy of a detector whose output is a bounding box around the object. It is the area of overlap between the predicted and ground truth bounding boxes divided by the area of their union, and thus measures how well the detector localizes objects with respect to the ground truth, i.e., the actual object boundary. Figure 9 depicts the IoU metric, which is calculated as

Fig. 9 Intersection-over-Union (IoU) between the ground truth and predicted bounding boxes

IoU = Area of overlap / Area of union = (D ∩ GT) / (D ∪ GT)    (19)

where D is the bounding box output of the algorithm and GT is the ground truth bounding box. The linear filtering approach for person retrieval is evaluated using TPR and IoU; the different results are discussed in the next section.
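For reference, IoU (Eq. (19)) and a per-person TPR can be computed as in the sketch below. Treating one predicted box per frame and using a 0.4 IoU threshold mirror the analysis reported in the next section, but are assumptions rather than the chapter's exact evaluation script.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes (Eq. (19))."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def true_positive_rate(pred_boxes, gt_boxes, iou_thr=0.4):
    """Fraction of frames whose predicted box overlaps the ground truth by
    at least iou_thr; frames with no prediction count as misses."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes)
               if p is not None and iou(p, g) >= iou_thr)
    return hits / float(len(gt_boxes))
```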

4.3 Qualitative and Quantitative Results This section discusses the qualitative results using the output video frames of the person retrieval algorithm. The quantitative results are discussed using the evaluation metrics true positive rate (TPR) and Intersection-over-Union (IoU). IoU computes the localization accuracy of the algorithm: using the bounding box coordinates, it checks the overlapping region between the ground truth and the predicted bounding box. It is a useful measure when a ground truth bounding box is available for comparison, and its values range between 0 and 1. The algorithm is tested on the AVSS challenge II (task-2) dataset of 41 persons. The ground truth bounding box during experimentation is created using the body markers: the y-coordinate of the head and the lowest y-coordinate of the feet give the height of the bounding box, and the width is constructed using the two most extreme markers from the neck, shoulder, waist, or feet (see the sketch below). As in training, the initial 30 frames of each video sequence are used for background modeling and allow the person to enter the camera view completely.
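A minimal sketch of the ground-truth box construction from the nine markers follows; the marker key names are hypothetical, and using all marker x-coordinates for the width is a simplification of the rule above (which uses the neck, shoulder, waist, and feet markers).

```python
def bbox_from_markers(markers):
    """Build an (x1, y1, x2, y2) ground-truth box from a dict of (x, y)
    body-marker points, e.g. markers['head'], markers['left_foot'], ..."""
    xs = [x for x, _ in markers.values()]
    top = markers['head'][1]                                  # y of the head marker
    bottom = max(markers['left_foot'][1], markers['right_foot'][1])  # lowest feet y
    left, right = min(xs), max(xs)                            # extreme markers give the width
    return left, top, right, bottom
```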

4.3.1 True Positive Cases of Person Retrieval

Figure 10 shows sample frames in which a person is identified correctly using a semantic description supplied to the algorithm. Images from left to right indicate input test frame for the person, detection output, the result of height filter, cloth color detection, and the output of gender filter, respectively. The abbreviations in the caption are as follows: for example, TS.37, F.36 indicates test subject 37 with frame number 36 in the test dataset. Figure 10a shows a person retrieval using only height as an attribute. It shows a person from TS.37, F.36 with semantic description height (150–170 cm), torso color (red), and gender (male). It is the scenario where the target person is visible in the frame, and there is no complexity (e.g., occlusion, multiple persons). The algorithm retrieves the person using only height for such scenarios. A person TS.4, F.76 with semantic description height (160–180 cm), torso color (pink), and gender (female) is in Fig. 10b. Multiple persons are with the same height class. The person of interest is retrieved using torso color to the height filtered output. The algorithm uses height and torso color to retrieve the person. A person TS.9, F.39 with semantic description height (170–190 cm), torso color (black), and gender (female) is in Fig. 10c. The color filter provides an output with

Fig. 10 True positive cases of person retrieval. (a) TS.37, F.36: height (150–170 cm), torso color (red), and gender (male). Person retrieved using only height (Single attribute-based retrieval). (b) TS.4, F.76: height (160–180 cm), torso color (pink), and gender (female). Person retrieved using height and torso color (Two attribute-based retrieval). (c) TS.9, F.39: height (170–190 cm), torso color (black), and gender (female). Person retrieved using height, torso color, and gender. (Three attribute-based retrieval)


Fig. 11 TPR (%) and IoU for test dataset

two persons with the same torso color (black). The person of interest is retrieved using the gender filter. The algorithm here uses all three attributes, i.e., height, torso color, and gender, to retrieve the person. Thus, the algorithm retrieves a person by utilizing all the soft biometric attributes. The quantitative results are analyzed using the evaluation metrics TPR (%) and IoU. The TPR (%) is the number of frames with correct retrieval over the total number of frames for the person. As discussed previously, the first 30 frames are also not considered for the quantitative analysis. The algorithm correctly retrieves 28 out of 41 test subjects. Figure 11 shows the TPR (%) and IoU (on the Y-axis) against the index of the retrieved person (on the X-axis). The algorithm achieves a 75.37% average TPR and an average IoU of 0.4982 for the 28 correctly retrieved persons. Among these 28 persons, 20 have a TPR higher than 60%, 4 have a TPR between 30 and 60%, and 4 have a TPR between 0 and 30%. The height and color models work exceptionally well for the 20 persons with TPR > 60%. For all 41 persons, the average TPR is 58.92% and the average IoU is 0.3905. The algorithm achieves an IoU greater than or equal to 0.4 for 56.17% of frames over all 41 test subjects.

4.3.2 Challenges in Person Retrieval

Figure 12 shows challenging conditions like poor illumination, similar persons matching all three attributes, occlusion, and crowded scene. It shows various situations where the algorithm fails to retrieve the person correctly. The person of interest is in the bounding box. Figure 12a shows the retrieval failure due to torso color classification failure. It represents a person of TS.16, F.31 with semantic description height (160–180 cm), torso color (pink), and gender (female). The color classification probability score is highest for black, i.e., torso color. The second highest probability score is for brown, i.e., torso second color. Thus, the description torso color pink classifies as black and brown. It has failed to classify the correct color because the region of the torso color patch contains brown hair at the backside of a person.


Fig. 12 Challenges in person retrieval. (a) incorrect color classification with multiple persons, (b) multiple persons with occlusion, (c) multiple persons with same torso color and height class, (d) person detection fails, and (e) crowded scene

Figure 12b contains occlusion due to multiple persons, from TS.25, F.34 with semantic description height (150–170 cm), torso color (green), and gender (female). The Mask R-CNN person detection output shows only a single bounding box for two persons. Due to the larger bounding box, height estimation produces a larger real-world height compared to the description height, and hence person retrieval fails. The quantitative measures for TS.25 (TPR = 17.5%, IoU = 9.06) are very poor due to occlusion; similarly, in TS.20 (TPR = 14.28%, IoU = 7.36) person retrieval fails due to occlusion. TS.18, F.31, with semantic description height (180–210 cm), torso color (white), and gender (male), contains multiple persons with the same height, torso color, and gender (Fig. 12c); hence, the algorithm fails to retrieve the person uniquely. In TS.1, with semantic description height (180–210 cm), torso color (black), and gender (male), Mask R-CNN could not detect the person due to poor illumination, occlusion, and the person's body parts merging with the black background (Fig. 12d). Figure 12e shows a video frame with occlusion and a heavy crowd for TS.40, F.72 with semantic description height (130–160 cm), torso color (yellow), and gender (male); the TS.35 sequences also show a heavy crowd scenario. In TS.23 (TPR = 15.38%, IoU = 10.15) and TS.39 (TPR = 8.32%, IoU = 1.01) the torso color (brown) is incorrectly classified. This is due to the imbalanced color patches (Fig. 8) used in training. Color intensity also varies with illumination conditions; for example, in Fig. 12e, the yellow color appears brighter when the person enters the room and changes as the person moves inside it. Person retrieval is challenging in real surveillance scenarios due to the problems discussed above. The algorithm utilizes each attribute as a filter, which reduces the search space in the surveillance frame and the computation at later stages, i.e., color and gender classification. However, it has the potential weakness of error propagation: an error in height estimation propagates to color and gender classification. By incorporating additional attributes in the algorithm, the problem of similar persons (matching all three attributes) in retrieval can be solved. TS.18, F.31 (Fig. 12c) contains two persons with the same height, torso color, and gender, but with different leg colors: the person of interest has blue leg color, while the other person has black. Figure 13 shows the correct person retrieval using the leg color. Thus, rank-1 accuracy is achievable by incorporating more soft biometric attributes in the system.


Fig. 13 Use of leg color for person retrieval
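The cascaded attribute filtering described above can be summarized with a short sketch. This is illustrative only; the helper functions (detect_persons, estimate_height, classify_torso_color, classify_gender) are hypothetical stand-ins for the detection, height-estimation and classification stages, not the authors' implementation.

```python
# Minimal sketch of cascaded soft-biometric filtering (illustrative only).
# All helper functions are hypothetical stand-ins supplied by the caller.

def retrieve_person(frame, query, detect_persons, estimate_height,
                    classify_torso_color, classify_gender):
    """Return detections that survive all attribute filters.

    query: dict with 'height' (min_cm, max_cm), 'torso_color', 'gender'.
    """
    candidates = detect_persons(frame)              # e.g. instance masks/boxes

    # Filter 1: height -- a cheap geometric cue that shrinks the search space.
    lo, hi = query["height"]
    candidates = [c for c in candidates if lo <= estimate_height(c) <= hi]

    # Filter 2: torso colour on the segmented torso patch.
    candidates = [c for c in candidates
                  if classify_torso_color(c) == query["torso_color"]]

    # Filter 3: gender classification on the remaining candidates.
    candidates = [c for c in candidates
                  if classify_gender(c) == query["gender"]]

    # Any error in an earlier filter propagates to later stages, which is the
    # weakness discussed above; extra attributes (e.g. leg colour) can be
    # appended as further filters to break ties between similar persons.
    return candidates
```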

5 Conclusion and Future Work

The proposed approach retrieves a person in surveillance video using deep soft biometrics, i.e., height as a geometric feature, and color and gender as deep features. Person detection with semantic segmentation allows better height estimation and precise torso color patch extraction, which leads to person retrieval on the challenging dataset. 28 out of 41 persons are correctly retrieved. The algorithm achieves an average TPR of 58.92% and an average IoU of 0.3905 for all 41 test subjects, and 56.17% of frames have an IoU greater than or equal to 0.4. The retrieval accuracy can be improved through the following future explorations:
• Persons matching all three attributes can be further filtered using an additional attribute, e.g., leg color.
• Cloth type can be used to modulate the window for precise torso patch extraction, which will improve color classification accuracy.
• Softer decision making can be incorporated in the filtering approach to weight each attribute appropriately during person retrieval.
• Color patch datasets can be augmented with different brightness levels to handle the challenge of color variations due to illumination conditions.
• Other deep learning architectures (e.g., ResNet, DenseNet) can be used instead of AlexNet for better classification of color and gender.

Acknowledgements The authors would like to thank the Board of Research in Nuclear Sciences (BRNS) for a generous grant (36(3)/14/20/2016-BRNS/36020). We acknowledge the support of NVIDIA Corporation for the donation of the Quadro K5200 GPU used for this research. We would also like to thank the SAIVT Research Labs at Queensland University of Technology (QUT) for creating the challenging dataset. Thanks to Kenil Shah and Vandit Gajjar for their help during the implementation of the work.


References 1. A. Dantcheva, C. Velardo, A. D’Angelo, J.L. Dugelay, Bag of soft biometrics for person identification. Multimed. Tools Appl. 51(2), 739–77 (2011) 2. A. Dantcheva, P. Elia, A. Ross, What else does your biometric data reveal? a survey on soft biometrics. IEEE Trans. Inf. Forensics Secur. 11(3), 441–67 (2016) 3. J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in IEEE Conference on 2009 Computer Vision and Pattern Recognition, CVPR (IEEE, Piscataway, 2009), pp. 248–255 4. Y. Deng, P. Luo, C.C. Loy, X. Tang, Pedestrian attribute recognition at far distance, in Proceedings of the 22nd ACM international conference on Multimedia (ACM, New York, 2014), pp. 789–792 5. S. Denman, M. Halstead, C. Fookes, S. Sridharan, Searching for people using semantic soft biometric descriptions. Pattern Recogn. Lett. 68, 306–315 (2015) 6. H. Galiyawala, K. Shah, V. Gajjar, M.S. Raval, Person retrieval in surveillance video using height, color, and gender, in 15th IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS) (IEEE, New Zealand 2018) 7. D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking. In Proceedings of IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), vol. 3, no. 5, (2007), pp. 1–7. Citeseer 8. M. Halstead, S. Denman, S. Sridharan, C.B. Fookes, Locating people in video from semantic descriptions: a new database and approach, in Proceedings of the 22nd International Conference on Pattern Recognition (IEEE, Piscataway, 2014), pp. 4501–4506 9. M. Halstead, S. Denman, C. Fookes, Y. Tian, M. Nixon, Semantic person retrieval in surveillance using soft biometrics: AVSS 2018 challenge II, in IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS) (IEEE, New Zealand, 2018) 10. K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask R-CNN, in 2017 IEEE International Conference on Computer Vision (ICCV) (IEEE, Piscataway, 2017), pp. 2980–2988 11. A.K. Jain, S.C. Dass, K. Nandakumar, Soft biometric traits for personal recognition systems, in International Conference on Biometric Authentication (ICBA) (Springer, Berlin, 2004), pp. 731–738 12. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012), pp. 1097– 1105 13. R. Layne, T.M. Hospedales, S. Gong, Q. Mary, Person re-identification by attributes. In BMVC, vol. 2, no. 3, (2012), p. 8 14. D. Li, X. Chen, K. Huang, Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios, in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), (IEEE, Piscataway, 2015), pp. 111–115 15. D. Li, Z. Zhang, X. Chen, K. Huang, A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios. IEEE Trans. Image Process. 28(4), 1575–1590 (2019) 16. T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C.L. Zitnick, Microsoft coco: common objects in context, in European Conference on Computer Vision (Springer, Cham, 2014), pp. 740–755 17. C. Liu, S. Gong, C.C. Loy, X. Lin, Person re-identification: what features are important?, in European Conference on Computer Vision (Springer, Berlin, 2012) pp. 391–401 18. M.D. MacLeod, J.N. Frowley, J.W. Shepherd, Whole body information: its relevance to eyewitnesses, in Adult Eyewitness Testimony: Current Trends and Developments, eds. by D.F. 
Ross, J.D. Read, M.P. Toglia (Cambridge University Press, New York, 1994), pp. 125–143 19. M.S. Raval, Digital video forensics: description based person identification. CSI Commun. 39(12), 9–11 (2016)


20. D.A. Reid, M.S. Nixon, Imputing human descriptions in semantic biometrics, in Proceedings of the 2nd ACM Workshop on Multimedia in Forensics, Security and Intelligence 2010 (ACM, New York, 2010), pp. 25–30 21. D.A. Reid, S. Samangooei, C. Chen, M.S. Nixon, A. Ross, Soft biometrics for surveillance: an overview, in Handbook of Statistics, vol. 31 (Elsevier, Amsterdam, 2013), pp. 327–352 22. S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in Advances in Neural Information Processing Systems (2015), pp. 91–99 23. S. Samangooei, M. Nixon, B. Guo, The use of semantic human description as a soft biometric, in Proceedings of 2008 IEEE Second International Conference on Biometrics: Theory, Applications and Systems (IEEE, Arlington, 2008) 24. M.S. Sarfraz, A. Schumann, Y. Wang, R. Stiefelhagen, Deep view-sensitive pedestrian attribute inference in an end-to-end model, in BMVC, 2017. 25. P. Shah, M.S. Raval, S. Pandya, S. Chaudhary, A. Laddha, H. Galiyawala, Description based person identification: use of clothes color and type, in Computer Vision, Pattern Recognition, Image Processing, and Graphics: 6th National Conference, NCVPRIPG 2017 (Mandi, 2017), Revised Selected Papers 6 (Springer, Singapore, 2018), pp. 457–469 26. P. Sudowe, H. Spitzer, B. Leibe, Person attribute recognition with a jointly-trained holistic CNN model, in Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 87–95 27. R. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robot. Autom. 3(4), 323–344 (1987) 28. D.A. Vaquero, R.S. Feris, D. Tran, L. Brown, A. Hampapur, M. Turk, Attribute-based people search in surveillance environments, in 2009 Workshop on Applications of Computer Vision (WACV) (IEEE, Piscataway, 2009), pp. 1–8 29. J. Wang, X. Zhu, S. Gong, W. Li, Attribute recognition by joint recurrent learning of context and correlation, in IEEE International Conference on Computer Vision (ICCV) (2017) 30. C. Yao, B. Feng, D. Li, J. Li, Hierarchical pedestrian attribute recognition based on adaptive region localization, in 2017 IEEE International Conference on Multimedia and Expo Workshops (ICMEW) (IEEE, Piscataway, 2017), pp. 471–476 31. J. Zhu, S. Liao, Z. Lei, D. Yi, S. Li, Pedestrian attribute classification in surveillance: database and evaluation, in Proceedings of the IEEE International Conference on Computer Vision Workshops (2013), pp. 331–338 32. J. Zhu, S. Liao, D. Yi, Z. Lei, S.Z. Li, Multi-label CNN based pedestrian attribute learning for soft biometrics, in 2015 International Conference on Biometrics (ICB) (IEEE, Piscataway, 2015), pp. 535–540

Deep Spectral Biometrics: Overview and Open Issues Rumaisah Munir and Rizwan Ahmed Khan

1 Introduction

A biometric system is a reliable identity/personal recognition system which helps in the accurate identification of individuals. Biometric systems have diverse applications and are deployed in various industries across the world, for instance, for the identification of employees, at hospitals where the security of patient data is imperative, in the banking sector where secure financial transactions need to be made, and for border control where surveillance is necessary [1]. Biometric systems work by recording behavioral or physiological traits of individuals, such as the iris, the face, or the gait. These traits are recorded in the form of images and are processed by the system. During the processing phase, the information in these images is processed with the help of methods collectively termed feature extraction methods. Feature extraction methods extract certain features from the images, consequently reducing the information in these images [2, 3]. These extracted features are then saved as templates in a database for later use. Once the templates have been saved, the system is presented with a subject’s image (or test image); features are extracted from this image and then compared against the templates in the database [4]. Current biometric systems record physiological traits of individuals in the visible range of the electromagnetic spectrum, from 350 to 740 nm.

R. Munir Faculty of IT, Barrett Hodgson University, Karachi, Pakistan R. Ahmed Khan () Faculty of IT, Barrett Hodgson University, Karachi, Pakistan LIRIS, Université Claude Bernard Lyon 1, Lyon, France e-mail: [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_10


Fig. 1 Electromagnetic spectrum: bands and their wavelengths [5]

However, the electromagnetic spectrum consists of a range of bands, sub-bands (or wavelengths) and corresponding frequencies that can also be used, besides the visible range, to acquire physiological or behavioral traits of an individual in a biometric system. The electromagnetic spectrum can be seen in Fig. 1. Spectral imaging is a process used to acquire images in various bands or sub-bands of the electromagnetic spectrum besides the visible range. This helps in the extraction of information in bands that cannot be observed by human vision, which is only capable of processing information in the visible range of the electromagnetic spectrum. In this chapter, we use the term spectral imaging to include both multispectral imaging and hyperspectral imaging. In a multispectral imaging system, images are captured in tens of spectral bands; in hyperspectral imaging, images are captured in hundreds of spectral bands [6, 7]. Information from these multiple images can then be combined, or fused, into a single image. Image fusion is a process where multiple images captured of the same scene are combined to acquire images with better spatial resolution [8]. Features are extracted from this single image, followed by classification. Spectral imaging has been used extensively in remote sensing [9]. It has also been used for spectral bio-imaging of the eye [10], colorimetry [11], the study of biological experiments [12] and to capture diseases at an early stage via spectral microscopy [13]. Spectral imaging has recently become popular for its application in biometric systems because it has the ability to counter illumination changes and spoof attacks at the sensor [14–19].


Biometric systems deployed today, operating in the visible range, largely suffer from errors arising due to illumination conditions [20] and spoof attacks at the sensor [21], regardless of the trait captured. By the use of spectral imaging, the performance of current biometric systems can be enhanced further. A spectral biometric system works by acquiring images of a biometric trait (such as the face, the iris or the palmprint) of the individual. Figure 2 shows different biometric traits that have been captured at different wavelengths of the electromagnetic spectrum. According to [4], traits are chosen if they fulfill the following criteria:
• Universality (U): The biometric trait must be universal, i.e., every individual should possess the trait.
• Distinctiveness (D): The biometric trait allows two individuals to be identified as distinct.
• Permanence (Perm): The biometric trait must not change over time.
• Collectability (Co): The biometric trait should be such that it can be obtained easily from all individuals.
• Performance (Perf): The biometric trait must provide high matching accuracy.
• Acceptability (A): The biometric trait must be acceptable to the individual, to act as an identifier.
• Circumvention (Ci): The biometric trait must not be easily reproduced for spoof attacks.
In Table 1, we have reproduced the biometric traits that have been used for spectral imaging thus far.

Fig. 2 Images of different biometric traits, captured by spectral sensors, at different wavelengths of the electromagnetic spectrum

Table 1 In this table, we have reproduced the biometric identifiers relevant for spectral imaging as used by Jain et al. [4]

Identifier                   U       D       Perm    Co      Perf    A       Ci
Face [22–26]                 High    Low     Medium  High    Low     High    High
Facial thermogram [27–29]    High    High    Low     High    Medium  High    Low
Fingerprint [30, 31]         Medium  High    High    Medium  High    Medium  Medium
Dorsal hand [32]             Medium  Medium  Medium  Medium  Medium  Medium  Low
Iris [33, 34]                High    High    High    Medium  High    Low     Low
Palmprint [35, 36]           Medium  High    High    Medium  High    Medium  Medium

The details of the initialisms are as follows: U: Universality, D: Distinctiveness, Perm: Permanence, Co: Collectability, Perf: Performance, A: Acceptability, Ci: Circumvention


Once the trait has been chosen, a biometric system has to be designed to capture it. Section 2 deals with the generic design of a spectral biometric system, followed by a discussion of spectral bands in Sect. 3. The subsequent sections deal with three biometric traits for spectral imaging, the face, the iris and the palmprint, with applications of deep learning methods. These are followed by the conclusion and future directions.

2 A Spectral Biometric System: Design Perspective

For developing a biometric system based on the chosen trait, two sets of images are required. The first set is called the gallery set. It consists of images of the biometric trait, captured in a specific band of the electromagnetic spectrum, which are fed to the spectral biometric system. After these images are fed in, a feature extraction method is applied to reduce the amount of information in them. The resulting information, called features, is stored in the system as templates, so the gallery set is made up of templates instead of raw biometric trait images. This process of storing templates in the gallery set is called the enrolment phase. The other set of images is called the testing set. In this set, images are captured by the spectral sensor and feature extraction methods are applied to them. Once the reduced information, or features, has been extracted from the test image (also called the probe image), it is passed on for identity recognition to the comparator module. The comparator module performs template matching or classification, producing a positive (acceptance) or negative (rejection) result [4]. Figure 3 shows a generic spectral biometric system.

Fig. 3 A spectral biometric system [37]

Figure 5 shows the steps required to analyze the spectral image of a biometric trait [38]. The first step is the identification of landmark features in the spectral image of the trait. Feature extraction takes place after the first step, in which information is reduced and templates are produced from the resulting information. This is followed by classification, after which the image is accepted or rejected by the system. Classification methods depend largely upon the feature extraction methods in the preceding steps of the recognition process. Due to this dependency, the drawback of these methods is intra-class variability.
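A compact sketch of the enrolment and comparison flow just described is given below, with extract_features standing in for any feature extraction method; the Euclidean distance and the threshold are illustrative assumptions rather than a prescribed design.

```python
import numpy as np

# Sketch of the enrolment and comparison stages of a generic (spectral)
# biometric system. `extract_features` is a placeholder for whatever
# feature extraction method the system uses.

gallery = {}  # subject id -> stored template

def enrol(subject_id, image, extract_features):
    """Enrolment phase: store a template, not the raw image."""
    gallery[subject_id] = extract_features(image)

def verify(claimed_id, probe_image, extract_features, threshold=0.5):
    """Comparator module: match the probe template against the gallery."""
    probe = extract_features(probe_image)
    template = gallery[claimed_id]
    distance = np.linalg.norm(probe - template)   # illustrative metric
    return distance < threshold                   # accept / reject
```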

Fig. 4 A CNN architecture. Figure reproduced from [40]

Fig. 5 A pipeline for spectral biometric recognition [38]

Deep learning based methods are relatively new methods which learn multiple deep representations of the images to find higher-level features that prove to be the most discriminant. As a result, much better accuracy is achieved when inter-class variability is high and intra-class variability is low. There have been many breakthroughs with deep learning based methods, which work by learning deep features from data. A Convolutional Neural Network (CNN) is an important deep neural network architecture [39]. Figure 4 shows the architecture of a CNN, which was first proposed by LeCun [40].


The CNN is a multi-layer architecture in which every stage contributes towards the extraction of features. For this reason, each stage has an input and an output made up of feature maps; patterns or features extracted from locations of the input map can be found in the output feature map. The layers are defined as follows [41, 42]:
1. Convolution layer: This layer makes use of filters, which are convolved with the image, producing activation or feature maps.
2. Rectification layer: This layer makes use of nonlinear functions to obtain positive activation maps.
3. Feature pooling layer: This layer is inserted to reduce the size of the image representation, making the computation efficient. The number of parameters is also reduced, which in turn controls overfitting.
4. Fully-Connected (FC) layer: In this layer, each node is connected to every node in the following and preceding layers.
One challenge for a spectral imaging system is the nature of the spectral bands in which the images of a biometric trait have been captured. Templates from gallery set images may be recorded in the visible range, whereas the incoming probe images are captured by another spectral sensor, for instance, an NIR sensor. This poses a challenge for the system, which should be able to match features from the NIR image with those of the visible band image. When images in the testing set, obtained in one modality, are matched against images of the gallery set captured in the same modality, the process is referred to as same-spectral matching. This method is normally used to form a baseline case against which the recognition accuracy of other cases is compared. When images in the gallery set, captured in one modality, are matched against test images captured in another modality, the process is referred to as cross-spectral matching. In the next section, we study the nature of the spectral bands used so far in the literature.
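Before turning to the individual bands, the stage structure listed above can be made concrete with a minimal PyTorch sketch: one convolution/rectification/pooling stage followed by a fully-connected classifier. The channel counts, input size and number of classes are arbitrary assumptions, not taken from any of the surveyed architectures.

```python
import torch
import torch.nn as nn

# One convolution/rectification/pooling stage followed by a fully-connected
# layer, mirroring the layer roles listed above. Sizes are arbitrary.
class TinySpectralCNN(nn.Module):
    def __init__(self, in_bands=3, num_classes=10):
        super().__init__()
        self.stage = nn.Sequential(
            nn.Conv2d(in_bands, 16, kernel_size=3, padding=1),  # convolution layer
            nn.ReLU(),                                          # rectification layer
            nn.MaxPool2d(2),                                    # feature pooling layer
        )
        self.fc = nn.Linear(16 * 16 * 16, num_classes)          # fully-connected layer

    def forward(self, x):            # x: (batch, in_bands, 32, 32)
        x = self.stage(x)
        return self.fc(x.flatten(1))

logits = TinySpectralCNN()(torch.randn(4, 3, 32, 32))  # -> shape (4, 10)
```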

3 Spectral Bands

Current biometric systems operate in the visible range (350–740 nm) [28]. Spectral imaging can be used to improve the performance of these systems. In a spectral biometric system, images can be captured in either the IR band or in a specific wavelength (sub-band). They can also be captured in multiple sub-bands or multiple wavelengths to find discriminant features, and fusion of these multiple images captured in multiple wavelengths could furnish better results and information about a subject. The most widely used case of spectral imaging has been the use of InfraRed radiation to capture images. InfraRed is a modality which has been used alongside visible range biometric systems. Because of its invariance to ambient lighting, InfraRed is well-suited to capturing images of the face and the iris. InfraRed can be broken down into active IR and passive IR. Table 2 shows the corresponding wavelengths of these bands.

Table 2 The ranges of visible, active & passive InfraRed, as seen in Fig. 1

Band                          Wavelength
Visible                       350–740 nm
Near InfraRed (NIR)           740–1000 nm
Short-Wave InfraRed (SWIR)    1–3 μm
Mid-Wave InfraRed (MWIR)      3–5 μm
Long-Wave InfraRed (LWIR)     8–14 μm

The active IR band is divided into Near-IR (NIR) and Short-Wave IR (SWIR). This is a reflection-dominant band and has been used for capturing face and iris images. NIR ranges from 740 to 1000 nm and helps tackle low light conditions [22]. SWIR ranges from 1 to 3 μm; it is used for surveillance applications and can counter fog [23]. The active IR band is close to the visible range of the electromagnetic spectrum, which is why matching an image captured in the active IR band against an image captured in the visible band produces promising results: the phenomenology difference is small due to this proximity. The passive or thermal IR band [43] is divided into the Mid-Wave InfraRed (MWIR) and Long-Wave InfraRed (LWIR) bands. This band is emission dominant. Thermal bands are used to produce thermograms of the object (used in the acquisition phase), and are useful for imaging of skin tissue because thermal sensors are capable of sensing variations in the temperature of the skin [28]. MWIR ranges from 3 to 5 μm, while LWIR ranges from 8 to 14 μm. The thermal band is far from the visible range of the electromagnetic spectrum, so there is a large phenomenology difference between the two; matching an image captured in the thermal band against an image captured in the visible band can prove to be a challenge [28, 44]. In the next section, we describe various promising biometric traits for which images have been captured in bands other than the visible.
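The band boundaries in Table 2 can be captured in a small lookup helper, expressed here in nanometres; this is purely illustrative.

```python
# Wavelength ranges from Table 2, expressed in nanometres.
BANDS = {
    "Visible":                  (350, 740),
    "Near InfraRed (NIR)":      (740, 1_000),
    "Short-Wave InfraRed":      (1_000, 3_000),
    "Mid-Wave InfraRed":        (3_000, 5_000),
    "Long-Wave InfraRed":       (8_000, 14_000),
}

def band_of(wavelength_nm):
    """Return the spectral band a wavelength (in nm) falls into, if any."""
    for name, (lo, hi) in BANDS.items():
        if lo <= wavelength_nm <= hi:
            return name
    return None

print(band_of(850))     # Near InfraRed (NIR)
print(band_of(10_000))  # Long-Wave InfraRed
```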

4 Biometric Trait: Face

The face is a unique and universal biometric trait. It is also very easy to capture for surveillance activities. Much work has been done in terms of biometric systems and spectral imaging systems with the face as the chosen trait. Angelopoulou et al. and Cooksey et al. [45, 46] conducted experiments to understand whether reflectance (light reflected by the surface of the object) can help to distinguish between skin and non-skin regions. There was variance across human skin colors; it was found that darker skin reflected a smaller amount of light than lighter skin. However, these experiments did not take facial skin into account.

Fig. 6 Facial thermograms captured through thermal imaging [51] (MWIR and LWIR)

There has been extensive research work on face recognition systems, which are also widely deployed around the world. The face is a trait which provides universality and permanence, and the capture of a facial image is more acceptable to subjects than other traits. Since there has been extensive work on face recognition systems, many public databases are available which contain images captured in bands other than the visible band. We find that it is viable to choose NIR or SWIR for face imaging due to the small phenomenology gap between the visible band and the active IR band. Passive IR bands (MWIR or LWIR) have also been used to capture images of the face; these are referred to as facial thermograms and are captured by sensors that are capable of sensing the temperature of the skin. Very little work has been done to capture faces in the thermal bands [28, 47–50]. This work deals with matching gallery set images captured in the thermal band against images also captured in the thermal band, forming a baseline for the case where images in the gallery set are in a different band than those in the probe set. However, cross-matching thermal images against visible images did not yield encouraging results, because of the large phenomenology gap between the thermal band and the visible band. Figure 6 shows thermograms captured by LWIR and MWIR sensors. The datasets used by researchers for benchmarking machine learning algorithms on facial thermograms are not public at the moment, which has hindered research progress in the area. In the next subsection, we study databases of faces captured in active IR bands that have been used for deep learning based methods.

4.1 Databases for Face

In this section, we review the publicly available datasets for face recognition which have been obtained in spectral bands other than the visible band. These datasets have been used for experiments based on deep learning methods.

4.1.1 Carnegie Mellon University Hyperspectral Face Dataset (CMU-HSFD)

The CMU-HSFD dataset [52] is publicly available for research purposes. The researchers acquired this dataset in the range of 450–1100 nm (wavelengths with a step size of 10 nm). Forty-eight subjects participated in multiple sessions to provide images for the dataset, with illumination conditions altered in each session. Alignment errors were observed due to blinking of the participants. Figure 7 shows a cube from the CMU-HSFD dataset.

4.1.2 The Hong Kong Polytechnic University Hyperspectral Face Database (HK PolyU-HSFD)

The HK PolyU-HSFD is publicly available for research [53]. It has been developed at the Hong Kong Polytechnic University. The researchers acquired this dataset in the range of 400–720 nm (wavelengths with a step size of 10 nm). Twenty-five candidates (17 male and 8 female) participated in the making of this dataset. Multiple sessions were held over a period of 5 months to capture face cubes; in every session, there were three views, i.e., frontal, right, and left, for each cube. Variations in appearance were reported due to the length of time between sessions. Figure 8 shows a face cube from the HK PolyU-HSFD dataset.

Fig. 7 CMU-HSFD: example of a subject’s face cube [52]

Fig. 8 HK PolyU-HSFD: example of a subject’s face cube [53]

4.1.3 UWA Hyperspectral Face Database (UWA-HSFD)

The UWA Hyperspectral Face Database (UWA-HSFD) is publicly available for research [54, 55]. The researchers acquired this dataset in the range of 400–720 nm (wavelengths with a step size of 10 nm). Seventy candidates participated in the making of this dataset. Figure 9 shows a face cube from the UWA-HSFD dataset.

4.1.4 The Hong Kong Polytechnic University (PolyU) NIR Face Database

This public database has been produced by the Hong Kong Polytechnic University [56]. The images have been acquired from 335 subjects in the NIR range, and there are a total of 34,000 images in the database. Figure 10 shows example NIR images from the HK PolyU NIR Face Database.

4.1.5 CASIA HFB Database

The CASIA Heterogeneous Face Biometrics (HFB) Database is publicly available [57]. The images have been captured in the visible range and the NIR range, and the database also contains 3D images of the captured faces. A total of 100 subjects (57 male, 43 female) participated to provide their images. Figure 11 shows images from the database.

Fig. 9 UWA-HSFD: example of a subject’s face cube [54, 55]

Fig. 10 PolyU NIR database: example of NIR images [56]


Fig. 11 CASIA HFB database: VIS, NIR & 3D depth faces [57]

4.1.6 CASIA NIR-VIS 2.0 Database

This is a publicly available database [58] and is an improvement on the CASIA HFB Database. The images have been captured from 725 subjects, in both the visible and the InfraRed range. Figure 12 shows an image taken from the database.

4.2 Deep Learning Based Methods for Face Recognition

In this section, we see how deep learning based methods have been used in spectral imaging of the face.

4.2.1 Face Recognition Using Convolutional Neural Networks (CNNs)

Fig. 12 CASIA NIR-VIS 2.0 face database: VIS & NIR images [58]

In this case, a Convolutional Neural Network has been trained for classification of hyperspectral face images and for band selection [59]. In the band selection process, many different wavelengths have been used to capture multiple face images. Sharma et al. considered a hyperspectral cube in which every band is treated as a separate image. They then trained their CNN model, which consisted of three convolution layers, two FC layers and a softmax loss function; a batch normalization layer, ReLU and a max pooling layer follow every convolution layer. In this framework, specific bands that are rich in information and uncorrelated with each other were chosen. The same was done to choose the best three bands, from which hand-crafted features (LBP, HOG, and DSIFT-FVs [60]) were extracted. After classification, a recognition accuracy of 99.4% was reported on the HK PolyU-HSFD dataset and 99.2% on the CMU-HSFD dataset. The recognition accuracies obtained from LBP (80% and 99.5%), HOG (79.4% and 99.6%) and DSIFT-FV (88.3% and 98.4%), respectively, were also competitive; however, for the HK PolyU-HSFD, results were higher with the CNN than with the hand-crafted features.
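The idea of selecting bands that are information-rich yet mutually uncorrelated can be sketched with a simple greedy heuristic based on per-band variance and correlation. This is a simplified stand-in for illustration, not the CNN-based band selection of Sharma et al. [59].

```python
import numpy as np

def select_bands(cube, k=3):
    """Greedy sketch: pick k bands that are informative (high variance)
    and mutually uncorrelated. `cube` has shape (H, W, num_bands)."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)       # pixels x bands
    variance = flat.var(axis=0)                         # "information" proxy
    corr = np.abs(np.corrcoef(flat, rowvar=False))      # band-to-band correlation

    selected = [int(np.argmax(variance))]               # start with the richest band
    while len(selected) < k:
        # Penalise bands highly correlated with those already chosen.
        penalty = corr[:, selected].max(axis=1)
        score = variance * (1.0 - penalty)
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

# Example on a random 64x64 cube with 33 spectral bands.
bands = select_bands(np.random.rand(64, 64, 33), k=3)
```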

4.2.2 NIRFaceNet

In this method, GoogLeNet [61] was modified to produce NIRFaceNet [62]. GoogLeNet is a deep neural network consisting of 22 parameter layers (27 layers in total if pooling is included). The CASIA NIR dataset is used in this case. Since this dataset is small (only 3940 pictures), GoogLeNet had to be modified, as GoogLeNet was trained on a much larger dataset (ImageNet); the modified architecture has only eight layers. The CASIA NIR Database was used for testing NIRFaceNet, GoogLeNet, LBP+PCA [63, 64], LBP Histogram, ZMUDWT [65], and ZMHK [66]. Nine test sets were created on which these methods were applied; in the different test sets, different types of noise or occlusion were added to the images to assess recognition performance. NIRFaceNet was shown to provide the highest recognition accuracy for all the cases. The main motivation for developing NIRFaceNet was non-cooperative scenarios where the user is at some distance from the spectral sensor. Therefore, NIRFaceNet is considered suitable for surveillance activities.

5 Biometric Trait: Iris

A commercial iris based biometric system works in the InfraRed (IR) range of the electromagnetic spectrum [33, 67]. The irides of different humans, in the periocular region, vary in color, and the variation can be across individuals of different regions. According to Boyce et al. [33], the variation in color of different irides is due to the cellular density and pigmentation of the stroma. When light penetrates the iris, the band with the longer wavelength is absorbed by the iris while that with the shorter wavelength is reflected back. The different colors of the iris also arise from the cellular composition and the melanin in the iris, which determine which band is reflected back. Due to the varying phenotypical traits of the irides, spectral imaging is able to break down the reflection pattern [68]. Iris images are also collected in the form of images of the periocular region. Researchers have used periocular region images to enhance the recognition performance of iris and face recognition systems [69]. The image of this region consists of the iris, eyebrow, skin beneath the eye and the eyelashes. These images are captured in both the visible range and the Near InfraRed range. An example of a periocular image can be seen in Fig. 13. Although the iris is a unique and stable biometric trait, it has not been implemented in systems as widely as commercial face biometric systems.

Fig. 13 An image of the periocular region of the face [70]


It is, however, used along with important biometric traits such as the face and the fingerprint. From Table 1, we see that this biometric trait has high permanence, distinctiveness and universality. It also has low circumvention, which is desirable, unlike the face, for which artifacts can be easily reproduced. The iris is not affected by pose variations, facial expression variations and age, unlike the face. However, due to low acceptability, it is not widely used. Many detailed surveys have been written on iris recognition and periocular recognition [71–73]. In the next section, we highlight publicly available spectral datasets for iris and periocular recognition.

5.1 Databases for Iris

In this section, we review publicly available datasets for iris recognition.

5.1.1 The IIT Delhi Iris Database

This publicly available database [74, 75] has been collected at IIT Delhi, New Delhi, India. Images of the iris have been captured from students and staff members at the university. The database has a total of 1120 images acquired from 224 individuals (176 male and 48 female subjects), with ages ranging from 14 to 55 years. The images were captured in an indoor environment. Figure 14 shows an image of the iris from the IIT Delhi Iris Database.

Fig. 14 An image of the iris in the IITD database [74, 75]

5.1.2 CASIA-Iris-Thousand Database

This is a publicly available database [76]. It has nearly 20,000 images of the iris, captured from 1000 subjects. These images have been captured in the NIR range. The specialty of this database is the presence of images from a very large number of subjects. Figure 15 shows an image of the iris from the CASIA-Iris-Thousand database.

5.1.3 IIITD Multispectral Periocular Database (IMP)

The IMP database consists of images in three different spectrums, i.e., the visible spectrum, night vision and Near InfraRed (NIR) [69]. In the visible set, there are a total of 310 images from 62 subjects (5 images per subject), and likewise 310 images from the same 62 subjects in the night vision range. In the NIR set, there are 620 images of the 62 subjects: for every subject there are 10 images, 5 of the right eye and 5 of the left. Figure 16 shows images of the periocular region captured in the three different spectrums.

5.1.4 University of Notre Dame Dataset

NIR imaging (an LG2200 iris camera) was used to acquire the images in this dataset [77]. ND-IRIS-0405 has 64,980 images, acquired from 356 subjects. Real-world conditions such as occlusion and blur are present in the images of this dataset. Figure 17 shows an image from the ND-IRIS-0405 dataset.

Fig. 15 An image of the iris in CASIA iris database [76]


Fig. 16 Images from the IMP database (a) visible (b) night vision (c) NIR [69]

Fig. 17 An image from the ND-IRIS-0405 dataset [77]


In another dataset from the same group, the CrossSensor-Iris-2013 database [78], 29,986 images were acquired using an LG4000 sensor and 116,564 images using an LG2200 sensor, from 676 subjects.

5.1.5 Visible Light Mobile Ocular Biometric Dataset

Visible light mobile Ocular Biometric (VISOB) [79] dataset was produced for the ICIP2016 Challenge. Frontal images of 550 subjects were acquired using three different mobile cameras: iPhone 5S, Samsung Note 4 and Oppo N1. There were three different illumination conditions. There are a total of 48,250 images in the enrollment set and 46,797 for the verification set. Figure 18 shows images from the VISOB dataset.

5.1.6 Cross Sensor Iris and Periocular Database

Cross Sensor Iris and Periocular (CSIP) database [80] consists of images acquired by four smartphones (Sony Ericsson Xperia Arc, Apple iPhone 4, ThL W200, and Huawei U8510). Different setups have been applied such as capturing images with the frontal or rear cameras and capturing images with or without flash. Images of subjects have been captured under different illumination conditions. Fifty subjects volunteered to provide 2004 right periocular images. Figure 19 shows images from the CSIP dataset.

Fig. 18 Images from the VISOB dataset [79]

Fig. 19 Images from the CSIP dataset [80]


5.2 Deep Learning Based Methods for Iris Verification

In this section, we highlight cases where neural network based learning has been applied to match gallery images against probe images of irides.

5.2.1 VGG-Net

Minaee et al. [81] used deep learning based methods for iris verification. They used both the CASIA-Iris-Thousand database and the IIT Delhi Iris Database, and specifically extracted deep features from VGG-Net [82, 83] for iris verification. The network has 16 layers with trainable weights and five pooling layers, and about 138 million parameters. After the extraction of deep features, PCA is used for dimensionality reduction and an SVM is applied for classification. The proposed method produced a high accuracy of 99.4% on the IIT database, compared with other algorithms applied to the same database, such as the Haar Wavelet [75] (96.6% accuracy), the Log-Gabor Filter by Kumar [75] (97.2% accuracy), Fusion [75] (97.4% accuracy), Elastic Graph Matching [84] (98% accuracy) and texture+scattering features [85] (99.2% accuracy).
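The pipeline described above (deep features from a pretrained network, PCA for dimensionality reduction, then an SVM) can be roughly sketched as follows, using torchvision's VGG-16 as an off-the-shelf feature extractor. The layer used for features, the PCA dimensionality and the data preparation are assumptions for illustration, not the exact setup of Minaee et al. [81].

```python
import torch
import torchvision.models as models
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Pretrained VGG-16 as a fixed feature extractor (illustrative stand-in;
# downloading the pretrained weights is required).
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())

def deep_features(batch):                 # batch: (N, 3, 224, 224), normalised
    with torch.no_grad():
        return extractor(batch).numpy()

# train_imgs/test_imgs and labels are assumed to be prepared elsewhere,
# with enough training samples for a 128-component PCA.
def fit_and_score(train_imgs, train_labels, test_imgs, test_labels):
    pca = PCA(n_components=128)
    svm = SVC(kernel="linear")
    x_train = pca.fit_transform(deep_features(train_imgs))
    svm.fit(x_train, train_labels)
    return svm.score(pca.transform(deep_features(test_imgs)), test_labels)
```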

5.2.2 Deep Learning on IMP Database

Sharma et al. [69] used deep learning based methods to cross-match images in the IMP database. The database consists of images of the periocular region in the visible spectrum, night vision and the NIR spectrum. Sharma et al. used deep learning to extract features from the individual spectrums and then combined them for matching cross-spectral images. As a first step, a neural network is learned for the first spectrum, called Spectrum-1 NNet. The real pairs (label 0) and the fake pairs (label 1) are provided as input to the network. The network has two hidden layers, a radial basis kernel function and a sigmoid activation function; weight training is performed through back-propagation with regularization. Another network is then learned for the second spectrum, called Spectrum-2 NNet, using the same process. After this, the Spectrum-1 and Spectrum-2 networks are combined to perform cross-matching. Pyramid HOG (PHOG) [86, 87] features are obtained from both Spectrum 1 and 2, and a cross-spectral feature vector is created which is given as input to the neural network in a weighted manner. Sharma et al. compared the results of their algorithm with those of Local Binary Patterns (LBP) [88, 89], Histogram of Oriented Gradients (HOG) [87], Pyramid HOG (PHOG) [86, 87], and Four-Patch LBP (FPLBP) [90] descriptors. Among the four descriptors, the best results are obtained in the NIR range due to its illumination invariance, in the case of same-spectral matching. However, in cross-matching, the descriptors perform poorly. The neural networks (Spectrum 1 and 2) are also applied for same-spectral matching in all three ranges.


In this case, the accuracy is further improved. In the cross-spectral matching case, matching is performed between Visible–NIR, Visible–night vision and night vision–NIR. There is a 35–50% improvement compared to the results obtained from the descriptors.
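The genuine/impostor pair formulation can be sketched loosely as follows, with plain HOG from scikit-image standing in for PHOG and scikit-learn's MLP standing in for the authors' custom two-hidden-layer network; the weights w1, w2 and all hyperparameters are illustrative assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.neural_network import MLPClassifier

def pair_vector(img_spec1, img_spec2, w1=0.5, w2=0.5):
    """Weighted cross-spectral feature vector from two same-size grayscale
    images (plain HOG here stands in for the PHOG features of the paper)."""
    f1 = hog(img_spec1, orientations=9, pixels_per_cell=(16, 16))
    f2 = hog(img_spec2, orientations=9, pixels_per_cell=(16, 16))
    return np.concatenate([w1 * f1, w2 * f2])

# pairs: list of (visible_image, nir_image); labels: 0 = genuine, 1 = impostor.
def train_matcher(pairs, labels):
    X = np.stack([pair_vector(a, b) for a, b in pairs])
    net = MLPClassifier(hidden_layer_sizes=(128, 64), activation="logistic",
                        alpha=1e-3, max_iter=500)   # two hidden layers, regularised
    return net.fit(X, labels)
```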

5.2.3 DeepIrisNet

Gangwar et al. [91] proposed a Deep Convolutional Neural Network (DCNN) based method for iris recognition. They performed their experiments on two publicly available iris databases, namely ND-IRIS-0405 and ND-CrossSensor-Iris-2013, and used two architectures, DeepIrisNet-A and DeepIrisNet-B. DeepIrisNet-A consists of eight convolutional layers with batch normalization, with a pooling layer after every two convolution layers (four pooling layers in total). DeepIrisNet-B has five convolution layers followed by two inception layers, again with a pooling layer after every two convolution layers. In the top three layers of both architectures, each output neuron is connected to every input. The activation function used in the hidden layers is ReLU (Rectified Linear Unit), and the weights are initialized with a zero-mean Gaussian distribution. The proposed architectures provided highly accurate recognition results.

5.2.4 LightCNN29

Garg et al. [92] used LightCNN29 to perform periocular recognition in unconstrained conditions. They applied their proposed model to the IMP, CSIP and VISOB datasets. The model contains twenty-nine convolutional layers with 3×3 filters and four pooling layers. On the CSIP dataset, the proposed model provided a Rank-1 recognition accuracy of 89.53%, which is higher than the accuracy obtained with the HOG [87] and DAISY [93] algorithms. On the VISOB dataset, for all lighting conditions and environments (indoors or outdoors), the proposed algorithm provides a 10% improvement in EER. For the IMP dataset, a Genuine Accept Rate of 82.97% at 10% False Accept Rate was achieved.

6 Biometric Trait: Palmprint

The palmprint is a biometric trait known for several advantages over other traits. Among these advantages are the low-cost capture system used to acquire the image and its resistance to spoof attacks. Very recently, deep learning based methods have been used to study the identification accuracy of multispectral palmprint recognition systems [94–96].


In a palmprint recognition system, the palmprints of two individuals are compared to see if the line patterns match. As per [97, 98], there are principal lines (further divided into the heart line, head line, and life line), wrinkles and ridges on the palm. The principal lines are unique features for individuals and do not change with the passage of time. Wrinkles are thin and irregular in nature, and the remaining lines on the palm, other than the principal lines and the wrinkles, are classified as ridges. The lines of the palm are unique and permanent in nature and therefore offer rich, discriminative information for identification. Spectral imaging can be used to extract information in the palmprint to create a multimodal biometric system that is robust against spoof attacks. Spectral imaging has been used for palmprint recognition by Hao et al. [36, 99, 100]. In the literature, we see two kinds of spectral palmprint verification systems, as discussed by Raghavendra et al. [101]:
1. A system with multiple cameras, where each camera is capable of acquiring images in a single spectral band.
2. A single camera with multiple illuminators which capture the images in multiple bands.
The second system has an advantage over the first because a single camera can simultaneously capture images of a subject in different spectral bands, so no registration is required. Such a system has been proposed by Hao et al. [102], where hygiene and user friendliness have been prioritized: the developed system requires no hand contact with the device. A time division approach is applied instead of frequency division, in which images in different bands are obtained in consecutive timeslots. Palmprint based biometric systems are not widely used; fingerprint based biometric systems are deployed widely and are generally sufficient. However, a palmprint based system is an alternative that can be combined with other biometric traits for recognition. The next subsection deals with publicly available palmprint datasets.

6.1 Databases for Palmprint

This section discusses publicly available datasets for palmprint recognition. These datasets are needed to apply deep learning.

6.1.1 PolyU Multispectral Palmprint Database

For the collection of this dataset, images were acquired from 250 subjects: 195 male and 55 female, with ages ranging from 20 to 60 years. Samples were acquired in two different sessions, separated by 9 days. Six images were acquired from each subject for each palm in a single session, giving a total of 6000 images for a single illumination. Images were captured in four different bands: red, green, blue, and NIR. This database is publicly available [35, 103–106] and has been collected by the Hong Kong Polytechnic University. Figure 20 shows the Region of Interest (ROI) of a palm, captured in the four different bands.

Fig. 20 From left to right, palm under: blue band, green band, red band and NIR [103]

6.1.2 CASIA Multispectral Palmprint Database

CASIA Multispectral Palmprint is a publicly available dataset, collected by the Chinese Academy of Sciences' Institute of Automation (CASIA) [99, 102, 107]. The dataset has 7200 palmprint images in total, captured over two sessions separated by a month, with variations in the posture of the hand. The images have been acquired from 100 subjects and are provided as gray-scale JPEG files. Every session has three samples, each consisting of six images captured at different spectral bands (460 nm, 630 nm, 700 nm, 850 nm, 940 nm and white light). Figure 21 shows images of a palmprint present in the dataset.

6.2 Deep Learning Based Methods for Palmprint Verification

In this section, we explore how deep learning based methods have been exploited for palmprint verification.

6.2.1 PCANet Deep Learning for Palmprint Verification

Meraoumia et al. [95, 96] experimented with a palmprint based biometric system. In this system, a new image is produced in which the Region of Interest (ROI) is first extracted from the original palmprint image. A PCANet descriptor acts as the feature vector for each new image. All feature vectors are then used to train a classifier such as an SVM, RBF, RFT, or KNN. When the biometric system is presented with a test image, the ROI is first extracted; the feature vector is then extracted from this new image and passed on to the SVM classifier for a decision. The PCANet descriptor used in this case is a texture based deep learning method. Meraoumia et al. [96] report that the advantage of PCANet is its ability to extract palmprint features with increasing inter-class variability and decreasing intra-class variability, which results in accurate identification.

Fig. 21 From left to right, six different images of the palmprint [99, 102, 107]

PCANet has the following components [96]:
• Cascaded Principal Component Analysis (PCA): There are two steps here. In the first step, the filter banks are computed through application of PCA [63, 64] over a set of vectors. In the second step, the PCANet algorithm is executed over every output image.
• Binary hashing: In this step, the format of the output images is converted to binary.
• Histograms: In this step, every image from the previous step is partitioned into a certain number of blocks. After this partition, a histogram is computed for every decimal value in the block.
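A much-simplified, single-stage sketch of these components is given below (the cascading of a second PCA stage and the final SVM/KNN classifier are omitted). It assumes grayscale ROI images as 2D NumPy arrays and is only meant to illustrate the filter-learning, binary-hashing and block-histogram steps.

```python
import numpy as np
from scipy.signal import convolve2d

def pca_filters(images, num_filters=8, patch=7):
    """Learn one PCANet-style filter bank: PCA over mean-removed patches."""
    r = patch // 2
    patches = []
    for img in images:
        for i in range(r, img.shape[0] - r, 4):      # stride 4 keeps it small
            for j in range(r, img.shape[1] - r, 4):
                p = img[i - r:i + r + 1, j - r:j + r + 1].ravel()
                patches.append(p - p.mean())
    X = np.stack(patches)
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # principal directions
    return vt[:num_filters].reshape(num_filters, patch, patch)

def pcanet_feature(img, filters, blocks=4):
    """Filter responses -> binary hashing -> block histograms (one stage)."""
    maps = [convolve2d(img, f, mode="same") for f in filters]
    # Binary hashing: each filter contributes one bit per pixel.
    code = sum((m > 0).astype(np.int64) << k for k, m in enumerate(maps))
    hists = [np.bincount(block.ravel(), minlength=2 ** len(filters))
             for block in np.array_split(code, blocks, axis=0)]
    return np.concatenate(hists)   # feature vector for an SVM/KNN/etc.

# Example on random "ROI" images.
rois = [np.random.rand(64, 64) for _ in range(5)]
bank = pca_filters(rois)
feature = pcanet_feature(rois[0], bank)
```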


The scheme has been tested on publicly available multispectral palmprint databases: the CASIA Multispectral Palmprint Database and the PolyU Multispectral Palmprint Database. A very high identification accuracy was reported, compared with methods such as the Local Line Directional Pattern [100], Log-Gabor feature extraction with dimension reduction using Kernel Discriminant Analysis and classification by a Sparse Representation Classifier [101], log-Gabor with Hamming distance [108], the Neighboring Direction Indicator [109], and the Double Orientation Code [110]. However, there was a bias in the bands selected for each method: the best bands were chosen for each, and these differed between schemes. Since the same bands were not used to check accuracy rates for every method, common benchmarking is hindered.

7 Conclusion and Future Work

In this section, we review the limitations of deep learning based methods for spectral biometric systems. We agree that a way forward for the implementation of spectral imaging, to help capture information beyond the visible spectrum, is the use of active IR imagery. This is because of the small phenomenology difference between active IR and the visible range, which is used to capture gallery images. There is very little or no work in the passive IR (thermal) bands. Databases with thermal images of faces exist, but have not been made public. Initially, a large publicly available dataset needs to be acquired to observe the merits of IR imaging for biometric traits such as the face and the iris.

The number of images in the existing datasets for the different biometric traits is very small; a large number of images needs to be captured. A common, standard dataset should be released for experiments and cross-evaluation of algorithms. We have observed that the publicly available databases have generally been captured in different bands or wavelengths. Experimental results cannot be cross-evaluated unless the images are captured in the same wavelength, and it is also not possible to verify results because of the difference in bands used by different research groups.

An important point to note for datasets is the consideration of real-world scenarios. Most of the available datasets have been obtained in controlled conditions. Datasets should be captured in uncontrolled conditions with sufficient stand-off distances; this removes bias and is more suited to a practical real-world case. After capturing a large database, deep learning based methods should be used for evaluation. Deep learning based methods have shown promise, but deep networks possess a large number of parameters to learn; if the number of training samples is small, as with a smaller dataset, overfitting occurs. To learn more discriminative features, there is a need to go deeper. This will take place when a common benchmark is obtained to avoid the bias introduced by specific methods and datasets, as we have seen so far in the literature.


References 1. A. Jain, L. Hong, S. Pankanti, Biometric identification, Commun. ACM 43(2), 90–90 (2000) 2. R.A. Khan, A. Meyer, H. Konik, S. Bouakaz, Framework for reliable, real-time facial expression recognition for low resolution images. Pattern Recogn. Lett. 34(10), 1159–1168 (2013) 3. R.A. Khan, A. Meyer, H. Konik, S. Bouakaz, Pain detection through shape and appearance features, in 2013 IEEE International Conference on Multimedia and Expo (ICME) (2013), pp. 1–6. https://doi.org/10.1109/ICME.2013.6607608 4. A.K. Jain, A. Ross, S. Prabhakar, An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 14(1), 4–20 (2004). https://doi.org/10.1109/TCSVT.2003. 818349 5. C. Ibarra-Castanedo, Quantitative subsurface defect evaluation by pulsed phase thermography: depth retrieval with the phase. PhD Thesis, Laval University (2005) 6. D.W. Allen, An overview of spectral imaging of human skin toward face recognition, in Face Recognition Across the Imaging Spectrum (Springer, Berlin, 2016), pp. 1–19 7. M. Nischan, R. Joseph, J. Libby, J. Kerekes, Active spectral imaging. Lincoln Lab. J. 14, 131–144 (2003) 8. Q. Wei, J. Bioucas-Dias, N. Dobigeon, J.-Y. Tourneret, Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Trans. Geosci. Remote Sens. 53(7), 3658–3668 (2015) 9. G.A. Shaw, H.-h.K. Burke, Spectral imaging for remote sensing. Lincoln Lab. J. 14(1), 3–28 (2003) 10. D. Cabib, M. Adel, R.A. Buckwald, E. Horn, Spectral bio-imaging of the eye (Apr. 29, 2003). US Patent 6,556,853 11. T.S. Hyvarinen, E. Herrala, A. Dall’Ava, Direct sight imaging spectrograph: a unique add-on component brings spectral imaging to industrial applications, in Digital Solid State Cameras: Designs and Applications, vol. 3302 (International Society for Optics and Photonics, Bellingham, 1998), pp. 165–176 12. M. Dickinson, G. Bearman, S. Tille, R. Lansford, S. Fraser, Multi-spectral imaging and linear unmixing add a whole new dimension to laser scanning fluorescence microscopy. Biotechniques 31(6), 1272–1279 (2001) 13. D.L. Farkas, D. Becker, Applications of spectral imaging: detection and analysis of human melanoma and its precursors. Pigment Cell Melanoma Res. 14(1), 2–8 (2001) 14. H.J. Bouchech, S. Foufou, A. Koschan, M. Abidi, A kernelized sparsity-based approach for best spectral bands selection for face recognition. Multimed. Tools Appl. 74(19), 8631–8654 (2015). https://doi.org/10.1007/s11042-014-2350-2 15. H. Steiner, A. Kolb, N. Jung, Reliable face anti-spoofing using multispectral SWIR imaging, in 2016 International Conference on Biometrics (ICB) (IEEE, Piscataway, 2016), pp. 1–8 16. R. Raghavendra, K.B. Raja, S. Venkatesh, C. Busch, Face presentation attack detection by exploring spectral signatures, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (IEEE, Piscataway, 2017), pp. 672–679 17. R. Raghavendra, K.B. Raja, S. Venkatesh, F.A. Cheikh, C. Busch, On the vulnerability of extended multispectral face recognition systems towards presentation attacks, in 2017 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA) (IEEE, Piscataway, 2017), pp. 1–8 18. N. Vetrekar, R. Raghavendra, K.B. Raja, R. Gad, C. Busch, Extended spectral to visible comparison based on spectral band selection method for robust face recognition, in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (IEEE, Piscataway, 2017), pp. 924–930 19. T.I. Dhamecha, A. Nigam, R. Singh, M. 
Vatsa, Disguise detection and face recognition in visible and thermal spectrums, in 2013 International Conference on Biometrics (ICB) (2013), pp. 1–8. https://doi.org/10.1109/ICB.2013.6613019


20. H. Chang, A. Koschan, B. Abidi, M. Abidi, Physics-based fusion of multispectral data for improved face recognition, in 18th International Conference on Pattern Recognition (ICPR’06) (2006) 21. R. Ramachandra, C. Busch, Presentation attack detection methods for face recognition systems: a comprehensive survey. ACM Comput. Surv. 50(1), 8:1–8:37 (2017). https://doi. org/10.1145/3038924 22. J.Y. Zhu, W.S. Zheng, J.H. Lai, S.Z. Li, Matching NIR face to VIS face using transduction. IEEE Trans. Inf. Forensics Secur. 9(3), 501–514 (2014). https://doi.org/10.1109/TIFS.2014. 2299977 23. N.D. Kalka, T. Bourlai, B. Cukic, L. Hornak, Cross-spectral face recognition in heterogeneous environments: a case study on matching visible to short-wave infrared imagery, in 2011 International Joint Conference on Biometrics (IJCB) (2011), pp. 1–8. https://doi.org/10.1109/ IJCB.2011.6117586 24. B. Klare, A.K. Jain, Heterogeneous face recognition: matching NIR to visible light images, in 2010 20th International Conference on Pattern Recognition (2010), pp. 1513–1516. https:// doi.org/10.1109/ICPR.2010.374 25. H. Chang, Y. Yao, A. Koschan, B. Abidi, M. Abidi, Spectral range selection for face recognition under various illuminations, in 2008 15th IEEE International Conference on Image Processing (2008), pp. 2756–2759. https://doi.org/10.1109/ICIP.2008.4712365 26. F. Nicolo, N.A. Schmid, Long range cross-spectral face recognition: matching SWIR against visible light images. IEEE Trans. Inf. Forensics Secur. 7(6), 1717–1726 (2012). https://doi. org/10.1109/TIFS.2012.2213813 27. H. Méndez, C.S. Martín, J. Kittler, Y. Plasencia, E. García-Reyes, Face recognition with LWIR imagery using local binary patterns, in Advances in Biometrics, ed. by M. Tistarelli, M.S. Nixon (Springer, Berlin, 2009), pp. 327–336 28. B. Thirimachos, R. Arun, C. Cunjian, H. Lawrence, A study on using mid-wave infrared images for face recognition (2012). https://doi.org/10.1117/12.918899 29. N. Short, S. Hu, P. Gurram, K. Gurton, A. Chan, Improving cross-modal face recognition using polarimetric imaging. Opt. Lett. 40(6), 882–885 (2015). https://doi.org/10.1364/OL.40. 000882, http://ol.osa.org/abstract.cfm?URI=ol-40-6-882 30. K.A. Nixon, R.K. Rowe, Multispectral fingerprint imaging for spoof detection, in Biometric Technology for Human Identification II, vol. 5779 (International Society for Optics and Photonics, Bellingham, 2005), pp. 214–226 31. R.K. Rowe, K.A. Nixon, S.P. Corcoran, Multispectral fingerprint biometrics, in Information Assurance Workshop, 2005. IAW’05. Proceedings from the Sixth Annual IEEE SMC (IEEE, Piscataway, 2005), pp. 14–20 32. D. Zhang, Z. Guo, Y. Gong, Multiple band selection of multispectral dorsal hand, in Multispectral Biometrics (Springer, Berlin, 2016), pp. 187–206 33. C. Boyce, A. Ross, M. Monaco, L. Hornak, X. Li, Multispectral iris analysis: a preliminary study51, in Conference on Computer Vision and Pattern Recognition Workshop, 2006. CVPRW’06 (IEEE, Piscataway, 2006), pp. 51–51 34. A. Ross, R. Pasula, L. Hornak, Exploring multispectral iris recognition beyond 900 nm, in IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems, 2009. BTAS’09 (IEEE, Piscataway, 2009), pp. 1–8 35. D. Zhang, Z. Guo, G. Lu, L. Zhang, W. Zuo, An online system of multispectral palmprint verification, IEEE Trans. Instrum. Measure. 59(2), 480–490 (2010) 36. Z. Guo, D. Zhang, L. Zhang, W. Liu, Feature band selection for online multispectral palmprint recognition, IEEE Trans. Inf. Forensics Secur. 
7(3), 1094–1099 (2012) 37. R. Munir, R.A. Khan, An extensive review on spectral imaging in biometric systems: challenges and advancements. arXiv preprint arXiv:1807.05771 38. R.A. Khan, Detection of emotions from video in non-controlled environment, Theses, Université Claude Bernard - Lyon I, Nov. 2013. https://tel.archives-ouvertes.fr/tel-01166539 39. R.A. Khan, A. Crenn, A. Meyer, S. Bouakaz, A novel database of children’s spontaneous facial expressions (LIRIS-CSE). Image Vision Comput. 83–84, 61–69 (2019)

240

R. Munir and R. Ahmed Khan

40. Y. LeCun, K. Kavukcuoglu, C. Farabet, Convolutional networks and applications in vision, in Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE, Piscataway, 2010), pp. 253–256 41. I. Hadji, R.P. Wildes, What do we understand about convolutional networks? CoRR abs/1803.08834 42. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in Computer Vision – ECCV 2014, ed. by D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Springer, Cham, 2014), pp. 818–833 43. T. Bourlai, B. Cukic, Multi-spectral face recognition: identification of people in difficult environments, in 2012 IEEE International Conference on Intelligence and Security Informatics (2012), pp. 196–201. https://doi.org/10.1109/ISI.2012.6284307 44. B.F. Klare, A.K. Jain, Heterogeneous face recognition using kernel prototype similarities. IEEE Trans. Pattern Anal. Machine Intell. 35(6), 1410–1422 (2013) 45. E. Angelopoulou, The reflectance spectrum of human skin, Technical Reports (CIS) (1999), p. 584 46. D.W.A. Catherine C. Cooksey, B.K. Tsai, Spectral reflectance variability of skin and attributing factors (2015). https://doi.org/10.1117/12.2184485 47. H. Chang, H. Harishwaran, M. Yi, A. Koschan, B. Abidi, M. Abidi, An indoor and outdoor, multimodal, multispectral and multi-illuminant database for face recognition, in 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06) (2006), pp. 54–54. https://doi.org/10.1109/CVPRW.2006.28 48. D.A. Socolinsky, A. Selinger, Thermal face recognition in an operational scenario, in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004, vol. 2 (IEEE, Piscataway, 2004) 49. B. Martinez, X. Binefa, M. Pantic, Facial component detection in thermal imagery, in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2010), pp. 48–54. https://doi.org/10.1109/CVPRW.2010.5543605 50. K.P. Gurton, A.J. Yuffa, G.W. Videen, Enhanced facial recognition for thermal imagery using polarimetric imaging. Opt. Lett. 39(13), 3857–3859 (2014). https://doi.org/10.1364/OL.39. 003857, http://ol.osa.org/abstract.cfm?URI=ol-39-13-3857 51. S. Hu, J. Choi, A.L. Chan, W.R. Schwartz, Thermal-to-visible face recognition using partial least squares. J. Opt. Soc. Am. A 32(3), 431–442 (2015). https://doi.org/10.1364/JOSAA.32. 000431, http://josaa.osa.org/abstract.cfm?URI=josaa-32-3-431 52. L.J. Denes, P. Metes, Y. Liu, Hyperspectral Face Database (Carnegie Mellon University, The Robotics Institute, Pittsburgh, 2002) 53. W. Di, L. Zhang, D. Zhang, Q. Pan, Studies on hyperspectral face recognition in visible spectrum with feature band selection. IEEE Trans. Syst. Man, Cybern. A: Syst Humans 40(6), 1354–1361 (2010) 54. M. Uzair, A. Mahmood, A. Mian, Hyperspectral face recognition with spatiospectral information fusion and PLS regression. IEEE Trans. Image Process. 24(3), 1127–1137 (2015). https://doi.org/10.1109/TIP.2015.2393057 55. M. Uzair, A. Mahmood, A.S. Mian, Hyperspectral face recognition using 3D-DCT and partial least squares. in British Machine Vision Conference (2013) 56. B. Zhang, L. Zhang, D. Zhang, L. Shen, Directional binary code with application to polyU near-infrared face database. Pattern Recog. Lett. 31(14), 2337–2344 (2010) 57. S. Z. Li, Z. Lei, M. Ao, The HFB face database for heterogeneous face biometrics research, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009. 
CVPR Workshops 2009 (IEEE, Piscataway, 2009), pp. 1–8 58. S.Z. Li, D. Yi, Z. Lei, S. Liao, The CASIA NIR-VIS 2.0 face database, in 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (2013), pp. 348–353. https://doi.org/10.1109/CVPRW.2013.59 59. V. Sharma, A. Diba, T. Tuytelaars, L. Van Gool, Hyperspectral CNN for image classification & band selection, with application to face recognition, Tech. rep. (Katholieke Universiteit, Leuven, 2016)

Deep Spectral Biometrics: Overview and Open Issues

241

60. A. Mahendran, A. Vedaldi, Understanding deep image representations by inverting them, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5188–5196 61. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al., Going deeper with convolutions, in Conference on Computer Vision and Pattern Recognition (2015) 62. M. Peng, C. Wang, T. Chen, G. Liu, NIRFaceNet: a convolutional neural network for nearinfrared face identification. Information 7(4), 61 (2016) 63. H. Abdi, L.J. Williams, Principal component analysis, Wiley interdisciplinary reviews. Comput. Stat 2(4), 433–459 (2010) 64. S. Wold, K. Esbensen, P. Geladi, Principal component analysis. Chemom. Intell. Lab. Syst. 2(1-3), 37–52 (1987) 65. S. Farokhi, S.M. Shamsuddin, U.U. Sheikh, J. Flusser, M. Khansari, K. Jafari-Khouzani, Near infrared face recognition by combining Zernike moments and undecimated discrete wavelet transform. Digital Signal Process. 31, 13–27 (2014) 66. S. Farokhi, U.U. Sheikh, J. Flusser, B. Yang, Near infrared face recognition using Zernike moments and Hermite kernels. Inf. Sci. 316, 234–245 (2015) 67. P. Wild, P. Radu, J. Ferryman, On fusion for multispectral iris recognition, in 8th IAPR International Conference on Biometrics (2015), pp. 31–73 68. C.K. Boyce, Multispectral iris recognition analysis: techniques and evaluation, Ph.D. thesis, Citeseer, 2006 69. A. Sharma, S. Verma, M. Vatsa, R. Singh, On cross spectral periocular recognition, in 2014 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2014), pp. 5007–5011 70. F.M. Algashaam, K. Nguyen, M. Alkanhal, V. Chandran, W. Boles, J. Banks, Multispectral periocular classification with multimodal compact multi-linear pooling. IEEE Access 5, 14,572–14,578 (2017) 71. F. Alonso-Fernandez, J. Bigun, A survey on periocular biometrics research. Pattern Recog. Lett. 82, 92–105 (2016) 72. K. Nguyen, C. Fookes, R. Jillela, S. Sridharan, A. Ross, Long range iris recognition: a survey. Pattern Recog. 72, 123–143 (2017) 73. M. De Marsico, A. Petrosino, S. Ricciardi, Iris recognition through machine learning techniques: a survey. Pattern Recog. Lett. 82, 106–115 (2016) 74. IITD iris database. http://web.iitd.ac.in/~biometrics/Database_Iris.htm 75. A. Kumar, A. Passi, Comparison and combination of iris matchers for reliable personal authentication. Pattern Recog. Lett. 43(3), 1016–1026 (2010) 76. CASIA iris image database. https://doi.org/http://biometrics.idealtest.org/ 77. K.W. Bowyer, P.J. Flynn, The ND-iris-0405 iris image dataset, arXiv preprint arXiv:1606.04853 78. University of Notre Dame, Computer Vision Research Lab. https://cvrl.nd.edu/projects/data/ 79. A. Rattani, R. Derakhshani, S.K. Saripalle, V. Gottemukkula, ICIP 2016 competition on mobile ocular biometric recognition, in 2016 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2016), pp. 320–324 80. G. Santos, E. Grancho, M.V. Bernardo, P.T. Fiadeiro, Fusing iris and periocular information for cross-sensor recognition. Pattern Recog. Lett. 57, 52–59 (2015) 81. S. Minaee, A. Abdolrashidiy, Y. Wang, An experimental study of deep convolutional features for iris recognition, in Signal Processing in Medicine and Biology Symposium (SPMB), 2016 IEEE (IEEE, Piscataway, 2016), pp. 1–6 82. Convolutional Neural Networks for visual recognition. http://cs231n.github.io/convolutionalnetworks/#comp 83. K. Simonyan, A. 
Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 84. R. Farouk, Iris recognition based on elastic graph matching and Gabor wavelets. Comput. Vis. Image Underst. 115(8), 1239–1244 (2011)

242

R. Munir and R. Ahmed Khan

85. S. Minaee, A. Abdolrashidi, Y. Wang, Iris recognition using scattering transform and textural features, in 2015 IEEE Signal Processing and Signal Processing Education Workshop (SP/SPE) (IEEE, Piscataway, 2015), pp. 37–42 86. A. Bosch, A. Zisserman, X. Munoz, Representing shape with a spatial pyramid kernel, in Proceedings of the 6th ACM international conference on Image and video retrieval (ACM, New York, 2007), pp. 401–408 87. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, vol. 1 (IEEE, Piscataway, 2005), pp. 886–893 88. T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions. Pattern Recog. 29(1), 51–59 (1996) 89. T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: application to face recognition. IEEE Trans. Pattern Anal. Mach. Intel. 28(12), 2037–2041 (2006) 90. L. Wolf, T. Hassner, Y. Taigman, Descriptor based methods in the wild, in Workshop on faces in ’real-life’ images: detection, alignment, and recognition, (2008) 91. A. Gangwar, A. Joshi, DeepIrisNet: deep iris representation with applications in iris recognition and cross-sensor iris recognition, in 2016 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2016), pp. 2301–2305 92. R. Garg, Y. Baweja, S. Ghosh, M. Vatsa, R. Singh, N. Ratha, Heterogeneity aware deep embedding for mobile periocular recognition. arXiv preprint arXiv:1811.00846 93. E. Tola, V. Lepetit, P. Fua, Daisy: an efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intel. 32(5), 815–830 (2010) 94. K. Bensid, D. Samai, F.Z. Laallam, A. Meraoumia, Deep learning feature extraction for multispectral palmprint identification. J. Electr. Imag. 27(3), 033018 (2018) 95. A. Meraoumia, L. Laimeche, H. Bendjenna, S. Chitroub, Do we have to trust the deep learning methods for palmprints identification?, in Proceedings of the Mediterranean Conference on Pattern Recognition and Artificial Intelligence (ACM, New York, 2016), pp. 85–91 96. A. Meraoumia, F. Kadri, H. Bendjenna, S. Chitroub, A. Bouridane, Improving biometric identification performance using PCANET deep learning and multispectral palmprint, in Biometric Security and Privacy (Springer, Berlin, 2017), pp. 51–69 97. T. Connie, A.T.B. Jin, M.G.K. Ong, D.N.C. Ling, An automated palmprint recognition system. Image Vis. Comput. 23(5), 501–515 (2005) 98. D. Zhang, W. Zuo, F. Yue, A comparative study of palmprint recognition algorithms. ACM Comput. Surv. 44(1), 2 (2012) 99. Y. Hao, Z. Sun, T. Tan, Comparative studies on multispectral palm image fusion for biometrics, in Asian Conference on Computer Vision (Springer, Berlin, 2007), pp. 12–21 100. Y.-T. Luo, L.-Y. Zhao, B. Zhang, W. Jia, F. Xue, J.-T. Lu, Y.-H. Zhu, B.-Q. Xu, Local line directional pattern for palmprint recognition. Pattern Recog. 50, 26–44 (2016) 101. R. Raghavendra, C. Busch, Novel image fusion scheme based on dependency measure for robust multispectral palmprint recognition. Pattern Recog. 47(6), 2205–2221 (2014) 102. Y. Hao, Z. Sun, T. Tan, C. Ren, Multispectral palm image fusion for accurate contact-free palmprint recognition, in 15th IEEE International Conference on Image Processing, 2008. ICIP 2008 (IEEE, Piscataway, 2008), pp. 281–284 103. PolyU multispectral palmprint database. http://www.comp.polyu.edu.hk/~biometrics/ MultispectralPalmprint/MSP.htm 104. 
D.D. Zhang, W. Kong, J. You, M. Wong, Online palmprint identification. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1041 (2003) 105. D. Han, Z. Guo, D. Zhang, Multispectral palmprint recognition using wavelet-based image fusion, in 9th International Conference on Signal Processing, 2008. ICSP 2008 (IEEE, Piscataway, 2008), pp. 2074–2077 106. Z. Guo, D. Zhang, L. Zhang, W. Zuo, G. Lu, Empirical study of light source selection for palmprint recognition. Pattern Recog. Lett. 32(2), 120–126 (2011) 107. CASIA-ms-palmprintv1. http://biometrics.idealtest.org/

Deep Spectral Biometrics: Overview and Open Issues

243

108. M.D. Bounneche, L. Boubchir, A. Bouridane, B. Nekhoul, A. Ali-Chérif, Multi-spectral palmprint recognition based on oriented multiscale log-Gabor filters. Neurocomputing 205, 274–286 (2016) 109. L. Fei, B. Zhang, Y. Xu, L. Yan, Palmprint recognition using neighboring direction indicator. IEEE Trans. Human-Mach. Syst. 46(6), 787–798 (2016) 110. L. Fei, Y. Xu, W. Tang, D. Zhang, Double-orientation code and nonlinear matching scheme for palmprint recognition. Pattern Recog. 49, 89–101 (2016)

Biometric Blockchain: A Secure Solution for Intelligent Vehicle Data Sharing Bing Xu, Tobechukwu Agbele, and Richard Jiang

1 Introduction Intelligent vehicles (IVs) [1–5] are characterized by their capability for real-time data processing and sharing over the internet of vehicles, which enables IVs to be integrated into the pervasive networks in smart cities. IV data sharing faces challenges in two scenarios, Vehicle-to-Vehicle (V2V) data sharing and Vehicle-to-Infrastructure (V2I) data sharing [1], and a secure protocol is needed to regulate any safety-critical incidents and hazards. The information gathered from the feedback of nearby vehicles is often vulnerable to security attacks, and incorrect feedback can result in higher congestion and severe hazards [2]. In particular, a lack of information on end users or stakeholders will undermine the system and make it hard to pin down the responsible party. In IV data sharing networks, security is a critical issue during V2V or V2I communication, and these networks heavily demand a reliable biometric credit (BC) protocol that provides credible trust and privacy [3]. In this chapter, we propose to use biometrics as an ID-based credit element and to build a credit-based trust system for transmitting reliable data over the IV data sharing platform. Here, the credit is associated with a unique biometric ID, which is attached to the message format and transmitted during V2V or V2I communication. The blockchain based cloud storage manages the IV-BC protocol and is accessible ubiquitously by the credited end users. The IV-BC mechanism using biometric blockchain can create unique crypto IDs from biometric cues, self-executing digital contracts, and detailed information on IVs under the control of the blockchain cloud [4].

B. Xu Computer and Information Sciences, Northumbria University, Newcastle upon Tyne, UK e-mail: [email protected]; [email protected] T. Agbele · R. Jiang () Computing and Communication, Lancaster University, Lancaster, UK e-mail: [email protected]; [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_11


Fig. 1 Intelligent vehicles in the simulation of routed peer-to-peer V2I communication as well as V2V communication

Figure 1 shows the standard intelligent vehicle information sharing environment with both V2V and V2I communication. Previously, many researchers attempted to combine automotive and blockchain technologies [3], considering applications based on services and smart contracts. In this chapter, we focus on the issues of secure and fast communication between intelligent vehicles and propose to exploit biometric blockchain (BBC) to build a robust trust environment for IV data sharing, since the combination of biometrics and blockchain technology enables a more secure protocol through the additional biometric information. We organize our chapter as follows. Section 2 presents the motivation for using a biometric blockchain based trust environment for data sharing among intelligent vehicles instead of traditional security methods. Section 3 introduces biometric blockchain technology for intelligent vehicle data sharing. Section 4 presents the detailed architecture of the BBC-based IV data sharing framework. Section 5 concludes the chapter and discusses our future work on BBC-based IV data sharing.

2 Motivation of Our Work Currently, intelligent transport systems (ITS) use ad-hoc networks for vehicle communication, such as DSRC, WAVE, and cellular networks, which do not guarantee secure data transmission. Security protocols for vehicle communication


applications based on cellular and IT standard security mechanisms are often fragile to malicious attacks and also out of date, with little suitability for real ITS applications. Many ITS researchers are therefore advocating new protocols to provide a robust security mechanism for ITS. Our proposed BBC-based mechanism has many advantages: it is easy to implement; it can easily trace and link services to specific customers; it is based on peer-to-peer communication; and it provides a secure and trusted environment for vehicle communication, with an immutable database and ubiquitous data access in a secure way. Our proposal is based on the simple concept of using a biometric blockchain based trust environment for data sharing among intelligent vehicles through Intelligent Vehicle Biometric Credits (IV-BC). We exploit the features of blockchain such as distributed and open ledgers, cryptographic protection with the Merkle tree and hash function (SHA-256), and a consensus mechanism (proof-of-work algorithm). More details are included in the following sections.

3 Biometric Blockchain for Intelligent Vehicle Systems 3.1 Blockchain Technology A blockchain consists of distributed open ledgers saved at each node in a peer-to-peer network and self-maintained by each node. It operates over a peer-to-peer network without the interference of a third party. The blockchain's integrity is based on a strong cryptographic algorithm that validates blocks of transactions and then chains them together, making it impossible to tamper with any individual transaction without being detected. Figure 2 shows an overview of such blockchain technology. Within the core of a blockchain, shared ledgers allow decentralized storage over peer-to-peer networks, a cryptographic algorithm secures the ledgers from any malicious tampering, the consensus protocol reinforces data integrity through all-party validation, and the smart contract mechanism allows ledger entries to be executed within data transmission.
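To make the chaining idea concrete, the following is a minimal, illustrative Python sketch (not part of the original proposal) showing how blocks of transactions can be linked by SHA-256 hashes so that altering any earlier transaction breaks every later link; the field names and messages are hypothetical.

```python
import hashlib
import json
import time

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

def make_block(transactions, previous_hash):
    """Build a block whose hash covers its transactions and the previous block's hash."""
    header = {
        "timestamp": time.time(),
        "previous_hash": previous_hash,
        "transactions": transactions,
    }
    block_hash = sha256_hex(json.dumps(header, sort_keys=True).encode())
    return {"header": header, "hash": block_hash}

# Build a tiny chain: tampering with block 1 would change its hash and
# break the "previous_hash" link stored in block 2.
genesis = make_block(["genesis"], previous_hash="0" * 64)
block1 = make_block(["vehicle A -> infrastructure: hazard report"], genesis["hash"])
block2 = make_block(["vehicle B -> vehicle A: congestion update"], block1["hash"])

print(block2["header"]["previous_hash"] == block1["hash"])  # True: blocks are chained
```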

3.2 Blockchain Technology in Vehicles The blockchain technology has also been proposed for ITS to implement a secured, trusted, and decentralized autonomous ecosystem [7], with a seven-layer conceptual model of blockchain. Leiding et al. [8] proposed blockchain technology for vehicular ad-hoc networks (VANETs). The introduction of blockchain enables smart contracts within the vehicular ad-hoc network and promotes the combination of multiple applications, including mandatory applications (traffic regulation, vehicle


Fig. 2 Blockchain technology [6] with features such as shared ledger, cryptography, signed blocks of transactions, and digital signatures

insurance, vehicle tax, etc.) and optional applications (which provide information and updates on weather forecasts, traffic guidance, entertainment, e-commerce access, and other smart city functions), while blockchain technology can facilitate the integration of these services with trust and privacy. Blockchain can provide peer-to-peer communication without disclosing personal information, strengthen the security of data sharing, and secure multiple communications between vehicles. In Ref. [9], a blockchain based mechanism was proposed that does not disclose any private information of vehicle users when providing and updating remote wireless software and other vehicle services. Rowan et al. [10] exploited blockchain technology for securing IV communication through visible light and acoustic side channels. Their proposed mechanism was successfully verified through a session cryptographic key, leveraging both side-channels and a blockchain public key infrastructure. In this chapter, we exploit our biometric blockchain mechanism for the IV communication environment. We propose a new framework using biometrics


for a secure credit-based environment with peer-to-peer communication between intelligent vehicles, without interfering with or disturbing other intelligent vehicles.

3.3 Biometric Blockchain for Intelligent Vehicles Despite the seemingly reliable and convenient services offered by the features of blockchain technology, there is a host of security concerns and issues [11–20], and understanding them is relevant to the research community, governments, investors, regulators, and other stakeholders. Given the complexity of blockchain deployments and any centralized infrastructure around them, it is not a surprise that there are downsides, especially in sharing data among different systems or regions; securing data for a centralized infrastructure is a challenging task, since potential attacks and exploits lead to a single point of contact that requires trust in an individual authority. This implies that more research effort is needed to assure that information is secured in terms of privacy and that only authorized users are able to access the data. To address these challenges, recent research [20] has started to think about bringing biometrics [21, 22] into blockchain, with the hope of achieving better security, scalability, and privacy. Based on this initiative, we propose a new blockchain framework, namely the biometric blockchain (BBC), for IV data sharing. Figure 3 shows our proposed framework, where biometrics is used to sign off ledgers and generate unique IDs that can be associated with individuals. Cloud-based biometrics-as-a-service (BaaS) is engaged to provide extra support to local IV data sharing. The benefits of such a BBC-based framework are apparent. First, we can easily associate activities in IV data sharing with a specific individual or customer, and the services can then be easily tailored to specific users. Moreover, a personal credit-based system can be associated with user-specific services, and the security of IV data sharing is then doubly guaranteed by the individual credit system. Hence, BBC as proposed in Fig. 3 could be a valuable solution for practical applications in IV data sharing.
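As an illustration of attaching a biometric-derived identifier to a ledger record, the sketch below is an assumption for illustration only, not the authors' implementation: it derives a stable crypto ID from a (hypothetical) biometric feature vector and embeds it in an IV data sharing message. Real systems would need template-protection schemes robust to biometric noise; simple quantization is used here only to keep the example short.

```python
import hashlib
import hmac
import json

def biometric_id(feature_vector, salt: bytes) -> str:
    """Derive a stable identifier from a quantized biometric feature vector.

    Only a salted HMAC-SHA256 digest is kept; the raw template is never stored.
    (Quantization is a crude stand-in for a proper template-protection scheme.)
    """
    quantized = ",".join(f"{x:.2f}" for x in feature_vector)
    return hmac.new(salt, quantized.encode(), hashlib.sha256).hexdigest()

def signed_record(payload: dict, feature_vector, salt: bytes) -> dict:
    """Attach the biometric-derived ID to a V2V/V2I message before it enters the ledger."""
    record = dict(payload)
    record["biometric_id"] = biometric_id(feature_vector, salt)
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    return record

# Hypothetical embedding of the driver (would come from a face/fingerprint model).
embedding = [0.12, -0.53, 0.88, 0.07]
msg = {"from": "IV-042", "to": "RSU-7", "event": "brake warning"}
print(signed_record(msg, embedding, salt=b"fleet-secret"))
```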

4 BBC-Based Credit Environment for IV Data Sharing In this chapter, we propose a credit-based IV communication using biometric blockchain technology. Our proposed mechanism consists of three basic components: network enabled connected devices, vehicular cloud computing (VCC), and the biometric blockchain (BBC). Figure 4 shows the complete data sharing environment for intelligent vehicles. Vehicles on the roads communicate with the proposed BBC-based IV data sharing platform, where vehicles can be connected to the BBC-based platform and store their IV-BC over BBC-based cloud services. More details are explained below.


Fig. 3 The proposed BBC framework for IV data sharing

4.1 Vehicular Cloud Computing Vehicular cloud computing (VCC) has a remarkable impact on traffic management and road safety by instantly using vehicle resources, such as computing, data storage, and internet decision-making. It is a hybrid technology that integrates IV data management with cloud computing. Hence, it faces new challenges coming from both local wireless communication and remote cloud services.

4.2 Network Enabled Connected Devices Each intelligent vehicle connects itself to the infrastructure and to other vehicles via an internet-enabled, biometric-certificated device. Such an Internet-of-Things (IoT) device with biometric identification, for example a smartphone, a PDA, or the intelligent vehicle itself, can authenticate, approve, and organize any instant communication over the VANET.


Fig. 4 Proposed BBC-based intelligent vehicle communication

4.3 BBC-Supported Intelligent Vehicles Biometric blockchain consists of a technically unlimited number of blocks or ledgers that are chained together cryptographically in chronological order, with individual biometric signatures. Each block consists of transactions, which are the actual data to be stored in the chain. As shown in Fig. 5, the overall system is a seven-layer conceptual model of the standard architecture for the intelligent communication network. Here, we briefly explain the key features of the proposed BBC-based network model, as below.

Fig. 5 The proposed BBC-based intelligent vehicle communication network framework

Fig. 6 The structure of blocks in a blockchain

1. Physical layer: This layer presents the communication network enabled devices, such as IoT devices, fingerprint ID devices, face ID devices, cameras, GPS, PDAs, and other devices on intelligent vehicles. All these devices can be involved in dynamic V2I and V2V communications with the blockchain mechanism.
2. Data layer: This layer processes the data blocks or ledgers with cryptographic features such as the Merkle tree and hash algorithm to generate secured blocks. The typical structure of a block, as shown in Fig. 6, consists of a header part that specifies the previous hash and nonce together with the current hash (Merkle root), followed by a Merkle tree. Hash keys are generated by the double SHA-256 algorithm (a minimal sketch of this computation is given after this list).
3. Network layer: This layer handles the forwarding of data over peer-to-peer network communication. It also copes with the verification of the communication, while the legality of the broadcasted message is verified and managed over the peer-to-peer connection between two IVs.
4. Handshake layer: The handshake layer is also called the consensus layer in the biometric blockchain based framework. It provides decentralized control of the network communication and helps to build up trust between unknown users in the wild communication environment using biometric credits. Typically in IV communication networks, proof of driving (PoD) is a preferable consensus algorithm, which verifies and validates the vehicles involved in the communication networks. In our BBC framework, this can be traced down to end users with their biometric credits.
5. Biometric credit layer: This layer handles the IV-BC issues proposed in this chapter. The proposed IV-BC protocol assigns crypto data to each vehicle, and the consensus competition favours a vehicle with higher credits on record. The winner is rewarded an extra credit associated with its biometric ID. The vehicle with the maximum IV-BC credit leads the communication network. Such an IV-BC credit system helps to build a reliable user-credited trust environment in vehicle communication.
6. Presentation layer: This layer encapsulates the multiple scripts, contracts, and algorithms that are provided by the vehicles and users involved in the network.
7. Service layer: This layer contains the scenarios and use cases of the IV communication system.
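The block-header computation described in the data layer can be sketched as follows. This is a generic illustration of a Merkle root built with double SHA-256 (as used in Bitcoin-style block headers), not code from the proposed system, and the transaction strings are made up; duplicating the last hash on odd levels is likewise an assumption borrowed from that convention.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice, as used for block and transaction hashing."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root(transactions) -> str:
    """Reduce transaction hashes pairwise until a single Merkle root remains."""
    level = [double_sha256(tx.encode()) for tx in transactions]
    if not level:
        return ""
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last hash when the level is odd
            level.append(level[-1])
        level = [double_sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()

def block_header(previous_hash: str, transactions, nonce: int) -> dict:
    """Header containing the previous hash, nonce, and current Merkle root."""
    return {
        "previous_hash": previous_hash,
        "nonce": nonce,
        "merkle_root": merkle_root(transactions),
    }

print(block_header("00" * 32, ["tx-a", "tx-b", "tx-c"], nonce=42))
```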

It is worth noting that many research organizations and startups are implementing blockchain in different areas. In our proposed scheme, the novelty is to include personal biometric information to authorize responsibility and secure the environment by tracing the history of the associated key managers or users.

5 Privacy Issues in Biometric Blockchain Biometrics could help secure data in a blockchain. On the other hand, it is also sensitive information that may expose the privacy of individuals. The deployment of a BBC-based framework will inevitably require biometric information from individuals, such as managers or users. Hence, BBC obviously needs a privacy-protecting mechanism when biometric information is collected from IVs. Recent research has addressed these privacy issues with an elegant solution based on encrypted biometrics [21–23]. As shown in Fig. 7, a set of biometric features such as faces can be encrypted and combined into a ledger as an encrypted signature. Modern techniques [21–23] have revealed that we do not need to decrypt this biometric information and can verify it directly in its encrypted domain, making the


Fig. 7 Privacy protection using encrypted biometrics [21]

use of biometrics much less sensitive in terms of privacy concerns when it is used for blockchain technology. The implementation of BBC may be associated with online biometric verification and can be based on multimodal biometrics [24]. Classically, feature-based biometric verification (using features such as LBP or SIFT) [25, 26] is popular, though it can be time-consuming. Recently, new approaches such as deep neural networks [27–30] have been taking over the area of biometric verification. Such verification can be carried out on cloud servers, leading to a new topic called biometrics-as-a-service (BaaS). Concerns may be raised about computing resources when biometrics-as-a-service is accessed via cloud platforms, as shown in Fig. 5. However, in a BBC-based platform, biometrics is verified only when necessary. This implies that the biometric information can mostly remain dormant, and hence the requirement for extra computing is minimized.
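For completeness, a common way such deep-feature verification is performed, whether locally or as a cloud BaaS call, is to compare fixed-length embeddings against a similarity threshold. The sketch below is a generic illustration with made-up numbers and an assumed threshold, not the cited methods.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe: np.ndarray, enrolled: np.ndarray, threshold: float = 0.6) -> bool:
    """Accept the claimed identity if the embeddings are similar enough."""
    return cosine_similarity(probe, enrolled) >= threshold

enrolled = np.array([0.9, 0.1, -0.3, 0.4])   # stored at enrolment time
probe = np.array([0.85, 0.15, -0.25, 0.38])  # produced by the deep model at verification time
print(verify(probe, enrolled))               # True for this toy example
```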

6 Conclusion In conclusion, we have presented a reward credit-based intelligent vehicle communication framework, where biometric blockchain is engaged to secure IV data sharing and associate any activities or services with key individuals. Such a BBC-based IV-BC protocol provides fast and credible communication between IVs. It also helps to trace the detailed history of the communication among IVs and end users, and a biometrics-based credit recording system can be associated with an IV and its associated individuals. Such detailed records can be made available to various user-specific services such as IV service providers, insurance companies, entertainment providers, and other smart city functions.


References 1. G. Yan, S. Olariu, A probabilistic analysis of link duration in vehicular ad hoc networks. IEEE Trans. Intell. Transp. Syst. 12(4), 1227–1236 (2011) 2. D. Singh, M. Singh, I. Singh, H.J. Lee, Secure and reliable cloud networks for smart transportation services, in The 17th International Conference on Advanced Communication Technology (ICACT 2015), Seoul, 2015, pp. 358–362. https://doi.org/10.1109/ICACT.2015.7224819 3. S. Olariu, M. Eltoweissy, M. Younis, Toward autonomous vehicular clouds. ICST Trans. Mobile Commun. Comput. 11(7–9), 1–11 (2011) 4. C. Wang, Q. Wang, K. Ren, W. Lou, Privacy-preserving public auditing for data storage security in cloud computing, in Proceedings of the IEEE INFOCOM, San Diego, CA, 2010, pp. 1–9 5. M. Singh, D. Singh, A. Jara, Secure cloud networks for connected & automated vehicles, in 2015 International Conference on Connected Vehicles and Expo (ICCVE, Shenzhen, 2015), pp. 330–335. https://doi.org/10.1109/ICCVE.2015.94 6. S. Nakomoto, Bitcoin: a peer-to-peer electronic cash system (2009). BITCOIN.ORG, p. 3 7. Y. Yuan, F.-Y. Wang, Towards blockchainbased intelligent transportation systems, in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Windsor Oceanico Hotel, Rio de Janerio, Brazil, 1–4 Nov 2016 8. B. Leiding, P. Memarmoshrefi, D. Hogrefe, Self-managed and blockchain-based vehicular adhoc networks, in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct (UbiComp ’16) (ACM, New York, 2016), pp. 137–140 9. A. Dorri, M. Steger, S.S. Kanhere, R. Jurdak, Blockchain: a distributed solution to automotive security and privacy. IEEE Commun. Mag. 55, 119 (2017) 10. S. Rowan, M. Clear, M. Huggard, C. Mc Goldrick, Securing vehicle to vehicle data sharing using blockchain through visible light and acoustic side-channels, eprint arXiv:1704.02553, 2017 11. M. Crosby et al., BlockChain technology: beyond bitcoin, Sutardja Center for Entrepreneurship & Technology Technical Report, University of California, Berkeley, 2015 12. C. Jaikaran, Blockchain: background and policy issues (Congressional Research Service, Washington, DC, 2018) 13. H. Kakavand et al., The blockchain revolution: an analysis of regulation and technology related to distributed ledger technologies, Luther Systems & DLA Piper, 2016 14. O. Mazonka et al., Blockchain: simple explanation, J. Ref., 2016, http://jrxv.net/x/16/chain.pdf 15. D. Tapscott, A. Tapscott, Blockchain revolution: how the technology behind bitcoin is changing money, business and the world (Portfolio Penguin, London, 2016) 16. K. Saito, H. Yamada, What’s so different about blockchain? Blockchain is a probabilistic state machine, in IEEE 36th International Conference on Distributed Computing Systems Workshops, Nara, Japan, 2016, pp. 168–175 17. S. Raval, Decentralized applications: harnessing bitcoin’s blockchain technology (Oreilly, Beijing, 2016). ISBN 9781491924549 18. I. Bashir, Mastering blockchain (Packet Publishing, Birmingham, 2017) 19. D. Puthal et al., Everything you wanted to know about the blockchain. IEEE Consum. Electron. Mag. 7(4), 6–14 (2018) 20. P. Garcia, Biometrics on the blockchain. Biometric Technol. Today 5, 5–7 (2018) 21. R. Jiang et al., Emotion recognition from scrambled facial images via many graph embedding. Pattern Recognit. 67, 245–251 (2017) 22. R. Jiang et al., Face recognition in the scrambled domain via salience-aware ensembles of many kernels. IEEE Trans. Inf. 
Forensics Security 11(8), 1807–1817 (2016) 23. R. Jiang et al., Privacy-protected facial biometric verification via fuzzy forest learning. IEEE Trans. Fuzzy Syst. 24(4), 779–790 (2016) 24. R. Jiang et al., Multimodal biometric human recognition for perceptual human–computer interaction. IEEE Trans. Syst. Man Cybern. Part C 40(5), 676 (2010)


25. R. Jiang et al., Face recognition in global harmonic subspace. IEEE Trans. Inf. Forensics Security 5(3), 416–424 (2010) 26. R.M. Jiang et al., Live-cell tracking using SIFT features in DIC microscopic videos. IEEE Tran. Biomed Eng. 57, 2219–2228 (2010) 27. Z. Jiang et al., Social behavioral phenotyping of drosophila with a 2D-3D hybrid CNN framework. IEEE Access 7, 67972–67982 (2019) 28. G. Storey, R. Jiang, S. Keogh, A. Bouridane, C-T Li, 3DPalsyNet: A facial palsy grading and motion recognition framework using fully 3D convolutional neural networks, IEEE Access, Vol. 7, 2019. https://doi.org/10.1109/ACCESS.2019.2937285 29. G. Storey et al., Integrated deep model for face detection and landmark localisation from ‘in the wild’ images. IEEE Access 6, 74442–74452 (2018) 30. G. Storey et al., Role for 2D image generated 3D face models in the rehabilitation of facial palsy. IET Healthcare Technol. Lett. 4, 145 (2017)

Optical Flow Estimation with Deep Learning, a Survey on Recent Advances Stefano Savian, Mehdi Elahi, and Tammam Tillo

1 Introduction Biometric systems attempt to detect and identify people based on certain physiological characteristics, e.g., fingerprints and face, or even behavioral characteristics, e.g., signature and gait. In recent years, there has been a large development in biometric systems, thanks to advances in deep learning (DL). DL techniques leverage the hierarchical architectures to learn discriminative representations and have contributed to some of the top performing biometric techniques. These techniques have also fostered numerous successful real-world biometric applications, e.g., face recognition and face identification [95]. Gait recognition can be a good example of a behavioral biometrics that uses the shape and motion cues of a walking person for identification. Gait could be performed at a distance, in contrast to the other biometric approaches such as fingerprint or iris scan [52]. The shape features are captured during gait phases, while motion features get captured during the transition between these phases. Still there are challenges in gait recognition, including variations in clothing, footwear, carrying objects, complicated background, and walking speed [87, 95]. Motion clues can be obtained, even without involving additional hardware (e.g., accelerometers and lidar [94]), instead by extracting motion from the captured video [52]. For example, Xiao et al. [102] obtained the very good performance by explicitly using the motion information, i.e., optical flow, as the input in a simple pose estimation framework. Motion estimation involves the estimation of optical flow which is the projection of 3D motion into the 2D plane of the camera. Nonetheless, the “global”

S. Savian () · M. Elahi · T. Tillo Free University of Bozen-Bolzano, Bolzano, Italy e-mail: [email protected]; [email protected]; [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_12


motion provided by the optical flow is composed of the motion of the objects in the scene and the ego-motion, i.e., the motion of the camera. Optical flow estimation has a long history, and much research has been carried out since the pioneering methods of Horn and Schunck [30] and Lucas and Kanade [50] were published in 1981. Hence, through more than three decades of history, there has been massive improvement in the techniques used for different aspects of optical flow estimation.1 In the particular case of small displacements, the problem of optical flow estimation has been almost completely solved [22]. The remaining challenges can be listed as: (1) fast motion, (2) illumination changes, (3) occlusions, and (4) untextured regions. Under these challenges, the optical flow estimation problem becomes ill-posed and hard to treat analytically. However, very recent approaches have come after the wave of DL progress, which has had a massive impact on this field of research.

Still, this chapter takes advantage of the noted survey [22] to review not only the current state-of-the-art optical flow estimation techniques but also to highlight the benefits and limitations of traditional approaches. Moreover, in order to motivate and explain the reasons behind the success of DL based optical flow estimation, the authors will show how some top performing classical methods resemble a deep structure akin to convolutional neural networks.

To sum up, to obtain high-level understanding of video contents for computer vision tasks, it is essential to know the object status (e.g., location and segmentation) and motion information (e.g., optical flow) [17]. Hence motion estimation is a valid data source to obtain non-intrusive and remote high quality biometrics. Optical flow estimation is a fast evolving field of computer vision and, to the authors' knowledge, there is only one comprehensive review, by Tu et al. [92], which surveys classical and DL based methods for optical flow. Nonetheless, this chapter provides a (more systematic) comparison of DL methods, gives a slightly different categorization of DL methods (compared to the related works), and introduces a novel class of Hybrid methods. Readers are kindly pointed to the survey by Tu et al. [92] for more detailed descriptions of performance measures for optical flow, the optical flow color code, and optical flow applications (besides biometric ones). In conclusion, this chapter briefly reviews optical flow estimation methods, which can be used mainly in biometric applications. More particularly, it focuses on how the field of optical flow is evolving, what the benefits and limitations of DL based methods are, and what DL and classical approaches have in common.

The rest of the chapter is organized as follows: In Sect. 2, a brief introduction to DL is provided. In Sect. 3, optical flow estimation is discussed by mainly reviewing the traditional approaches. Then DL approaches for optical flow are described and compared. This section continues with introducing the hybrid approaches. Section 4 provides further discussions and lists a number of applications of optical flow estimation in biometrics. The chapter is finalized with Sect. 5 by providing the conclusion.

1 In a valuable work, early history of the field has been surveyed by Fortun et al. [22]. This work reviews the traditional works, and does not cover the very recent progresses.


2 Deep Learning Deep learning (DL) is a class of signal processing architectures which consist of connecting and stacking different convolutional layers and non-linear activation functions in order to generate flexible predictive models. These models are typically tuned by the Backpropagation algorithm using the target information to indicate how much the (internal) parameters should be updated [45]. Deep learning is blooming and it already enables noticeable steps forward in various engineering applications, and influencing many signal processing fields, e.g., Image Classification, Natural Language Processing (NLP), and Time Series Analysis. A notable comprehensive overview of the deep learning field has been authored by LeCun et al. [45]. The book from Goodfellow et al. is also considered a milestone on this topic [26]. It covers the main techniques, including the Convolutional Neural Network (CNN), that is heavily used in computer vision. There is a lot of material on deep learning models and applications and the reader is invited to further investigate the above mentioned papers or books, if interested for more details. Here existing (generic) architectures which have been tailored to optical flow estimation are briefly reviewed. It is assumed that the reader has a basic knowledge on the field, as it is not possible to describe all the technical details. • The state-of-the-art CNN for image recognition: the pioneering work on this field is LeNet from LeCun et al. [44], a seven layer CNN, used for digit recognition. Subsequently, The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [78] was introduced to foster the development of models for visual recognition. AlexNet (see Krizhevsky et al. [42]) uses a similar architecture as LeNet, but it is deeper, with more filters per layer, and with stacked convolutional layers. GoogleNet [88] also known as InceptionNet is inspired by LeNet but implemented a novel element which they called inception module, GoogleNet achieved very high performance in ILSVRC 2014. VGGNet [81] consists of 16 convolutional layers and a very uniform architecture, however, it has a massive number of parameters (i.e., 138 millions) to train. The winner of ILSVRC 2015 is residual neural network (ResNet) [29]. This architecture introduced skip connections between convolutional layers. Thanks to this innovative technique, they were able to train a network with 152 layers while still having lower complexity in comparison to VGGNet. It achieves a top five error rate beating other baselines on ImageNet dataset. • Fully Convolutional Networks (FCNs): CNN models where every layer is convolutional, FCNs obtain very good performance in image segmentation, e.g., Long et al. [47]. Additionally, U-Nets are FCNs specifically designed to produce accurate segmentation even with a relatively small dataset [77]. • Siamese Networks [40] (originally introduced by Bromley et al. [10]) are a class of neural network architectures that contain two or more identical subnetworks. The subnetworks have the same configuration and share the same parameters. Parameter updating is mirrored across both subnetworks. Sharing weights across


subnetworks leads to less learnable parameters, thus less tendency to overfit. Siamese Network’s output is usually a one dimensional feature vector which indicates a similarity or a relationship between two comparable (input) things. • Generative Adversarial Networks (GANs): originally introduced by Goodfellow et al. [25] are generative algorithms that can generate new data instances by learning the distribution of the input data. Differently from the previous architecture which are discriminative, GANs can learn to produce new data. One network, called the generator, generates new data instances, while the other, the discriminator, evaluates them for authenticity. Although the deep learning based optical flow architectures could vary dramatically, still all the techniques presented in this chapter are drawn from the above mentioned models.

3 Optical Flow Before discussing the optical flow further, we note that, traditionally, optical flow estimation relies on two assumptions:
• Assumption 1: pixel intensity remains unchanged along the motion trajectory;
• Assumption 2: motion appears locally as a translation.
Let us denote by D the motion displacement vector in a two-dimensional space and by T the temporal sampling step. Given the vector D with the positional displacement between two frames, we can compute the motion velocity vector V according to the following formula:

V = \lim_{T \to 0} \frac{D}{T} \quad (1)

w(x, y, t) = \big( u(x, y, t),\; v(x, y, t),\; 1 \big) \quad (2)

w^{\top} \nabla f(x, y, t) + \frac{\partial f(x, y, t)}{\partial t} = 0 \quad (3)


Equation 3 is known as the Optical Flow Equation, where f is a three dimensional spatio-temporal field depending on the coordinates x, y and time t, and u and v are the displacements, respectively, along x and y axes (a detailed explanation can be found in [68]). The optical flow is defined as a two layers matrix with the same height and width of the input frame, where each of the two layers gives the offset of each pixel movement, where layer v is along y axis and layer u along x axis.
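As a concrete illustration of this two-channel (u, v) representation, the snippet below computes dense optical flow between two frames with OpenCV's Farnebäck method. The file names are placeholders, and the example is only meant to show the output format of a dense flow field, not any particular method discussed in this survey.

```python
import cv2
import numpy as np

# Load two consecutive frames as grayscale images (hypothetical file names).
prev_frame = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: the result has shape (H, W, 2), holding the
# horizontal (u) and vertical (v) displacement of every pixel.
flow = cv2.calcOpticalFlowFarneback(
    prev_frame, next_frame, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

u, v = flow[..., 0], flow[..., 1]
magnitude, angle = cv2.cartToPolar(u, v)
print(flow.shape, float(np.mean(magnitude)))  # (H, W, 2) and the average displacement
```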



One of the earliest families of techniques proposed to solve the optical flow equation (Eq. 3) is that of Variational Methods. As an example, the Horn-Schunck (HS) [30] approach adopted the minimization of a cost function based on the mean squared error with a regularization term. A work by Lucas and Kanade [50] from the same year proposed the minimization of an iterative cost function under slightly different assumptions, i.e., the velocity vector V being constant in local patches, giving rise to Patch Based Methods. However, both approaches have drawbacks. For example, the motion between two frames must be sufficiently small, the equations increase noise when discretized, and locality assumptions result in poor motion accuracy [19].

3.1 Traditional Methods In this section, we discuss mainly the traditional methods for optical flow estimation (see Table 1). Table 1 Overview of handcrafted methods; the handcrafted methods mentioned here are not realtime Paper year Ref. DeepFlow 2013 [97]

EpicFlow 2015 [74]

RicFlow 2017 [32] FlowFields 2015 [4] DenseFlow 2013 [84]

ProbFlow 2017 [96] DiscreteFlow 2009 [59] CPM 2016 [31]

Contribution DeepMatching Fine-to-coarse image pyramids Improved matches interpolation Used in [3, 5, 15, 31, 103, 107] Further improved EpicFlow interpolation Binary tree for patch matching Segmentation Fully-connected inference method Predicts optical flow and uncertainty Uses CRF to reduce patches search space Discrete coarse-to-fine

Builds on BM [11]

Limitation Sparse matches

DeepMatching [75], SED [18]

Input noise

EpicFlow

Strongly relies on (dense) input matches Handcrafted features for match Computationally expensive

EpicFlow EM

HS, FlowFields See [22] PatchMatch [7], EpicFlow

Small EPE improvement Semi-dense optical flow Small details are lost

262

3.1.1

S. Savian et al.

Variational Methods

One of the earliest class of optical flow2 estimation methods were variational approaches. This class of approaches computes optical flow as the minimizer of an energy functional. One of the most effective and most simple variational methods has been developed by HS [30]. By exploiting the brightness constancy equation (BCE) assumption and thus considering horizontal and vertical displacements to be sufficiently small, one can linearize the optical flow equation. Variational approaches, hence, estimate the (optical flow) interframe displacement w by minimizing E, Eq. 4 E(w) =

   D(u, v) + αS(u, v) dx dy

(4)



where D is a term that penalizes deviations of the data from the BCE and S is a smoothing term;  is the image region of locations x,y. All variational approaches aim at minimizing an energy function similar to E but exploiting different properties of the data, e.g., gradient constancy Brox and Malik (BM) [11], higher order derivatives [66], color model [60, 114]. Variational methods are biased towards the initialization which is usually the zero motion field because of all local minima, this approach selects the one with the smallest motion, which is not necessarily the correct solution [12].

3.1.2

Patch Based Methods

Methods based on Lucas Kanade (LK) approach work on the discrete domain: the two frames are divided into patches (regions) of fixed sizes, matched by minimizing a gradient. Accordingly, LK uses Newton–Raphson technique for cost minimization. Most of the coarse-to-fine methods apply variations of LK on the frame, at different levels of granularity by looking for correspondences, first on a more large area, and then moving to smaller patches; or alternatively, by producing a rough estimate of the optical flow by downsampling the frames (see Fig. 1). Patch based methods are biased by the motion of large scale structures. The coarse-to-fine approaches have the drawback of hardly detecting small fast moving objects when the motion of bigger structures (or camera motion) has an overall high magnitude. Similarly to variational approaches, patch based methods have been improved very much, where majority of the improvements involve a different computation of the descriptor function. A descriptor is a function which is applied to all patches, used to produce a vector of similarities. Scale invariant feature transform (SIFT) [49]

2 There

is a slight difference between optical flow and flow fields, however, in computer vision these two terms are usually used interchangeably [22].

Optical Flow Estimation with Deep Learning, a Survey on Recent Advances

263

Fig. 1 Example of three layers frame pyramids. Input frames Fn , Fn+1 are downsampled: L0 are the input frames at native resolution, L1 , L2 are the frames downsampled one and two times, respectively. The optical flow is iteratively refined starting from the most downsampled frames. The initial optical flow (OF) is computed first at the lowest resolution L2 (OFinit ), upsampled and used as initial estimate for the next higher resolution layer, refined (OFref ) and upsampled until the final high-resolution layers

and histogram of oriented gradients (HOG), as well as DAISY [90] are well-known descriptors which have been used in many computer vision field including optical flow estimation (e.g., [46]).

3.1.3

Patch Based with Variational Refinement

A first unifying method was proposed by Brox and Malik (BM) [12] representing a big shift in performance of optical flow estimation. Brox and Malik formulate the problem of optical flow estimation in a variational refinement, but introducing an additional energy term Edescriptors on SIFT and color. Many methods based on BM have a descriptor stage, a matching stage, and a (variational) refinement stage to interpolate the optical flow in a sparse to dense manner. For the matching stage, different techniques have been proposed over time (an overview can be found in [22]). In this part, we additionally review very recent

264

S. Savian et al.

developments of patch based with variational refinement (state-of-the-art at the time of publishing) that have been further improved by the partial integration of DL architectures, as described in Sect. 3.6. PatchMatch [7] is a general purpose computer vision algorithm to match arbitrary descriptors using K-nearest neighbors algorithm in a coarse-to-fine manner with random initialization. FlowFields [4] is similar to PatchMatch for the propagation stage, but uses a kd-tree (a specific type of binary tree) to compute initial matches. Also noticeable is DiscreteFlow [59] which computes patch similarities in the discrete domain, using DAISY descriptors to find pixelwise correspondences among neighboring frames, and processing the vector of similarities with a conditional random field (CRF), without coarse-to-fine optimization. Finally, FullFlow [15] also optimize an energy function as a CRF, over discrete regular grids. A noticeable different approach, inspired by the success of Brox and Malik, is DeepFlow [97]. DeepFlow is a variational optical flow with an additional loss function based on a deep matching algorithm on a classical variational framework which allows to integrate feature descriptors and matching. DeepFlow, is nonparametric and is based on a fine-to-coarse approach. Two consecutive frames are divided in four by four patches which are then convolved producing three dimensional response maps. The convolution operator outputs a stack of response maps providing higher values where patches have similar patterns. Subsequently, the obtained feature maps are convolved with larger patches to find coarser matches, with a structure akeen to CNN, but with no learnt parameters. This process is recursively applied to coarser patches. Finally all the feature maps, which are obtained by convolutions at different level of granularity, are processed: higher activation of the feature maps mean higher similarity among patches. A local maxima is computed to find dense matches (correspondences), which are then used to compute the optical flow. DeepFlow most important innovation is its matching algorithm, DeepMatching [75], which has been used as descriptor and matching algorithm. Compared to various classical methods mentioned using, i.e., SIFT and HOG, DeepMatch obtains similar performance for small displacements, while drastically outperforming classical methods on large displacements. For these reasons, many top performing methods make use of DeepMatch along with different variational refinement methods (see Fig. 2). EpicFlow is one of the recent methods for refinement and post-processing task [74] and it has been adopted by several works [3, 5, 15, 31, 103, 107]. EpicFlow is

DeepMatching

DeepFlow

EpicFlow

RichFlow

Fig. 2 Sparse to dense refinement methods applied to DeepMatching. DeepMatching algorithm is used to compute correspondences which can be refined by DeepFlow, EpicFlow, or RichFlow

Optical Flow Estimation with Deep Learning, a Survey on Recent Advances

265

built on DeepMatch and random forests; DeepMatch is used to compute matches, while structured forest (structured edge detectors SEDs [18]) are used to compute image edges, exploiting the local structure of edges by looking at the information gain of random forests. The additional edges information allows to further densify the sparse matches and improves the variational refinement energy function. The energy function is further improved by using geodesic distance instead of Euclidean distance, obtaining a more natural model for motion discontinuities (further details in the paper [74]). EpicFlow further improves DeepFlow on large discontinuities and occluded areas, nonetheless outperforms all state of the art (classical) coarse-to-fine approaches. Due to its performance, EpicFlow has been integrated in robust interpolation of correspondences for large displacement optical flow (RicFlow) [32]. RicFlow is based on Epicflow in the sense that uses DeepFlow and SED to compute the flow. In addition to EpicFlow, the input images were segmented in superpixel to improve the method over input noise and provide a better initialization. RicFlow was among the best performing state-of-the-art methods on Sintel clean pass at the time of publishing.

3.2 Deep Learning Approaches In the previous section, it has been shown that DeepFlow, one of the top performing handcrafted algorithms resembles a convolutional structure similar to deep learning models, but with no learnt parameters. Perhaps triggering a new line of research based on deep convolutional structures in the field of optical flow estimation. In this section, advances of models based on deep learning are reviewed (see Table 2).

3.2.1 Development of DL Based Optical Flow Estimation

Single and Stacked Architectures

The first deep learning optical flow architecture, FlowNet [19], was introduced by Fischer et al.; FlowNet directly estimates the optical flow using a generic CNN U-Net architecture [77]. Due to the lack of data with optical flow ground truth, the authors generated a new dataset, FlyingChairs [19], a synthetic dataset with optical flow ground truth. It consists of more than 20K image pairs and corresponding flow fields of 3D chair models moving with purely affine motion in front of random backgrounds. Such a dataset is necessary for network convergence, since CNNs typically have a very large number of trainable weights (tens of millions), requiring a considerable amount of training data to avoid overfitting. The original paper proposes two slightly different model architectures: FlowNetS (FlowNet Simple) and FlowNetC (FlowNet Correlation).


Table 2 Overview of deep learning methods

| Paper year Ref. | Contribution | Builds on | Limitation |
| ContinualFlow 2018 [61] | Top performing (Sintel final pass), occlusion estimates, multiple frame | PWC-net, GRU | Not top on Sintel clean |
| MFF 2018 [73] | Multiple frame | PWC-net | Not top on Sintel clean |
| FlowNet 2015 [19] | First DL model | U-net | Artifacts for small motions, oversmoothed flow |
| FlowNet2.0 2016 [35] | Stacked FlowNetS and FlowNetC, improved training schedule | FlowNetS, FlowNetC | High number of weights |
| FlowNet3 2018 [36] | FlowNetC with FWD/BCK consistency check | ResNet, FlowNet2.0 | High number of weights |
| FlowNetH 2018 [37] | Confidence measures | FlowNetS-C | Focus only on confidence estimates |
| SpyNet 2017 [70] | End-to-end coarse-to-fine on frames | Coarse-to-fine, U-net | Piece-wise training |
| PWC-Net 2018 [85] | Top on Sintel final^a, coarse-to-fine on features, cost volume layer | SpyNet, FlowNetC | Not top on Sintel clean |
| SegFlow 2017 [17] | Segmentation and optical flow estimation | FlowNetS, ResNet | Two times EPE w.r.t. first ranked |
| Xiang et al. 2018 [101] | Traditional priors on cost function | FlowNetS | Small improvement |
| Fang et al. 2018 [21] | Two branch CNN, lightweight | Coarse-to-fine approach | Not tested on Sintel final |

^a Top among methods working with frame pairs

FlowNetS consists of a CNN receiving the two RGB frames stacked as input and is trained in a supervised manner on the optical flow ground truth. Similarly, FlowNetC is also trained in a supervised manner on the ground truth, but instead of working with a stacked input, the two frames are fed to two identical branches which are merged at a later stage by a correlation layer. The correlation layer performs cross-correlation (a cost volume) between the feature maps of the two inputs, enabling the network to compare patches with no additional learnt parameters (at the correlation layer). Both networks upsample the feature maps with upconvolutions at the output side of the network to increase the resolution of the computed optical flow, which is degraded by the stacked convolutions and pooling layers on the contractive side of the network (see Fig. 3). The expanding part of the network is composed of "upconvolutional" layers: unpooling and deconvolution. There are four upconvolution layers in the refinement part and, for computational reasons, the flow is finally upsampled to full resolution by bilinear upsampling.
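The correlation operation described above can be sketched outside of any deep learning framework. The following NumPy function is an illustrative reimplementation, not the FlowNetC layer itself (which additionally strides and normalizes the output): it compares every location of one feature map with the locations of the other within a displacement range, producing a cost volume with one channel per displacement.

```python
import numpy as np

def correlation_layer(f1, f2, max_disp=4):
    """Cost volume between two feature maps of shape (C, H, W).
    For every pixel of f1, take the channel-wise dot product with the feature
    vector of f2 at every displacement (dy, dx) within +/- max_disp.
    Returns an array of shape ((2*max_disp+1)**2, H, W)."""
    C, H, W = f1.shape
    d = max_disp
    f2p = np.pad(f2, ((0, 0), (d, d), (d, d)))        # keep displaced lookups in bounds
    out = np.zeros(((2 * d + 1) ** 2, H, W), dtype=f1.dtype)
    k = 0
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = f2p[:, d + dy: d + dy + H, d + dx: d + dx + W]
            out[k] = (f1 * shifted).sum(axis=0) / C    # mean dot product over channels
            k += 1
    return out
```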


Fig. 3 FlowNetC architectures [19]

Skip connections are used to connect layers of the contractive part to the expanding (refinement) part, providing additional feature information at the upsampling stage. Data augmentation is very important for model generalization. Augmentation has been performed in place both on the image pair and on the ground-truth flow fields; it includes geometric transformations, Gaussian noise, multiplicative color changes, and additive brightness changes. Finally, the authors trained the network by minimizing the squared endpoint error (EPE). As mentioned previously, the EPE is the Euclidean distance between the network estimates and the ground-truth flow. For further details on EPE and other optical flow metrics please refer to Tu et al. [92]. Training on EPE is not optimal for small displacements, as the Euclidean distance only gives information on the error magnitude while omitting error direction information; however, it allows the architecture to perform well in case of large displacements such as in the Sintel benchmark. One important finding is that although FlowNetS and FlowNetC have been trained on synthetic data, they can also perform well on natural scenarios. However, the main drawback is the low accuracy in case of small and simple movements, which are exactly the conditions where traditional methods perform well. A second drawback is the over-smoothed flow fields produced and the fact that a variational post-processing stage is required, as noticed by Hui et al. [34].

To overcome FlowNet limitations, Ilg et al. proposed FlowNet 2.0 [35], which stacks FlowNetC, FlowNetS, and a newly designed FlowNet-SD (small displacement) network to perform the flow estimation. FlowNet-SD (see Fig. 4) has larger input feature maps and is trained on a dataset with small displacements, ChairsSDHom. Furthermore, after each subnetwork the flow is warped and compared with the second image, and the error is fed to a fusion network that takes as inputs the estimated flows, the flow magnitudes, and the brightness error after warping. The fusion network contracts and expands to full resolution, producing the final flow fields. Due to the large size of FlowNet 2.0, i.e., around 160 million parameters (see Table 3), its subnetworks have been trained sequentially, one subnetwork at a time while freezing the weights of the others.
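The warping step used between the stacked subnetworks can be written as backward bilinear sampling of the second image with the current flow estimate. The helper below is a hypothetical NumPy sketch of that operation, not the authors' implementation; the brightness error between such a warped image and the first frame is one of the inputs that FlowNet 2.0 feeds to its fusion network.

```python
import numpy as np

def warp_backward(img2, flow):
    """Backward-warp img2 (H, W) toward the first frame using flow (2, H, W),
    where flow[0] is the horizontal (u) and flow[1] the vertical (v) component:
    warped(y, x) = img2(y + v(y, x), x + u(y, x)), sampled bilinearly."""
    H, W = img2.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x_src = np.clip(xs + flow[0], 0, W - 1.001)
    y_src = np.clip(ys + flow[1], 0, H - 1.001)
    x0, y0 = np.floor(x_src).astype(int), np.floor(y_src).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x_src - x0, y_src - y0
    # bilinear interpolation of the four neighbouring pixels
    return ((1 - wy) * (1 - wx) * img2[y0, x0] + (1 - wy) * wx * img2[y0, x1]
            + wy * (1 - wx) * img2[y1, x0] + wy * wx * img2[y1, x1])
```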


Fig. 4 FlowNet 2.0 architecture [35], which consists of FlowNetS and FlowNetC stacked

Table 3 Supervised deep learning model comparison

| Principles | FlowNetS | FlowNetC | FlowNet2 | SpyNet | PWC-Net |
| Pyramid | – | 3-level | 3-level | Image | 6-level |
| Warping | – | – | Image | Image | Feature |
| Cost volume | – | Single level, large range | Single level, large range | – | Multi-level, small range |
| #parameters (M) | 38.67 | 39.17 | 162.49 | 1.2 | 8.75 |
| Memory (MB) | 154.5 | 156.4 | 638.5 | 9.7 | 41.1 |
| Forward (ms) | 11.40 | 21.69 | 84.80 | – | 28.56 |

Coarse-to-fine approaches require fewer parameters and lead to state-of-the-art performance. Data source [86]

Moreover, the authors generated a new dataset, ChairsSDHom [35], designed to make the networks robust to untextured regions and to produce flow magnitude histograms close to those of the UCF101 [83] dataset, which is composed of real sequences. A very important finding is the impact of the training schedule on network performance. Training solely on the more complex FlyingThings3D [55] (see Sect. 3.4) is worse than using the simpler FlyingChairs, and training on a mixture of FlyingChairs and FlyingThings3D also does not lead to better performance. The order in which the data is presented affects model accuracy; the best schedule is training on FlyingChairs and fine-tuning on FlyingThings3D. The subnetworks FlowNetS and FlowNetC also benefit from around a 20–30% improvement when trained with the above-mentioned schedule. The authors' conjecture is that FlyingChairs allows the network to learn color matching and that the refinement with FlyingThings3D improves performance under realistic scene lighting. FlowNet 2.0 outperforms EpicFlow and obtained state-of-the-art performance on the Sintel final pass at the time of publishing.

FlowNet and FlowNet 2.0 are important milestones of optical flow estimation and serve as building blocks for other methods. Ilg et al. modified FlowNetC to estimate confidence intervals on the estimated optical flow in [37]. Xiang et al. [101] use FlowNetS and add a multi-assumption loss function (brightness constancy, gradient constancy, and image-driven smoothness assumptions) to the expanding part during network training.


In contrast, FlowNet3 [36] further improves FlowNet 2.0 by removing the small displacement network and the explicit brightness error and by adding residual connections to the stack based on [65]. The authors also modified the stack, and in particular FlowNetC, to jointly compute forward and backward flow consistency and to estimate occlusions. They demonstrate that efficient occlusion estimates come at no extra cost.

Coarse-to-Fine Iterative Refinement

The first coarse-to-fine end-to-end approach is the spatial pyramid network (SpyNet) proposed by Ranjan and Black [70], which combines the classical coarse-to-fine approach for optical flow estimation with deep neural networks. At each level of the pyramid a CNN is trained independently, meaning that each level of the pyramid deals with motion within a certain range of displacement. This architecture allows SpyNet to perform well at different magnitudes of displacement. Since every level of the pyramid deals with a fixed range of motion, the final optical flow is produced by iteratively upsampling the coarse optical flow estimate and warping the higher resolution frames with it. Hence, in a pyramid with L levels, a one-pixel motion at the top (coarsest) level corresponds to 2^(L-1) pixels at full resolution [86] (see Fig. 1). Coarse-to-fine iteration benefits the estimation because there is no need for a full computation of the cost function, which is a bottleneck for real-time optical flow estimation. The authors show how SpyNet improves the results of FlowNet, which performs badly in case of small movements, while obtaining the same performance in case of large motion. Moreover, SpyNet has 96% fewer weights than FlowNet, consisting of "only" about one million parameters, one order of magnitude less than FlowNet and two less than FlowNet 2.0 (see Sect. 3.3 for further insights on the number of weights).

Following this line of research, Sun et al. [85] presented PWC-Net, a pyramidal coarse-to-fine CNN-based method for optical flow estimation. PWC-Net uses pyramids, warping, and a cost volume (see Fig. 5). The network is trained end-to-end in a similar manner as SpyNet, with some differences: (1) SpyNet warps frames with coarse estimates of the flow, while PWC-Net warps feature maps; (2) SpyNet feeds CNNs with frames, while PWC-Net inputs a cost volume; (3) PWC-Net data augmentation does not include Gaussian noise (more details in Sect. 3.5). PWC-Net image pyramids are end-to-end learnable and the cost volume is produced exploiting the FlowNetC correlation layer. Finally, PWC-Net uses a context network to exploit contextual information for refinement. PWC-Net outperformed all methods to date on the challenging Sintel final pass (Sect. 4.2). It is the first time that an end-to-end method outperforms well-engineered and fine-tuned traditional methods. At the time of writing, PWC-Net is still the top performing two-frame optical flow estimator on Sintel final and it is used as a building block for the top performing multi-frame optical flow estimators [61, 73]. Finally, inspired by PWC-Net, Fang et al. [21] and Hui et al. [34] proposed similar methods, but with lower performance.
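The coarse-to-fine loop shared by SpyNet and PWC-Net can be summarized schematically: start from a zero flow at the coarsest level, then at every finer level upsample and rescale the flow and estimate only a residual. The sketch below assumes a user-supplied per-level estimator `estimate_residual` (in SpyNet a small CNN trained for that level; in PWC-Net the inputs are feature maps and a cost volume instead of frames) and omits the warping step, which would use an operation like the backward-warping helper sketched earlier.

```python
import numpy as np

def upsample_flow(flow, scale=2):
    """Nearest-neighbour upsampling of a (2, H, W) flow field; vectors are
    multiplied by `scale` because pixels at the finer level move further."""
    return np.repeat(np.repeat(flow, scale, axis=1), scale, axis=2) * scale

def coarse_to_fine(frame1_pyr, frame2_pyr, estimate_residual):
    """frame*_pyr: lists of grayscale frames ordered from coarsest to finest,
    each level twice the resolution of the previous one.
    estimate_residual(f1, f2, init_flow) -> residual flow at that level."""
    H, W = frame1_pyr[0].shape
    flow = np.zeros((2, H, W))                  # zero initialization at the coarsest level
    for level, (f1, f2) in enumerate(zip(frame1_pyr, frame2_pyr)):
        if level > 0:
            flow = upsample_flow(flow)          # propagate the coarser estimate
        # in practice f2 is warped toward f1 with `flow` before this call
        flow = flow + estimate_residual(f1, f2, flow)
    return flow
```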


Fig. 5 PWC-net architecture [85]

3.2.2 Other Important DL Based Methods

Multi Objective

Optical flow is an important piece of information for motion segmentation and action recognition. Ilg et al. already show that the flow fields generated by FlowNet 2.0 match or outperform classical optical flow methods when plugged into the CNN temporal stream of Simonyan et al. [81]. This configuration achieves a high level of action recognition accuracy. State-of-the-art performance has also been obtained in motion segmentation by plugging the optical flow of FlowNet 2.0 into [39] and [64]. Moreover, Cheng et al. [17] further elaborate on the idea and propose a unified deep learning framework, called SegFlow, for the joint estimation of object segmentation and optical flow. The SegFlow architecture has two branches: FlowNetS for the optical flow and a residual CNN, ResNet-101 [29]. Feature maps are merged between the two branches at the final layers. Training is done iteratively: weights are initialized according to FlowNetS and ResNet-101. When optimizing the segmentation branch, the optical flow branch weights are frozen. The segmentation is trained on the DAVIS dataset [67], with additional affine data augmentation. Similarly, when training the optical flow, the segmentation branch is fixed and weights are only updated in the flow network using optical flow datasets with ground truth: Sintel, KITTI, Monkaa, and Driving (Sect. 3.4). Both networks benefit from each other and the results are state-of-the-art for both segmentation and optical flow estimation, with accuracy doubled compared to FlowNet.

Indirect-Supervised and Semi-Supervised Methods

Indirect-supervised approaches treat optical flow estimation as a frame reconstruction problem. Although these unsupervised methods do not need large annotated datasets as training samples, their overall accuracy is slightly inferior to that of the supervised approaches. These methods still need data to tune their weights, but they do not specifically need optical flow ground truth to model the optical flow. Instead, they use a proxy task, i.e., frame synthesis.


Table 4 Overview of indirect-supervised learning methods

| Paper year Ref. | Contribution | Builds on | Limitation |
| GeoNet 2018 [109] | Rigid and non-rigid motion | ResNet-50 | Automotive domain |
| Ahmadi et al. 2016 [1] | Photometric loss | HS, coarse-to-fine | Train on DeepFlow |
| Jason et al. 2016 [38] | Photometric loss and smoothness | FlowNetS | Automotive domain |
| MIND 2016 [48] | Analysis by synthesis | FlowNetS^a | Results only on Sintel train |
| Ren et al. 2017 [72] | Photometric loss | FlowNetS^a | Low performance |
| DenseNet 2017 [113] | Extends DenseNet | DenseNet [33] | Large memory footprint |
| Zhu et al. 2017 [112] | FlowFields proxy groundtruth | FlowFields | Train on FlowFields |
| Yang et al. 2018 [106] | FlowNet 2.0 proxy groundtruth | FlowNet 2.0 | Rely on FlowNet 2.0, explicit ground truth |
| Wulff et al. 2018 [99] | Fine-tune on groundtruth data | MIND^a | Uses groundtruth |
| UnFlow 2017 [57] | Occlusion estimates | FlowNetS-C | Results only on Sintel train |
| Ranjan et al. 2018 [71] | Multi-objective (segmentation) | FlowNetS | Custom split for evaluation |
| Lai et al. 2018 [43] | GAN applied to optical flow | FlowNetS | Uses groundtruth |
| TransFlow 2017 [2] | L1 norm (Charbonnier) | FlowNetS^a | Automotive domain |
| SfM-Net 2017 [93] | Depth, occlusion mask estimation, photometric error | FlowNetC | Not tested on Sintel |

^a Similar architecture

All of the methods below rely on frame warping, which is a differentiable operation and allows backpropagation for network tuning (see Table 4). Differences among indirect-supervised methods are rather small and mostly involve different constraints on the cost function, i.e., photometric or geometric error terms. A common optical flow estimation pipeline is thus: (1) let the network estimate the flow fields, (2) warp the frame with the flow fields, (3) measure the photometric loss between the synthesized frame and the real frame [99]. Ahmadi and Patras [1] presented a method for training a CNN for motion estimation on the UCF101 dataset [83] without explicit optical flow ground truth, instead exploiting the optical flow equation, Eq. 3, similarly to traditional HS. The architecture proposed is very similar to FlowNetS and has been trained on the real scenario dataset UCF101. Performance on Sintel is very close to FlowNetS and FlowNetC.
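The three-step pipeline above translates into a compact proxy objective. The sketch below is a generic illustration, not any specific paper's loss: it assumes the second frame has already been backward-warped with the estimated flow, and combines a Charbonnier photometric term with a first-order smoothness term; no optical flow ground truth appears anywhere.

```python
import numpy as np

def charbonnier(x, eps=1e-3):
    """Robust penalty rho(x) = sqrt(x^2 + eps^2), a differentiable L1 surrogate."""
    return np.sqrt(x ** 2 + eps ** 2)

def unsupervised_loss(frame1, frame2_warped, flow, smooth_weight=0.1):
    """Proxy loss used by photometric approaches: a data term between the first
    frame and the second frame warped by the estimated flow, plus a first-order
    smoothness term on the (2, H, W) flow field."""
    photometric = charbonnier(frame1 - frame2_warped).mean()
    du = np.diff(flow, axis=2)          # horizontal flow gradients
    dv = np.diff(flow, axis=1)          # vertical flow gradients
    smoothness = charbonnier(du).mean() + charbonnier(dv).mean()
    return photometric + smooth_weight * smoothness
```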


Yu et al. [38] also design a network similar to FlowNetS, with a photometric loss function. Zhu et al. [112] also use FlowNetS trained with a photometric loss, but initialize the learning on proxy ground truth provided by FlowFields [4]. Ren et al. [72] also train FlowNetS on the frame interpolation error, but with an additional non-linear data term. Niklaus et al. [63] jointly perform interpolation and flow estimation, with results comparable to FlowNetS. Long et al. [48] train a CNN for optical flow estimation by interpolating frames, with some further differences. A U-Net is trained to synthesize the middle frame in the training phase. Afterwards, the frame correspondences are obtained directly through backpropagation over the same network, a process called analysis by synthesis [108]. The network uses triplets of consecutive frames: the first and last are used as input and the middle frame serves as ground truth. The network output is two matrices of gradients with respect to the input; the gradients are obtained from the network through backpropagation, which produces sensitivity maps for each interpolated pixel. However, the backpropagation pass is computationally expensive and, especially in unstructured or blurry regions, the derivatives are not necessarily well located. Zhu and Newsam [113] extend the DenseNet architecture [33] by adding dense connectivity to the FlowNetS layers; however, the network accuracy is about two times worse than that of the original FlowNetS. Meister et al. [57] use a loss based on the census transform [28] and check forward and backward flow consistency, explicitly integrating occlusion reasoning.

Other approaches instead use geometric reasoning for self-supervision. Alletto et al. [2] trained a network to estimate a global homography and a second network to estimate the residual flow after warping with the homography. The method has been validated on KITTI, with performance similar to FlowNetS. Vijayanarasimhan et al. presented SfM-Net [93], which decomposes scene motion in terms of scene and object depth, camera motion, and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, and converts those into motion fields. Wulff et al. [99] noticed that the flow fields cannot always be learnt from the photometric error, due to untextured regions and lack of context information. They trained a temporal interpolation network on frame synthesis over a large set of videos without involving any prior assumption and fine-tuned the network on ground-truth data from KITTI and Sintel. The explicit use of ground-truth data drastically improves performance, and the architecture outperforms FlowNetS and SpyNet, at the cost of not being fully unsupervised. Similar to self-supervision, Lai et al. [43] use a GAN with a discriminator network trained on optical flow ground truth. The discriminator provides an adversarial loss that learns the structural patterns of the flow warp error without making assumptions on brightness constancy or spatial smoothness. Once the discriminator is trained, the flow network can be trained on any dataset, the adversarial loss providing the signal for unsupervised training. Yin et al. presented GeoNet [109], which exploits geometric relationships extracted from predictions of depth and of rigid and non-rigid scene parts, combined into an image reconstruction loss. It separates static and dynamic scene parts: depth maps and camera poses are regressed and fused to produce the rigid flow.


The second stage is fulfilled by ResFlowNet, a residual flow network that uses the output of the rigid structure reconstructor to predict the residual flow of dynamic objects. The final flow is a combination of the rigid and non-rigid estimated flows with an additional geometric constraint. Similarly to GeoNet, Ranjan et al. [71] propose a framework for the estimation of depth, camera motion, optical flow, and segmentation using neural networks that act as adversaries, competing to explain pixels that correspond to static or moving regions, and as collaborators through a moderator network that assigns pixels to be either static or dynamic. This method and GeoNet are among the best unsupervised methods; however, their performance is not comparable to that of supervised and classical methods, and their benchmarks are mostly from the automotive domain, e.g., KITTI.
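The rigid-flow component used by such methods follows directly from pinhole geometry: backproject each pixel with its predicted depth, apply the predicted relative camera pose, and reproject. The following is a minimal sketch under standard pinhole assumptions, not the GeoNet implementation; the intrinsics K and the pose (R, t) are taken as given and the function name is illustrative.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow induced purely by camera motion for a static scene.
    depth: (H, W) depth of frame 1; K: 3x3 intrinsics;
    R, t: rotation (3x3) and translation (3,) from frame 1 to frame 2."""
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)  # homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                # backproject to 3D
    cam2 = R @ cam + t.reshape(3, 1)                                   # move points to frame 2
    proj = K @ cam2
    uv = proj[:2] / proj[2:3]                                          # perspective divide
    flow = uv - pix[:2]                                                # per-pixel displacement
    return flow.reshape(2, H, W)
```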

3.3 Deep Learning Networks Comparison

One of the challenges of deep learning models is to limit the number of network parameters in order to avoid overfitting and reduce the memory footprint. Unlike stacked approaches, the deep coarse-to-fine SpyNet and PWC-Net do not need to deal with large motions explicitly, thanks to the image pyramid for the former and the feature pyramid for the latter. It has been shown that coarse-to-fine image and feature pyramids require fewer weights and, at the same time, lead to state-of-the-art performance for PWC-Net. LiteFlowNet and other minor coarse-to-fine models have not been included as they are neither the first published nor the top performing methods. Finally, all deep learning methods can show unexpected behaviors. For example, Savian et al. [79] have shown that FlowNetC and PWC-Net suffer from sign imbalance, i.e., they estimate the optical flow with different accuracy depending on motion direction and orientation.

3.4 Optical Flow Datasets

It has already been discussed that it is difficult to obtain proper data for training deep optical flow models. For this reason, a fundamental contribution of the FlowNet and FlowNet 2.0 approaches is the (computer graphics) datasets that have been released to train the networks: FlyingChairs and FlyingThings3D (see Sect. 3.2.1). While very useful, it is still not clear how to generate more data that generalizes well to real-world videos. In a recent follow-up paper, Mayer et al. [56] perform an in-depth analysis of the characteristics of good training datasets, as research is shifting from proposing models to generating abundant data for supervised learning. The authors report several findings: (1) artificially rendered data can generalize well to real videos, (2) when training with a single dataset, complex lighting and post-processing effects worsen the performance, (3) training on different datasets with an increasing level of complexity leads to the best performance.


In the following, we briefly describe the training datasets that, to the best of our knowledge, are the largest with dense optical flow ground truth:

• FlyingChairs [19] is a synthetic dataset which consists of more than 22K image pairs and their corresponding flow fields. Images show renderings of 3D chair models moving in front of random backgrounds from Flickr.3 Motions of both the chairs and the background are purely planar. FlyingChairs2 contains additional minor modalities [36].
• ChairsSDHom [35] is a synthetic dataset of image pairs with optical flow ground truth. ChairsSDHom is a good candidate for training networks on small displacements; it is designed to train networks on untextured regions and to produce flow magnitude histograms close to those of the UCF101 dataset. ChairsSDHom2 contains additional minor modalities which are discussed in [36].
• FlyingThings3D [55] is a dataset of image pairs rendered from 3D models (randomly shaped polygons and ellipses) with a simple but structured background. Foreground objects follow linear trajectories plus additional non-rigid deformation in 3D space.
• Sintel has 1K training frames drawn from the entire video sequence of an open source movie. Sintel is not sufficient to train a network from scratch [19]; however, it can be used for fine-tuning in the context of deep learning. This dataset is mostly used as a challenging benchmark for the evaluation of large displacement optical flow (Sect. 4.1).
• Monkaa [55] contains 8.5K frames and is drawn from the entire video sequence of a cartoon, which is similar to Sintel but more challenging. Monkaa contains articulated non-rigid motion of animals and complex fur.
• Playing for Benchmarks [76] is based on more than 250K high-resolution video frames, all annotated with ground-truth data for both low-level and high-level vision tasks, including optical flow. Ground-truth data (for a variety of tasks) is available for every frame. The data was collected while driving, riding, and walking a total of 184 km in diverse ambient conditions in a realistic virtual world.
• KITTI2012 [24] contains almost 200 frames of stereo videos of road scenes from a calibrated pair of cameras and LiDAR mounted on a car. While the dataset contains real data, the acquisition method restricts the ground truth to static parts of the scene; thus the main motion is given by the ego-motion of the camera [55]. KITTI2015 [58] (800 frames) is obtained by fitting 3D models of cars to the point clouds. However, the ground truth optical flow is sparse.
• Driving [55] contains more than 4K frames of virtual scenes in a naturalistic, dynamic street setting from the viewpoint of a driving car, made to resemble the KITTI datasets.

3 https://www.flickr.com/.


It is worth noting that FlyingChairs and FlyingThings3D contain well textured backgrounds. Through ablation studies, Mayer et al. [56] discovered that background textures help methods to perform better on unseen datasets and yield the best results on Sintel, even though the motion they have been trained on is unnatural. The mentioned datasets are large enough to train deep CNNs with just some additional data augmentation. In contrast, Monkaa contains very difficult motion over repetitive and monotonous textures, which has been found to be counterproductive for training.

3.5 Training Schedule and Data Augmentation

In Sect. 3.2.1, it has been shown that just by retraining FlowNetC with a new schedule, on FlyingChairs followed by the more refined FlyingThings3D, it is possible to improve its performance by 20–30%, underlining the importance of the training schedule. Sun et al. [86] further demonstrate this concept by retraining PWC-Net and FlowNetC. They further increase the accuracy of PWC-Net by 10% and show that it is possible to improve FlowNetC by 56% solely by retraining the network with their new training schedule and a smoother data augmentation (i.e., no additive Gaussian noise), outperforming the more complex FlowNet2 by 5%. These results show that a good model can perform poorly if trained improperly, meaning that a fair comparison of deep learning models should use the same training datasets and schedules in order to disentangle the model and training data contributions.
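A practical subtlety of optical flow data augmentation, hinted at in Sect. 3.2.1, is that geometric transforms must be applied consistently to the image pair and to the ground-truth flow, and the flow vectors themselves must be transformed. A minimal example for horizontal flipping is shown below (an illustrative helper; scalings and rotations require the corresponding rescaling or rotation of the vectors).

```python
import numpy as np

def hflip_sample(frame1, frame2, flow):
    """Horizontally flip an optical flow training sample.
    Both frames and the flow field are mirrored, and the horizontal flow
    component changes sign (a pixel that moved right now moves left)."""
    frame1_f = frame1[:, ::-1].copy()
    frame2_f = frame2[:, ::-1].copy()
    flow_f = flow[:, :, ::-1].copy()     # flow has shape (2, H, W)
    flow_f[0] = -flow_f[0]               # negate the u (horizontal) component
    return frame1_f, frame2_f, flow_f
```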

3.6 Hybrid Methods

This class of optical flow estimation techniques integrates end-to-end learnt approaches with traditional architectures (see Table 5). Two main branches of hybrid methods can be identified:

• deep feature based: the flow estimation pipeline is partially integrated with CNNs. In this context the cost volume, the matching, or the descriptors are obtained by deep learning, while the other building blocks are traditional, e.g., variational refinement.
• scene understanding: CNNs are used to differentiate frame regions based on object properties or semantics. This information is integrated with prior knowledge on the motion field, e.g., motion is prominent on foreground objects while the background has a smoother and more linear motion.


Table 5 Overview of hybrid learning methods

| Paper year Ref. | Contribution | Builds on | Limitation |
| DCFlow 2017 [103] | CNN to produce cost function | EpicFlow | Long inference |
| PatchMatch 2017 [31] | Siamese networks with new loss function | FlowFields | Long inference |
| SOF 2016 [80] | Semantic segmentation; different models for different layers | DiscreteFlow, DenseFlow [84] | Not tested on Sintel |
| MR-Flow^a 2017 [100] | CNN produces rigidity score; iterative refinement | DiscreteFlow | Long inference |
| Guney et al. 2016 [27] | Local and context siamese networks | DiscreteFlow | Piece-wise training |
| Bai et al. 2016 [3] | Siamese CNN; exploits segmentation; epipolar flow | SOF, EpicFlow for refinement | Automotive domain |
| PatchBatch 2016 [23] | Siamese CNN | PatchMatch, EpicFlow | Long inference |
| Behl et al. 2017 [8] | Stereo frames; CNNs, CRF | Tatarchenko et al. [89] | Automotive domain |
| Maurer et al. 2018 [53] | CNN trained during inference, multiframe, best on Sintel clean | CPM [31], RichFlow | Long inference |

^a Top among methods working with frame pairs

3.6.1 Feature Based

Hybrid deep learning patch based methods make use of learned matching functions [51, 110, 111]. These architectures have been adopted to extract and match descriptors for optical flow. The most relevant examples are PatchBatch [23], Deep DiscreteFlow [27], DCFlow [103], and Exploiting Semantic Information and Deep Matching for Optical Flow [3] (which also integrates semantic information and is discussed in Sect. 3.6.2). These methods exploit learned matching functions which are then integrated into handcrafted methods. As discussed in Sect. 3.1.2, different approaches have been developed to obtain descriptors and aggregate information from local matches. However, handcrafted patch based optical flow estimators are limited by the computational cost of computing a 4D cost volume [15] or by the number of pixelwise flow proposals at the initialization stage [7, 59]. In these cases, integrating learned convolutional networks into the handcrafted pipeline leads to better accuracy and orders-of-magnitude faster inference. DCFlow [103] is inspired by Chen and Koltun [15]: a four-layer CNN embeds every pixel into a feature space in which matching scores are computed by a simple inner product, producing the cost volume, and the result is refined by EpicFlow post-processing. Learning the feature embedding and the matching with a CNN makes the method more resilient to changes in patch appearance and makes a large search space computationally feasible.


Moreover, the dimensionality of the feature space offers a tradeoff between computational cost and performance while requiring drastically fewer parameters (around 130K) compared with fully learned methods. Deep DiscreteFlow independently trains a context network with a large receptive field on top of a local network, using dilated convolutions on patches. It performs feature matching by comparing points on a regular grid in the reference image to every pixel in the target image, yielding a large tensor of forward matching costs; similarly to DiscreteFlow, a CRF is used for flow refinement. ProFlow is also a multi-frame method and is discussed in Sect. 3.7.
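The common thread of these feature-based hybrids is that, once every pixel has a learned embedding, matching costs reduce to inner products. The sketch below illustrates the idea in the spirit of the grid-to-dense matching just described; the embedding network that produces the per-pixel features is assumed and not shown, and the cosine-based cost is an illustrative choice.

```python
import numpy as np

def grid_matching_costs(feat1, feat2, stride=4):
    """feat1, feat2: per-pixel embeddings of shape (C, H, W) produced by some
    (unspecified) feature network. Points on a regular grid of the reference
    image are compared with every pixel of the target image via inner products,
    yielding a (num_grid_points, H*W) matching-cost tensor."""
    C, H, W = feat1.shape
    grid = feat1[:, ::stride, ::stride].reshape(C, -1)        # (C, G) grid embeddings
    allpix = feat2.reshape(C, -1)                             # (C, H*W) target embeddings
    # normalize so the inner product becomes a cosine similarity in [-1, 1]
    grid = grid / (np.linalg.norm(grid, axis=0, keepdims=True) + 1e-8)
    allpix = allpix / (np.linalg.norm(allpix, axis=0, keepdims=True) + 1e-8)
    similarity = grid.T @ allpix                              # (G, H*W)
    return 1.0 - similarity                                   # turn similarity into a cost
```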

[Figure: lineage of hybrid feature-based methods, pairing handcrafted pipelines with their deep counterparts: PatchMatch and PatchBatch; FlowFields and FlowFields CNN; FullFlow and DCFlow; DiscreteFlow and Deep DiscreteFlow; CPM + RichFlow + Maurer et al. and ProFlow; Yamaguchi et al. + EpicFlow and ESIDM]

3.6.2 Domain Understanding

This section refers to methods which exploit high level semantics of the scene to obtain prior information on the optical flow. These methods classify the scene into different regions of similar motion and apply an optimized optical flow model to each region depending on its motion characteristics. These models are known in the literature as "layered models". A good example is optical flow with semantic segmentation and localized layers (SOF) [80]. The authors classify scenes into "Things" (rigid moving objects), "Planes" (planar background), and "Stuff". A different model is then adapted to each of the three classes to refine DiscreteFlow (Sect. 3.1.2). Segmentation is performed with a CNN, using DeepLab [16]. The focus is on the estimation of "Things", applying layered optical flow only in the regions of interest ("Things" can be considered foreground). SOF is based on [84], a primer for the idea of embedding semantic information into flow estimation. A similar idea was proposed by Tsai et al. [91], using fully connected models for segmentation jointly with a variational approach for optical flow; however, their evaluation of optical flow is limited to frame interpolation and the segmentation dataset is limited.


Instead, Bai et al. [3] exploit semantic information along with deep siamese networks to estimate matches and thus the optical flow. This method is neither fully end-to-end nor fully handcrafted: it uses siamese networks to compute the optical flow on the foreground and a patch based epipolar method [104] to compute the optical flow on the background. In this way, the authors exploit siamese CNNs for patch extraction and matching in areas with complex movement, where neighboring frames are fed to each branch of a siamese network to extract features, and the two siamese branches are then combined with a product layer to generate a matching score for each possible displacement. For the background, the authors use handcrafted methods, which perform better on small and simple motion. This is an important contribution, as the integration of learnt functions along with handcrafted features allows the method to overcome the weaknesses of both traditional methods (complex movements) and DL based methods (small displacements). However, this method has been developed to work in the context of autonomous driving, where the scene is typically composed of a static background and a small number of traffic participants which move "rigidly". Finally, Behl et al. [8] exploit semantic cues and geometry to estimate the rigid motion between frames more robustly, leading to improved results compared to all baselines. CNNs are trained on a newly annotated dataset of stereo images and integrated into a CRF-based model for robust 3D scene flow estimation; this work obtains the lowest outlier percentage on KITTI2015 for non-occluded regions. Similarly to Bai et al. [3], Wulff et al. presented MR-Flow (mostly rigid flow) [100], which uses a CNN to produce a semantic rigidity probability score across different regions, also taking into account that some objects are more likely to move than others. This score is combined with additional motion cues to obtain an estimate of rigid and independently moving regions. A classical unconstrained flow method is used to produce a rough flow estimate. After that, the information on rigid structures and the initial optical flow are iterated and jointly optimized. Currently MR-Flow ranks first on the Sintel clean pass (see Sect. 4.1).

3.7 Multi-Frame Methods

To the authors' knowledge, multi-frame methods have been explored since 1991 with the work of Black and Anandan [9]. Currently, multi-frame methods are the top performing methods on the Sintel final pass benchmark and the second best on the Sintel clean pass. As already mentioned, for the final pass the leading methods are based on PWC-Net and thus fully learnable: Neoral et al. [61] and Ren et al. [73]. ProFlow by Maurer et al. [53] obtains the second best score on the Sintel clean pass. ProFlow is based on coarse-to-fine PatchMatch (CPM) [31] and RichFlow, with an additional refinement of the matches [54]. Finally, ProFlow uses a CNN trained online (during the estimation) on the forward and backward flow to obtain a sparse-to-dense motion field. The model is learnt in place, which makes this method quite different from the others in this chapter.


4 Discussion

4.1 Flow Estimation Benchmarks and Performance Assessment

Optical flow evaluation is a difficult task for many reasons: optical flow ground truth is difficult to obtain for real scenarios, and artificial scenes might not be as challenging as natural videos. Only a few optical flow benchmark datasets exist. The Middlebury benchmark [6] is composed of sequences partly made of smooth deformations, but also involving motion discontinuities and motion details. Some sequences are synthetic, and others were acquired in a controlled environment, allowing ground truth to be produced for real scenes. However, the dataset is limited to a few sequences and its challenges have been almost completely overcome by modern methods [22]. For this reason, a new dataset has been generated: the MPI-Sintel evaluation benchmark [14]. Sintel is drawn from a short computer rendered movie and counts around 1500 frames with optical flow ground truth. Two-thirds of the dataset is given for training and the rest is used for evaluation. Sintel is a challenging benchmark including fast motion, occlusions, and non-rigid objects. Further optical flow benchmarks have been released in parallel: KITTI 2012 [24], which consists of a moving camera in static scenes, KITTI 2015 [58], extended to dynamic scenes, large motion, illumination changes, and occlusions, and the HD1K dataset [41]; however, they are tailored to the automotive domain. Thus, for applications not related to the automotive domain the most common benchmark is Sintel.

Error measures such as the photometric error on frame interpolation sequences can be misleading, as a low photometric error does not necessarily correspond to a low optical flow error (see Sect. 3.2.2). Moreover, optical flow estimation faces several different challenges: small displacements, large displacements, light changes, and occlusions [98]. To correctly assess performance all these factors must be taken into account, but this is hard to capture with a single metric. Thus, performance is measured using different metrics: (1) EPE all; (2) EPE matched (EPE on non-occluded regions); (3) EPE unmatched; (4) d0–10, d10–60, d60–140, the average endpoint error in regions within the indicated distance (in pixels) from the nearest occlusion boundary, taking only matched pixels into account; (5) s0–10, s10–40, s40+, the average endpoint errors in regions moving within the specified speed range (pixels per frame). The overall ranking is a combination of the previous metrics, evaluated both for the "clean" pass (no change in light) and the "final" pass (change in light, strong atmospheric effects, motion blur, camera noise).4
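All of the EPE-based metrics above derive from the same per-pixel endpoint error, i.e., the Euclidean distance between estimated and ground-truth flow vectors, averaged over different pixel subsets. A minimal sketch is given below; the occlusion mask and the speed bins follow the convention described above, while the d-bins would additionally require a distance-to-occlusion-boundary map and are omitted.

```python
import numpy as np

def epe_map(flow_est, flow_gt):
    """Per-pixel endpoint error: Euclidean distance between the estimated and
    ground-truth flow vectors. Flows have shape (2, H, W)."""
    return np.sqrt(((flow_est - flow_gt) ** 2).sum(axis=0))

def sintel_style_metrics(flow_est, flow_gt, occlusion_mask=None):
    """EPE all / matched / unmatched plus speed-binned EPE (s0-10, s10-40, s40+).
    occlusion_mask: boolean (H, W), True where the pixel is occluded."""
    err = epe_map(flow_est, flow_gt)
    speed = np.sqrt((flow_gt ** 2).sum(axis=0))       # ground-truth motion magnitude
    metrics = {"EPE all": err.mean()}
    if occlusion_mask is not None:
        metrics["EPE matched"] = err[~occlusion_mask].mean()
        metrics["EPE unmatched"] = err[occlusion_mask].mean()
    for name, lo, hi in [("s0-10", 0, 10), ("s10-40", 10, 40), ("s40+", 40, np.inf)]:
        sel = (speed >= lo) & (speed < hi)
        if sel.any():
            metrics[name] = err[sel].mean()
    return metrics
```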

4 An in-depth explanation of how the Sintel dataset was generated is given in [14] and [13].


4.2 Optical Flow Estimation Ranking

As explained above, it is very difficult to rank optical flow estimation methods, because performance cannot be accurately assessed by a single metric or a single scenario. For this reason, we believe it is more useful to cluster methods based on their application domain and, in this fast changing research field, to underline the characteristics that give substantial improvements. Currently, the best optical flow methods for the Sintel clean pass are deep learning hybrid methods exploiting domain knowledge (see Sect. 3.6.2); the best method is MR-Flow, followed by the hybrid multi-frame ProFlow. Sun et al. [86] conjecture that this is because they exploit traditional methods to refine motion boundaries, which are perfectly aligned in the clean pass. In contrast, on the Sintel final pass the post-processing effects cause severe problems to existing traditional methods. In this challenging situation, deep learning coarse-to-fine methods (see Sect. 3.2.1) obtain the best performance. PWC-Net is the top performing two-frame optical flow method, outperformed only by multi-frame methods applying PWC-Net to multiple frames. Moreover, it is believed that the final pass is more challenging and realistic because it is corrupted by motion blur, atmospheric changes, and noise. However, PWC-Net, like other coarse-to-fine methods, may fail on small and rapidly moving objects due to the coarse-to-fine refinement (see Sect. 3.1.2).

4.3 Biometrics Applications of Optical Flow

There are several applications of optical flow estimation in the biometrics field. In this section, we briefly discuss the most notable ones. In [82] optical flow estimation has been adopted for the face recognition task, with promising results, e.g., for the particular sub-task of distinguishing real people from their images. Human pose estimation and tracking is another application where optical flow has been applied and has shown promising results [102]. It has been shown that the use of optical flow, based on pose propagation and similarity measurement, can result in substantially superior outcomes compared to baselines. The task of single person pose estimation has been addressed by several researchers, and pose estimation accuracy has been enhanced by considering optical flow [69]. In a more recent work [20], the authors extended this research line and addressed a more difficult task, i.e., multi-people tracking (MPT). Another example of an application of optical flow in biometrics is action recognition. In a recent work [62], the authors jointly estimated the optical flow while recognizing actions with convolutional neural networks, capturing both appearance and motion in a single model. The results show that this model significantly improves action recognition accuracy in comparison to the baseline.


In [52] the authors proposed a multi-task CNN model that receives as input a sequence of optical flow channels and uses them for computing several biometric features (such as identity, gender, and age). It is worth noting that these are only some examples of the applications of optical flow estimation in biometrics. While we have mentioned some important works, further works keep appearing that use more advanced optical flow estimation techniques for particular applications in biometrics.

5 Conclusion

In this book chapter, we have provided a survey of the state-of-the-art in optical flow estimation with a focus on deep learning (DL) methods. We have conducted a comprehensive analysis and classified a wide range of techniques along an identified, descriptive, and discriminative dimension, i.e., whether the techniques are based on DL or are traditional handcrafted methods. We have reported the similarities and differences between DL and traditional methods. We believe that this systematic review of optical flow estimation can help to better understand and use the methods, hence providing a practical resource for practitioners and researchers in the field of biometrics. In addition, we have described and listed the datasets for optical flow estimation commonly employed by the research community. It is worth noting that optical flow estimation is a mature, yet still growing, research field and can be seen as a multi-disciplinary area. This research area partially overlaps with a broad range of topics, such as signal processing, computer vision, and machine learning. Hence, our book chapter can by no means be all-inclusive; indeed, it focuses mainly on DL methods that are of practical importance for biometric research.

References 1. A. Ahmadi, I. Patras, Unsupervised convolutional neural networks for motion estimation, in 2016 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2016), pp. 1629–1633 2. S. Alletto, D. Abati, S. Calderara, R. Cucchiara, L. Rigazio, TransFlow: unsupervised motion flow by joint geometric and pixel-level estimation (2017), arXiv preprint arXiv:1706.00322 3. M. Bai, W. Luo, K. Kundu, R. Urtasun, Exploiting semantic information and deep matching for optical flow, in European Conference on Computer Vision (Springer, Berlin, 2016), pp. 154–170 4. C. Bailer, B. Taetz, D. Stricker, Flow fields: dense correspondence fields for highly accurate large displacement optical flow estimation, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4015–4023 5. C. Bailer, K. Varanasi, D. Stricker, CNN-based patch matching for optical flow with thresholded hinge embedding loss, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (2017), p. 7


6. S. Baker, D. Scharstein, J.P. Lewis, S. Roth, M.J. Black, R. Szeliski, A database and evaluation methodology for optical flow. Int. J. Comput. Vis. 92(1), 1–31 (2011) 7. C. Barnes, E. Shechtman, A. Finkelstein, D.B. Goldman, Patchmatch: a randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 28(3), 24 (2009) 8. A. Behl, O.H. Jafari, S.K. Mustikovela, H.A. Alhaija, C. Rother, A. Geiger, Bounding boxes, segmentations and object coordinates: how important is recognition for 3D scene flow estimation in autonomous driving scenarios?, in International Conference on Computer Vision (ICCV), vol. 6 (2017) 9. M.J. Black, P. Anandan, Robust dynamic motion estimation over time, in CVPR, vol. 91 (1991), pp. 296–203 10. J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a “siamese” time delay neural network, in Advances in Neural Information Processing Systems (1994), pp. 737–744 11. T. Brox, C. Bregler, J. Malik, Large displacement optical flow, in IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009 (IEEE, Piscataway, 2009), pp. 41–48 12. T. Brox, J. Malik, Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. 33(3), 500–513 (2011) 13. D. Butler, J. Wulff, G. Stanley, M. Black, MPI-Sintel optical flow benchmark: supplemental material, in MPI-IS-TR-006, MPI for Intelligent Systems (2012). Citeseer 14. D.J. Butler, J. Wulff, G.B. Stanley, M.J. Black, A naturalistic open source movie for optical flow evaluation, in European Conference on Computer Vision (Springer, Berlin, 2012), pp. 611–625 15. Q. Chen, V. Koltun, Full flow: optical flow estimation by global optimization over regular grids, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4706–4714 16. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018) 17. J. Cheng, Y.-H. Tsai, S. Wang, M.-H. Yang, SegFlow: joint learning for video object segmentation and optical flow, in 2017 IEEE International Conference on Computer Vision (ICCV) (IEEE, Piscataway, 2017), pp. 686–695 18. P. Dollár, C.L. Zitnick, Structured forests for fast edge detection, in 2013 IEEE International Conference on Computer Vision (ICCV) (IEEE, Piscataway, 2013), pp. 1841–1848 19. A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox, FlowNet: learning optical flow with convolutional networks, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 2758– 2766 20. M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, R. Cucchiara, Learning to detect and track visible and occluded body joints in a virtual world (2018), arXiv preprint arXiv:1803.08319 21. M. Fang, Y. Li, Y. Han, J. Wen, A deep convolutional network based supervised coarse-tofine algorithm for optical flow measurement, in 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP) (IEEE, Piscataway, 2018), pp. 1–6 22. D. Fortun, P. Bouthemy, C. Kervrann, Optical flow modeling and computation: a survey. Comput. Vis. Image Underst. 134, 1–21 (2015) 23. D. Gadot, L. Wolf, Patchbatch: a batch augmented loss for optical flow, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4236–4245 24. A. 
Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013) 25. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems (2014), pp. 2672–2680


26. I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio, Deep Learning, vol. 1 (MIT Press, Cambridge, 2016) 27. F. Güney, A. Geiger, Deep discrete flow, in Asian Conference on Computer Vision (Springer, Cham, 2016), pp. 207–224 28. D. Hafner, O. Demetz, J. Weickert, Why is the census transform good for robust optic flow computation?, in International Conference on Scale Space and Variational Methods in Computer Vision (Springer, Berlin, 2013), pp. 210–221 29. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778 30. B.K.P. Horn, B.G. Schunck, Determining optical flow. Artif. Intell. 17(1–3), 185–203 (1981) 31. Y. Hu, R. Song, Y. Li, Efficient coarse-to-fine patchmatch for large displacement optical flow, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 5704–5712 32. Y. Hu, Y. Li, R. Song, Robust interpolation of correspondences for large displacement optical flow, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 4791–4799. https://doi.org/10.1109/CVPR.2017.509 33. G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in CVPR, vol. 1 (2017), p. 3 34. T.-W. Hui, X. Tang, C.C. Loy, LiteFlowNet: a lightweight convolutional neural network for optical flow estimation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 8981–8989 35. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: evolution of optical flow estimation with deep networks, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (2017) 36. E. Ilg, T. Saikia, M. Keuper, T. Brox, Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation, in European Conference on Computer Vision (ECCV) (2018) 37. E. Ilg, O. Ciçek, S. Galesso, A. Klein, O. Makansi, F. Hutter, T. Brox, Uncertainty estimates and multi-hypotheses networks for optical flow, in European Conference on Computer Vision (ECCV) (2018) 38. J.Y. Jason, A.W. Harley, K.G. Derpanis, Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness (2016), arXiv preprint arXiv:1608.05842 39. M. Keuper, B. Andres, T. Brox, Motion trajectory segmentation via minimum cost multicuts, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 3271– 3279 40. G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in ICML Deep Learning Workshop, vol. 2 (2015) 41. D. Kondermann, R. Nair, K. Honauer, K. Krispin, J. Andrulis, A. Brock, B. Gussefeld, M. Rahimimoghaddam, S. Hofmann, C. Brenner, et al., The HCi benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2016), pp. 19–28 42. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012), pp. 1097– 1105 43. W.-S. Lai, J.-B. Huang, M.-H. Yang, Semi-supervised learning for optical flow with generative adversarial networks, in Advances in Neural Information Processing Systems (2017), pp. 354–364 44. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. 
IEEE 86(11), 2278–2324 (1998) 45. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436 (2015) 46. C. Liu, J. Yuen, A. Torralba, SIFT flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 978–994 (2011)


47. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3431–3440 48. G. Long, L. Kneip, J.M. Alvarez, H. Li, X. Zhang, Q. Yu, Learning image matching by simply watching video, in European Conference on Computer Vision (Springer, Cham, 2016), pp. 434–450 49. D.G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004) 50. B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in In IJCAI81 (1981), pp. 674–679 51. W. Luo, A.G. Schwing, R. Urtasun, Efficient deep learning for stereo matching, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 5695–5703 52. M.J. Marín-Jiménez, F.M. Castro, N. Guil, F. de la Torre, R. Medina-Carnicer, Deep multitask learning for gait-based biometrics, in 2017 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2017), pp. 106–110 53. D. Maurer, A. Bruhn, ProFlow: learning to predict optical flow (2018), arXiv preprint arXiv:1806.00800 54. D. Maurer, M. Stoll, A. Bruhn, Order-adaptive and illumination-aware variational optical flow refinement, in Proceedings of the British Machine Vision Conference (2017), pp. 9–26 55. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016) 56. N. Mayer, E. Ilg, P. Fischer, C. Hazirbas, D. Cremers, A. Dosovitskiy, T. Brox, What makes good synthetic training data for learning disparity and optical flow estimation? Int. J. Comput. Vis. 126(9), 942–960. https://doi.org/10.1007/s11263-018-1082-6 57. S. Meister, J. Hur, S. Roth, Unflow: unsupervised learning of optical flow with a bidirectional census loss (2017), arXiv preprint arXiv:1711.07837 58. M. Menze, A. Geiger, Object scene flow for autonomous vehicles, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3061–3070 59. M. Menze, C. Heipke, A. Geiger, Discrete optimization for optical flow, in German Conference on Pattern Recognition (Springer, Cham, 2015), pp. 16–28 60. Y. Mileva, A. Bruhn, J. Weickert, Illumination-robust variational optical flow with photometric invariants, in Joint Pattern Recognition Symposium (Springer, Berlin, 2007), pp. 152–162 61. M. Neoral, J. Šochman, J. Matas, Continual occlusions and optical flow estimation (2018), arXiv preprint arXiv:1811.01602 62. J.Y.H. Ng, J. Choi, J. Neumann, L.S. Davis, ActionFlowNet: learning motion representation for action recognition, in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (IEEE, Piscataway, 2018), pp. 1616–1624 63. S. Niklaus, L. Mai, F. Liu, Video frame interpolation via adaptive separable convolution (2017), arXiv preprint arXiv:1708.01692 64. P. Ochs, J. Malik, T. Brox, Segmentation of moving objects by long term video analysis. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1187–1200 (2014) 65. J. Pang, W. Sun, J.S.J. Ren, C. Yang, Q. Yan, Cascade residual learning: a two-stage convolutional neural network for stereo matching, in ICCV Workshops, vol. 7 (2017) 66. N. Papenberg, A. Bruhn, T. Brox, S. Didas, J. Weickert, Highly accurate optic flow computation with theoretically justified warping. Int. J. Comput. Vis. 67(2), 141–158 (2006) 67. F. 
Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 724–732 68. B. Pesquet-Popescu, M. Cagnazzo, F. Dufaux, Motion estimation techniques, in TELECOM ParisTech (2016)


69. T. Pfister, J. Charles, A. Zisserman, Flowing convNets for human pose estimation in videos, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 1913– 1921 70. A. Ranjan, M.J. Black, Optical flow estimation using a spatial pyramid network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 71. A. Ranjan, V. Jampani, K. Kim, D. Sun, J. Wulff, M.J. Black, Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation (2018), arXiv preprint arXiv:1805.09806 72. Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, H. Zha, Unsupervised deep learning for optical flow estimation, in AAAI, vol. 3 (2017), p. 7 73. Z. Ren, O. Gallo, D. Sun, M.-H. Yang, E.B. Sudderth, J. Kautz, A fusion approach for multiframe optical flow estimation (2018), arXiv preprint arXiv:1810.10066 74. J. Revaud, P. Weinzaepfel, Z. Harchaoui, C. Schmid, EpicFlow: edge-preserving interpolation of correspondences for optical flow, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1164–1172 75. J. Revaud, P. Weinzaepfel, Z. Harchaoui, C. Schmid, DeepMatching: hierarchical deformable dense matching. Int. J. Comput. Vis. 120(3), 300–323 (2016) 76. S.R. Richter, Z. Hayder, V. Koltun, Playing for benchmarks, in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017 (2017), pp. 2232–2241 77. O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computerassisted Intervention (Springer, Cham, 2015), pp. 234–241 78. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015) 79. S. Savian, Benchmarking The Imbalanced Behavior of Deep Learning Based Optical Flow Estimators, 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), ND, Lecture Notes in Computer Science, IEEE, NJ (2019) 80. L. Sevilla-Lara, D. Sun, V. Jampani, M.J. Black, Optical flow with semantic segmentation and localized layers, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 3889–3898 81. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition (2014), arXiv preprint arXiv:1409.1556 82. M. Smiatacz, Liveness measurements using optical flow for biometric person authentication. Metrol. Meas. Syst. 19(2), 257–268 (2012) 83. K. Soomro, A.R. Zamir, M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild (2012), arXiv preprint arXiv:1212.0402 84. D. Sun, J. Wulff, E.B. Sudderth, H. Pfister, M.J. Black, A fully-connected layered model of foreground and background flow, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (June 2013), pp. 2451–2458 85. D. Sun, X. Yang, M.-Y. Liu, J. Kautz, Pwc-net: CNNs for optical flow using pyramid, warping, and cost volume (2017), arXiv preprint arXiv:1709.02371, preprint, original paper is published on CVPR, June 2018 86. D. Sun, X. Yang, M.-Y. Liu, J. Kautz, Models matter, so does training: an empirical study of CNNs for optical flow estimation (2018), arXiv preprint arXiv:1809.05571 87. K. Sundararajan, D.L. Woodard, Deep learning for biometrics: a survey. ACM Comput. Surv. 51(3), 65:1–65:34 (2018) 88. C. Szegedy, W. 
Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015) 89. M. Tatarchenko, A. Dosovitskiy, T. Brox, Multi-view 3D models from single images with a convolutional network, in Computer Vision – ECCV 2016, ed. by B. Leibe, J. Matas, N. Sebe, M. Welling (Springer International Publishing, Cham, 2016), pp. 322–337

286

S. Savian et al.

90. E. Tola, V. Lepetit, P. Fua, DAISY: an efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010) 91. Y.-H. Tsai, M.-H. Yang, M.J. Black, Video segmentation via object flow, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 3899–3908 92. Z. Tu, W. Xie, D. Zhang, R. Poppe, R.C. Veltkamp, B. Li, J. Yuan, A survey of variational and CNN-based optical flow techniques. Signal Process. Image Commun. 72, 9–24 (2019) 93. S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, K. Fragkiadaki, SfM-Net: learning of structure and motion from video (2017), arXiv preprint arXiv:1704.07804 94. C. Wan, L. Wang, V.V. Phoha, A survey on gait recognition. ACM Comput. Surv. 51(5), 89 (2018) 95. M. Wang, W. Deng, Deep face recognition: a survey (2018), arXiv preprint arXiv:1804.06655 96. A.S. Wannenwetsch, M. Keuper, S. Roth, ProbFlow: joint optical flow and uncertainty estimation, in 2017 IEEE International Conference on Computer Vision (ICCV) (IEEE, Piscataway, 2017), pp. 1182–1191 97. P. Weinzaepfel, J. Revaud, Z. Harchaoui, C. Schmid, DeepFlow: large displacement optical flow with deep matching, in 2013 IEEE International Conference on Computer Vision (ICCV) (IEEE, Piscataway, 2013), pp. 1385–1392 98. J. Wulff, D.J. Butler, G.B. Stanley, M.J. Black, Lessons and insights from creating a synthetic optical flow benchmark, in ECCV Workshop on Unsolved Problems in Optical Flow and Stereo Estimation, ed. by A. Fusiello et al. (Eds.). Part II, Lecture Notes in Computer Science 7584 (Springer, Berlin, 2012), pp. 168–177 99. J. Wulff, M.J. Black, Temporal interpolation as an unsupervised pretraining task for optical flow estimation (2018), arXiv preprint arXiv:1809.08317 100. J. Wulff, L. Sevilla-Lara, M.J. Black, Optical flow in mostly rigid scenes, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (IEEE, Piscataway, 2017), p. 7 101. X. Xiang, M. Zhai, R. Zhang, Y. Qiao, A. El Saddik, Deep optical flow supervised learning with prior assumptions. IEEE Access 6, 43222–43232 (2018) 102. B. Xiao, H. Wu, Y. Wei, Simple baselines for human pose estimation and tracking (2018), arXiv preprint arXiv:1804.06208 103. J. Xu, R. Ranftl, V. Koltun, Accurate optical flow via direct cost volume processing (2017), arXiv preprint arXiv:1704.07325 104. K. Yamaguchi, D. McAllester, R. Urtasun, Robust monocular epipolar flow estimation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013), pp. 1862–1869, 105. K. Yamaguchi, D. McAllester, R. Urtasun, Efficient joint segmentation, occlusion labeling, stereo and flow estimation, in European Conference on Computer Vision (Springer, Cham, 2014), pp. 756–771 106. G. Yang, Z. Deng, S. Wang, Z. Li, Masked label learning for optical flow regression, in 2018 24th International Conference on Pattern Recognition (ICPR) (IEEE, Piscataway, 2018), pp. 1139–1144 107. Y. Yang, S. Soatto, S2F: Slow-to-fast interpolator flow, in Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 108. I. Yildirim, T.D. Kulkarni, W.A. Freiwald, J.B. Tenenbaum, Efficient and robust analysisby-synthesis in vision: a computational framework, behavioral tests, and modeling neuronal representations, in Annual Conference of the Cognitive Science Society, vol. 1 (2015) 109. Z. Yin, J. 
Shi, GeoNet: unsupervised learning of dense depth, optical flow and camera pose, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (June 2018) 110. S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015) 111. J. Zbontar, Y. LeCun, Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 17(1–32), 2 (2016)

Optical Flow Estimation with Deep Learning, a Survey on Recent Advances

287

112. Y. Zhu, Z. Lan, S. Newsam, A.G. Hauptmann, Guided optical flow learning (2017), arXiv preprint arXiv:1702.02295 113. Y. Zhu, S. Newsam, DenseNet for dense flow, in 2017 IEEE International Conference on Image Processing (ICIP) (IEEE, Piscataway, 2017), pp. 790–794 114. H. Zimmer, A. Bruhn, J. Weickert, L. Valgaerts, A. Salgado, B. Rosenhahn, H.-P. Seidel, Complementary optic flow, in International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (Springer, Berlin, 2009), pp. 207–220

The Rise of Data-Driven Models in Presentation Attack Detection Luis A. M. Pereira, Allan Pinto, Fernanda A. Andaló, Alexandre M. Ferreira, Bahram Lavi, Aurea Soriano-Vargas, Marcos V. M. Cirne, and Anderson Rocha

1 Introduction In the contemporary society, oftentimes people and corporations manipulate information taking advantage of the increased adoption of digital systems—smartphones, bank and airport control systems, the Internet, and so on. To prevent such manipulations, much of the generated data, such as photos, conversational histories, transactions, and bank statements, should be protected from indiscriminate access, so people have control over their data and can maintain their right to privacy. Traditionally, data protection methods rely on the use of external knowledge (e.g., passwords and secret questions) or tokens (e.g., smartcards), which may not be secure, as they can be forgotten, lost, stolen, or manipulated with ease. In addition, by using knowledge- or token-based solutions, the user is not required to claim an identity [19], which allows the use of multiple identities by a single person. To overcome the disadvantages of traditional security methods, biometric systems use biological or behavioral traits pertaining to a user—face, iris, fingerprint, voice, gait, among others—to automatically recognize her/him, therefore, granting access to private data. A biometric system gathers traits from an individual through a sensor, extracts features from such traits, and compares them with feature templates in a database [19], enabling the recognition of particular individuals. However, biometric traits cannot be considered as completely private information, as we inevitably expose them in our everyday life [13]: our faces in

L. A. M. Pereira · A. Pinto · F. A. Andaló · A. M. Ferreira · B. Lavi · A. Soriano-Vargas M. V. M. Cirne · A. Rocha () Reasoning for Complex Data Laboratory, Institute of Computing, University of Campinas, Campinas, SP, Brazil e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected] © Springer Nature Switzerland AG 2020 R. Jiang et al. (eds.), Deep Biometrics, Unsupervised and Semi-Supervised Learning, https://doi.org/10.1007/978-3-030-32583-1_13


social media, our fingerprints where we touch, our gait when we are recorded by surveillance cameras, among many other examples. This leads to the biggest drawback of pure biometric systems. In practice, our traces can be captured and offered to the system by an adversary, in order to circumvent security mechanisms. The attack to a biometric system, which occurs whenever an adversary offers a counterfeit biometric trace to the acquisition sensor, is called a presentation attack (or spoofing attack in earliest literature). It is considered as the most dangerous type of attack to a biometric system [13], as the attacker primarily needs only access to the victim’s traits, which are often plenty, and replay them to the biometric sensor. In the earliest biometric methods and systems put into operation, there was little to no concern in providing countermeasures to presentation attacks, mainly because of the assumption that the counterfeiting of biometric traces, such as fingerprints and faces, was difficult to achieve. However, it was not long before we began to receive news of biometric systems hacked by the use of false traits. One of the most prominent examples took place in 2013,1 when a Brazilian physician fooled the biometric employee attendance device, by using prosthetic fingers bearing the fingerprints of co-workers (Fig. 1b). The investigation of this fraud scheme revealed that 300 public employees had been receiving pay without going to work. To deal with the urging threat, academia and industry have been researching and applying automated methods to counterattack presentation attacks, a field of research known as presentation attack detection (PAD). These methods perform the task of differentiating between genuine (or bona fide) trait samples from attack ones, referred to as presentation attacks. For some years, researched methods relied on human knowledge to automatically look for specific characteristics expected on genuine trait biometric samples, such as shape, texture, or liveness signs; or on attack samples, such as artifacts and noise. Recently, data-driven methods have been increasingly employed to learn relevant characteristics automatically from training data, yielding the state-of-the-art results for PAD without relying on knowledgebased algorithms to extract specific characteristics from samples. The clear advances in the area after the dissemination of data-driven approaches have resulted in the possibility of generating models more robust to variations and nuances, which are captured during training, and capable of extracting and analyzing relevant sets of characteristics without the need of aggregating prior human knowledge. Nevertheless, as the models generated by data-driven methods are well-fit to the training data, sometimes they do not generalize well when applied in a cross-dataset scenario, i.e., when testing data have characteristics not present/seen in the training data (such as different sensor noise, different geometric and photometric variations and distortions, etc.). This may indicate that methods based on prior knowledge conceivably complement data-driven ones. Although extremely important, these and other aspects are not often explored and discussed when new methods are published.

1 BBC News: https://www.bbc.com/news/world-latin-america-21756709.


In this chapter, we cover the relevant literature on data-driven PAD methods, presenting a critical analysis of open, often overlooked, issues and challenges, in order to shed light on the problem. We answer and provide insights to important questions surrounding PAD research and applications. In which scenarios do these methods fail? How can such methods complement the ones based on human knowledge? How robust are these methods under the cross-dataset scenario? Are these methods robust against new attack types? Do these methods provide other ways to model the PAD problem besides the classical binary decision? And finally, are these methods applicable to multi-biometric settings? To survey the relevant literature, we examine three widely used biometric modalities: face (Fig. 1a), fingerprint (Fig. 1b), and iris (Fig. 1c). Faces are the most common biometric characteristic used by humans to recognize others [10], and they are often considered in biometric systems due to their non-intrusiveness and simplicity of acquisition (by any current camera). Systems based on face biometrics can be attacked by the presentation of a photograph, video, or 3D model of the user’s face. Fingerprints are patterns of ridges and furrows located on the tip of each finger and can be captured by sensors that provide digital images of these patterns [10]. Fingerprint biometric systems are often spoofed by the presentation of fingerprints printed on paper or by 3D finger casts. Irises contain many distinctive features [10], such as ligaments, ridges, rings, and others, which favors their use as a biometric indicator. A fake iris sample can be created from an artificial eyeball, textured contact lens, and iris patterns printed on paper [43].

Fig. 1 Presentation attack modalities. (a) Face presentation attack, CASIA-FASD dataset [56] examples: genuine face, warped paper attack, cut paper attack, and video attack. (b) Fingerprint presentation attack: six silicone fingers used to fool the biometric employee attendance device at a hospital in Brazil. Source: BBC [1]. (c) Iris presentation attack, LivDet-Iris Warsaw 2017 dataset [54]: genuine iris and printed attack


2 Benchmarks to Evaluate PAD Solutions In this section, we describe the most adopted benchmarks in the three biometric authentication modalities explored in this paper. For each modality, there is a specific table to favor the comparison. These tables present some characteristics for each benchmark, such as dataset name, the total number of samples (videos or photos), the division between bona fide (BF) samples and PA (presentation attack) and, if the data is already split into training and testing, the train/test sets ratio. This last aspect is important to check whether the amount of data is enough to be used in a data-driven approach and to ensure that there are no significant unbalanced sets either between BF/PA or train/test. The number of subjects indicates diversity and average resolution is important to check the sample sizes (height × width) coming from different acquisition sensors. The setup column shows information regarding the capturing environment, and the following column describes which type of attack is considered. Acquisition and used devices for attacks give some information about the dataset representativeness, as increasing the number and adopting realist devices are pivotal for reducing the likelihood of specific artifacts that can be explored instead of specific characteristics. Some cells of these tables are marked as N/A, indicating “Not Applicable” and N/R indicating “Not Reported.” A short discussion about the aspects of these benchmarks is also provided. To facilitate the link between the benchmarks and the described research papers, Table 4 (Appendix 1) shows all references that have used each benchmark per modality.

2.1 Face PAD Face PAD benchmarks (Table 1) are usually composed of screen- and print-based forms of attacks. However, some of them have also tried to use 3D masks. In some of the benchmarks, the small number of subjects can be problematic as it shows less variability. For instance, 3DMAD and NUAA contain faces from less than 20 subjects. In turn, MSU USSA, although with more subjects, is composed of celebrity faces acquired from the Internet, which may not retain detailed acquisition information. The main drawback of CASIA is having only one device for the attacks, lacking variety on acquisition artifacts. This is not the case for UVAD dataset, which contains videos recorded with six different devices and considered seven different devices when performing presentation attacks.

2.2 Iris PAD Iris PAD benchmarks (Table 2) are, in part, similar to Face PAD ones from the attack interface perspective (screen- and print-based). Some of them, however, have also

Table 1 Face PAD benchmarks

Replay-Attack [7]: size 1300 photos and videos; BF/PA 140/700; train/test 360/480; 50 subjects; resolution 752 × 544 (photos), 720p (videos); setup: lighting and holding; types of attacks: screen and print; acquisition: Canon PowerShot SX150; devices for attacks: 4x iPhone 3GS, 4x iPad first generation and 2x Triumph-Adler DCC 2520 color laser.
CSMAD [4]: size 263 photos and videos; BF/PA 104/159; train/test N/R; 14 subjects; resolution 705 × 865; setup: four lighting conditions; types of attacks: six silicone masks and two wearing ways; acquisition: RealSense SR300 and Compact Pro (thermal); devices for attacks: Nikon Coolpix P520.
CASIA-FASD [56]: size 600 videos; BF/PA 150/450; train/test: yes; 50 subjects; resolution 640 × 480; setup: seven evaluation scenarios and three image qualities; types of attacks: three modes (warped photo, cut photo and video playback); acquisition: Sony NEX-5 and new and old USB cameras; devices for attacks: iPad.
3DMAD [12]: size 76,500 frames; BF/PA 51k/25,500; train/test N/R; 17 subjects; resolution 640 × 480; setup: three sessions (2 weeks interval), having 5 videos of 300 frames from each session; types of attacks: 3D mask; acquisition: Kinect; devices for attacks: ThatsMyFace.com.
MSU USSA [33]: size 10,260 images; BF/PA 1140/9120; train/test N/R; 1140 subjects; resolution 1226 × 813; setup: uncontrolled; types of attacks: screen and printed photo; acquisition: celebrities from the internet and Nexus 5; devices for attacks: Canon PowerShot 550D, iPhone 5S and HP Color LaserJet CP6015.
OULU-NPU [5]: size 4950 videos; BF/PA 720/2880; train/test 1800/1800; 55 subjects; resolution N/R; setup: lighting and background in three sessions; types of attacks: screen and print; acquisition: front cameras of six mobile devices; devices for attacks: two printers and two displays.
NUAA [47]: size 12,614 images; BF/PA 5105/7509; train/test: yes; 15 subjects; resolution 640 × 480; setup: N/R; types of attacks: printed photos; acquisition: webcam; devices for attacks: N/R.
UVAD [35]: size 17,076 videos; BF/PA 808/16,268; train/test N/R; 404 subjects; resolution 1366 × 768; setup: lighting, background and places in two sessions; types of attacks: screen; acquisition: six cameras; devices for attacks: seven displays.

Table 2 Iris PAD benchmarks

Clarkson17 [54]: size 6749; live/PA 3954/2795; train/test 3591/3158; 25 subjects; resolution 640 × 480; PA types: printed images, patterned contact lenses, and printouts of patterned contact lenses; acquisition: LG IrisAccess EOU2200; PA devices: iPhone 5 and printer.
IIITD Iris Spoofing [16]: size 4848; live/PA 0/4848; train/test N/R; 101 subjects; resolution 640 × 480; PA types: printed images (of live and contact-lens irises); acquisition: Cogent CIS 202, VistaFA2E and HP flatbed optical scanner; PA devices: HP Color LaserJet 2025 printer.
IIITD Delhi [23]: size 1120; live/PA N/R; train/test N/A; 224 subjects; resolution 320 × 240; PA types: textured contact lens; acquisition: JIRIS JPC1000 digital CMOS camera; PA devices: N/R.
ND CLD 2015 [11]: size 7300; live/PA 4800/2500; train/test 6000/1200; 278 subjects; resolution 640 × 480; PA types: textured and soft contact lens; acquisition: IrisAccess LG 4000 and IrisGuard AD100; PA devices: JJ, Ciba, UCL and ClearLab lenses.
Warsaw17 [54]: size 12,013; live/PA 5168/6845; train/test 4513/2990; 186 subjects; resolution 640 × 480; PA types: printed images; acquisition: IrisGuard AD100 and Sony EX-View CCD; PA devices: Panasonic ET100.
IIITD WVU [54]: size 7459; live/PA 2952/4507; train/test 3250+3000/4209; subjects N/R; resolution 640 × 480; PA types: textured lens and printed images; acquisition: CLI plus IIITD datasets and IrisShield sensor; PA devices: Ciba and Aryan for the lenses and HP P3015 for the printouts.
CASIA Iris Fake [44]: size 10,240; live/PA 6000/4120; train/test N/A; 500 subjects; resolution: diverse; PA types: print, contact, plastic and synth; acquisition: LG-H100; PA devices: Fuji Xerox C1110 printer, contact lenses, re-played video and artificial eyeballs.


explored specific aspects, such as contact lenses and artificial eyeballs (ND CLD 2015). The IIITD Iris Spoofing dataset is unique in providing a combined attack setting comprising paper printouts of eyes and contact lenses. Still, the most common attacks are static ones, in which printed iris photos are used. CASIA Iris Fake provides a considerable number of subjects and a consistent ratio between live and PA samples, although training and testing sets are not specified. Not having a fixed split between training and testing data can hinder comparisons of different solutions and may increase the likelihood of data leakage. An important aspect of IIITD WVU is its naturally cross-dataset setting, since its training set incorporates 3000 samples from the IIITD Iris Spoofing dataset.

2.3 Fingerprint PAD Fingerprint PAD benchmarks (Table 3) may consider different physical materials, such as ecoflex and gelatin. PBSKD is the benchmark with most spoofing materials (ten in total), creating one hundred spoof specimens that are acquired five times using two different fingerprint scanners. However, the most important data-source available nowadays is generated by the LivDet fingerprint competitions (years 2009, 2011, 2013, 2015, and 2017), which provide us with intra-sensor, cross-material, cross-sensor, and cross-dataset validation protocols. In the two last editions, testing samples were created using materials that are not seen in the training set, naturally enabling (and pushing for) open-set experimentation. The table does not show the numbers of each competition internal dataset, but it gives a holistic view of them concerning size, the ratio between live and PA samples, and the ratio between training and testing samples. The information about the number of subjects and resolution is omitted in the table as they differ among LivDet internal datasets given that their capturing scanners have different purposes (border control, mobile device authentication, among others).

3 Data-Driven Methods for Presentation Attack Detection In this section, we present state-of-the-art methods for the three modalities considered in this chapter (face, fingerprint, and iris). We present methods for each modality separately, as well as methods proposed for the multi-biometric scenario. We also discuss hybrid methods that combine data-driven and handcrafted methodologies; recently, some methods in the literature have shown how to mix these two approaches harmoniously in order to take advantage of both.

Table 3 Fingerprint PAD datasets

LivDet 2009 [27]: size 11,000; live/PA 5500/5500; train/test 2750/8250; scanners: Crossmatch, Identix and Biometrika; models: Verifier 300 LC, DFR2100 and FX2000; PA devices: gelatin, silicone, play-doh.
LivDet 2011 [53]: size 16,000; live/PA 8000/8000; train/test 8000/8000; scanners: Biometrika, Digital Persona, ItalData and Sagem; models: FX2000, 4000B, ET10 and MSO300; PA devices: gelatin, latex, ecoflex, play-doh, silicone and wood glue.
LivDet 2013 [14]: size 20,590; live/PA 11,740/8850; train/test 10,350/10,240; scanners: Biometrika, Crossmatch, ItalData and Swipe; models: FX2000, L Scan Guardian, ET10 and Swipe; PA devices: gelatin, body double, latex, play-doh, ecoflex, modasil and wood glue.
PBSKD [8]: size 1800; live/PA 900/900; train/test 1000/1000; scanners: CrossMatch and Lumidigm; models: Guardian 200 and Venus 302; PA devices: ecoflex, gelatin, latex body paint, ecoflex with silver colloidal ink coating, ecoflex with BarePaint coating, ecoflex with nanotips coating, Crayola model magic, wood glue, monster liquid latex, and 2D prints on office paper.
Bogus [45]: size 16,000; live/PA 8000/8000; train/test 8000/8000; scanners: Biometrika, Digital, Italdata and Sagem; models: FX2000, 4000B, ET10 and MSO300; PA devices: gelatin, play-doh, silicone, wood glue, ecoflex and silgum.


3.1 Face PAD Face recognition systems are one of the least intrusive biometric approaches and can be performed with low-cost sensors (e.g., smartphone cameras). The intrinsic nature of such systems, however, makes them the most vulnerable ones. An impostor can perform illegal access in such systems by presenting a synthetic sample to the acquisition sensor to impersonate a genuine user. An attack can be carried out through presenting, to the acquisition sensor, 2D-printed photos, electronic display of facial photos or videos, or 3D face masks. Mask-based attacks, although more sophisticated than the other forms, are increasingly easy to produce. The process of presenting synthetic samples to an acquisition sensor, however, inevitably includes noise information and telltales, which are added to the biometric signal and can be used to identify attempted attacks. Despite being widely used for face recognition, data-driven models have just a recent history in the Face PAD problem and have been showing their potential to detect this kind of attack. Existing solutions are distinct, but a slight tendency can be perceived for the ones based on neural networks. Usually, pre-trained Convolution Neural Network (CNN) architectures are used as feature extractors, and these features are then used to train a classifier (e.g., SVM). Ito et al. [18], for instance, have investigated two different CNN architectures for Face PAD: CIFAR-10 and AlexNet. Instead of using cropped images of faces (as in traditional face recognition literature), the authors used the whole image as input to their method. The rationale behind this approach is that by exploiting the whole image, more information about the artifacts present in synthetic samples can be acquired. Although the proposed method overcame some baselines, experiments were performed only on one dataset. Thus, no general conclusion can be drawn about the robustness of the presented methods. Due to the nature of data-driven approaches, it is not always possible to decode the artifacts in attacks that are being exploited by the model. The model is entirely in charge of extracting features from the data (images or videos) that maximize the learning process. This aspect, however, can be partially controlled by extracting dynamic and static features. In this context, Wu et al. [52] proposed a wellengineered data-driven method. The idea is that, by combining the movements of a person in a video (dynamic features) with texture features from the frames (static features), complementary telltales of an attack can be assessed. The method extracts static features frame-by-frame using a CNN. For dynamic features, however, it employs the horizontal and vertical optical flow by using the Lucas-Kanade Pyramid method to extract dynamic maps from the frames followed by the CNN on the dynamic maps. Both static and dynamic features are combined through concatenation and used as input to a binary SVM classifier for decision-making. Yang et al. [55] have also explored static and dynamic features. An initial step is based on the use of Local Binary Patterns (LBP) descriptors to extract more generalized and discriminative low-level features of face images. LBP features are successfully applied in an intra-dataset protocol, but the performance may


degrade severely in a more realistic scenario, i.e., inter- or cross-dataset protocol, due to factors such as abnormal shadings, specular highlights, and device noise [6]. For that reason, the authors encoded these low-level features into high-level features via deep learning and proposed a sparse auto-encoder (SAE) to tackle the aforementioned issues. SAE consists on the application of a sparse penalty in a traditional auto-encoder, which is an axisymmetric single hidden-layer neural network, to strengthen the generalization ability of the model. It has a significant advantage when addressing complex problems: extract characteristics that reflect the adhesion state. Finally, a binary SVM classifier is used to distinguish genuine from synthetic face samples. Nonetheless, the training was only performed with an intra-dataset protocol. In the last years, dictionary learning, a well-known machine-learning concept, has been introduced as a candidate to build deep architectures, creating a new branch called Deep Dictionary Learning (DDL). The idea consists of stacking up dictionary learning layers to form a DDL [48] structure. Manjani et al. [26] developed a DDL approach in which layers of single-level dictionaries are stacked one after another, yielding a sparse representation of features. The main advantages are the mitigation of the requirement of large training datasets, promising intra-dataset results, and the discernment between different types of attacks, even unknown ones. However, the main concerns about this representation are the difficulty of extracting fine-grained features to deal with real mask attacks, and the lack of generalization of the method.
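Several of the face PAD methods discussed in this subsection share the same skeleton: a CNN pre-trained on a large image dataset is used purely as a feature extractor, and a binary SVM takes the bona fide versus attack decision on top of the deep features. The snippet below is a minimal, hypothetical sketch of that generic pipeline only; it is not the implementation of any specific paper, the file names and labels are placeholders, and it assumes PyTorch, torchvision (version 0.13 or later for the string-based weights argument), Pillow, and scikit-learn are available.

```python
# Minimal sketch (not the authors' exact pipelines): a pre-trained CNN used as a
# fixed feature extractor, followed by a binary SVM separating bona fide from attacks.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import SVC

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ImageNet-pretrained backbone with the classification head removed.
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()          # 512-d embeddings instead of class scores
backbone.eval().to(device)

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def deep_features(image_paths):
    """Extract one deep feature vector per face image."""
    feats = []
    for path in image_paths:
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        feats.append(backbone(x).squeeze(0).cpu().numpy())
    return np.stack(feats)

# Hypothetical file lists and labels (1 = bona fide, 0 = presentation attack).
train_paths, train_labels = ["real_001.png", "attack_001.png"], [1, 0]
test_paths = ["probe_001.png"]

clf = SVC(kernel="rbf", C=1.0)             # binary decision, as in most surveyed works
clf.fit(deep_features(train_paths), train_labels)
print(clf.predict(deep_features(test_paths)))
```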

3.2 Iris PAD Despite the better accuracy of iris authentication methods in comparison with methods based on face and fingerprint traits, the use of this technology was long limited to protecting high-security systems and places, mainly due to the costs associated with its implementation. Nowadays, iris authentication methods permeate our daily life thanks to research efforts toward making processes and sensors cheaper and smaller, as can be found in modern computing devices such as smartphones. However, even with the high accuracy and advances in iris biometrics, current iris-based authentication systems still suffer from vulnerabilities to presentation attacks. Currently, the most effective PA methods to bypass an iris authentication system consist of showing to the acquisition sensor printed photos containing iris patterns of legitimate users, or textured and transparent contact lenses used for impersonating a legitimate user or for concealing the real identity of an attacker. From this perspective, we categorize the existing works on iris presentation attack detection into two non-disjoint groups based on the attack types they detect: methods aimed at print-based attacks, performed with printed iris images; and methods aimed at detecting attempted attacks performed with contact lenses. The first CNN-based approach proposed to detect iris presentation attacks was presented to the community by Menotti et al. [28]. In this study, the authors presented a unified framework to perform architecture- and filter-level optimization


for three biometric modalities (iris, face, and fingerprint). The proposed framework was developed based on a shallow CNN architecture, the SpoofNet, with two convolutional layers tailored to detect PAs in different modalities. This framework was convenient to the community due to the small size of the datasets freely available at the time to build iris PA systems. A drawback of the study, regarding the iris modality, relies on the fact that the SpoofNet was evaluated only for print-based attempted attacks. Similarly, Silva et al. [40] presented a study that evaluated the SpoofNet in other attack scenarios different from those proposed in [28]. In that work, the authors proposed some training methodologies considering the Notre Dame and IIIT-Delhi datasets, which are composed by Near Infrared (NIR) iris images that represent bona fide presentations and presentation attacks performed with textured (colored) contact lenses and soft (transparent) contacted lenses. The authors also adapted the SpoofNet to detect three types of images: non-attack, textured, and transparent contact lenses. The proposed method outperformed existing methods for the Notre Dame dataset achieving an overall accuracy of 82.80%. The reported results suggested that SpoofNet was able to detect transparent contact lenses better than textured contact lenses and bona fide presentations (iris images without any contact lenses for this particular dataset). The obtained results considering the IIITDelhi dataset reveal some limitations of the SpoofNet architecture to distinguish these different kinds of presentations, especially the confusion of bone fide samples with transparent contact lenses attacks. Toward overcoming the SpoofNet limitation, Raghavendra et al. [37] proposed a novel CNN architecture more robust to distinguish bona fide presentations, textured, and transparent contact lenses. Similarly to Silva et al. [40], the authors proposed a multi-class CNN designed to classify an input image into bona fide and PA performed with textured and transparent contact lenses. However, similar to He et al., the authors used normalized iris image patches as input to the CNN, while Silva et al. fed their CNN with raw images and also used a six-layer CNN and dropout, a mechanism to reduce overfitting of the network. In [17], He et al. proposed a multi-patch CNN capable of detecting both attack types, print- and contact lens-based attempted attacks, by training a CNN with normalized iris image patches of size 80 × 80 pixels, which can significantly reduce the trainable parameters of a deep network and, therefore, prevent possible generalization problems such as overfitting. The authors reported near-perfect classification results for both attack types. The comparison among the proposed method and other handcraft methods in prior art shows the superiority of CNNbased approaches over feature engineering-based methods. Pala and Bhanu [30] developed a deep learning approach based on triplet convolutional neural networks, whereby three networks map iris image patches into a representation space. This is done by either taking two real examples and a fake one or two fakes and a real one, yielding intra-class and inter-class distances. The goal is to learn a distance function so that two examples taken from the same class are closer together than two examples taken from different classes. Two samples of the same class are as close as possible, according to the learned distance function.


The method was evaluated in three different datasets containing print- and contactlens attack and compared with descriptor-based methods and a CNN approach from [28], achieving the lowest average classification error rates for all datasets. The main advantage of this framework relies on its small architecture, being easy to implement on hardware, with reduced computational complexity. However, there is no guarantee that the framework performs well in a cross-dataset scenario as no tests in this regard were discussed in their work. Kuehlkamp et al. [22] combined handcrafted and data-driven features to generate multiple transformations on the input data looking for more appropriate inputspace representations. Handcrafted features are obtained by extracting multiple viewpoints of binarized statistical image features (BSIF), which are then used to train lightweight CNNs. After that, a meta-analysis technique is used for selecting the most important and discriminative set of classifiers, performing meta-fusion from selected viewpoints to build a final classification model that performs well not only under cross-domain constraints, but also under intra- and cross-dataset setups. As an advantage, this approach offers an iris PAD algorithm that better generalizes to unknown attack types, also outperforming state-of-the-art methods in this regard.
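The triplet-based approach described above can be summarized as learning an embedding in which iris patches of the same class (live or fake) end up closer together than patches from different classes. The sketch below illustrates that idea in PyTorch with a toy embedding network and random tensors standing in for real patches; the layer sizes, margin, and patch dimensions are illustrative assumptions and are not taken from the original paper of Pala and Bhanu [30].

```python
# Illustrative sketch of triplet-based metric learning for iris PAD.
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    """Tiny CNN mapping 64x64 grayscale iris patches to a 64-d embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        z = self.fc(self.features(x).flatten(1))
        return nn.functional.normalize(z, dim=1)   # unit-norm embeddings

net = PatchEmbedder()
criterion = nn.TripletMarginLoss(margin=0.2)       # pulls anchor/positive together, pushes negative away
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Dummy triplets standing in for (two live patches, one fake patch) or the converse.
anchor, positive, negative = (torch.randn(8, 1, 64, 64) for _ in range(3))
loss = criterion(net(anchor), net(positive), net(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```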

3.3 Fingerprint PAD Fingerprints are one of the most present biometric traits nowadays, being widely adopted in security systems and sensitive environments. In some cases, a biometric system could be potentially defenseless against fake fingerprints. However, research has been made to mitigate the risk of attacks by proposing software- or hardwarebased solutions. Among the hardware-based solutions, fingerprint liveness detection has been considered by most of the recent works. For software-based solutions, deep learning approaches play a crucial role, yielding state-of-the-art results. Pala et al. [31] proposed a patch-based triplet siamese network for fingerprint PAD. Under a classical binary classification formulation (live/fake), the network comprises a deep metric learning framework that can generate representative features of real and artificial fingerprints. The proposed method evaluates liveness by comparing with the same fingerprint set of patches used for training, instead of requiring an enrollment database. It also tackles the limitations of current deep learning approaches regarding computational cost, thus allowing mobile and off-line implementation. In [29], three well-known deep learning networks were utilized in the form of a binary classification problem. The networks were fine-tuned using the weights of a pre-trained network originally trained on the ImageNet dataset [21], rather than training them from scratch for each network. The authors have shown the effect of data augmentation techniques not only in the case of deep learning framework but also when a shallow technique such as LBP was utilized. Additionally, they followed an experimental protocol taking cross-dataset validation into consideration and made a significant comparison among the methods in their approach. Sun-


daran et al. [45] showed how training a single CNN-based classifier using different available datasets can aid generalization and boost performance. An analytical study has been made in [49] for feature fusion by taking into account different features and methods. A two-stage deep neural network that starts from general image descriptors was adopted. In the first stage, the method is capable of simultaneously learning a transformation of different features into a common latent space used for classification in a second stage. Nogueira et al. [29] compared four deep learning techniques for liveness detection. They studied the effect of using pre-trained weights, concluding that using a pre-trained CNN could yield good results without considering modifying neither the architectures nor their hyperparameters. Deep Boltzmann Machines (DBMs) are another type of neural network that consist of learning stochastic energy-based on complex patterns. DBMs were considered in [9] and [42] for liveness detection in fingerprint data. In [9], a fingerprint spoofing detection method was proposed based on DBMs. After the network was trained on fingerprint data samples, a SVM classifier was trained in order to classify the high-level features generated by the DBMs. In [42], the authors proposed a DBM-based method which had a final layer added at the top of the network with two softmax units, forming an MLP network, to identify normal and attack patterns. Toosi et al. [50] proposed a patch-based approach for liveness detection. The method extracts a set of patches and then a classifier is applied for each patch by utilizing the AlexNet architecture as a feature extractor. The final class label is computed based on the probability scores of patches. Another patch-based method [32] attempted to detect liveness based on a voting strategy on patches classified by a CNN. Wang et al. [51] developed FingerNet, a DNN for fingerprint liveness detection with a voting strategy at the end for decision-making. Its architecture was inspired by another DNN called CIFAR-10, with the difference that the convolutional and pooling layers are more complex, besides the inclusion of an extra inner product layer. The training process, however, is not done directly on the images from these datasets. Instead, each image is cut into patches of 32 × 32 pixels, followed by a segmentation step that excludes patches depicting background content, leaving the remaining ones for training. The test process is also performed on image patches, and the voting strategy is then applied by computing labels of each patch within a fingerprint image. The label with the highest vote is chosen as the image label. Experiments were performed with LivDet2011 and LivDet2013 datasets, with FingerNet outperforming CIFAR-10. Zhang et al. [57] improved the Inception [46] architecture and built a lightweight CNN for fake fingerprint detection. In the proposed architecture, the original fully connected layer was replaced by a global average pooling layer to reduce overfitting and enhance robustness to spatial translations. For the experiments, the authors created an in-house 2D dataset with fingerprints made from different materials (along with some live examples). The reported results were expressed regarding a weighted average rate of correctly classified live fingerprints and fake fingerprints,


outperforming not only the original Inception architecture but also other methods from the literature based on CNNs.
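The patch-based voting scheme that recurs in several of the fingerprint works above boils down to a simple decision rule: split the fingerprint image into small patches, discard background patches, classify each remaining patch, and take the majority vote as the image label. The sketch below shows only that rule, with a dummy patch classifier and a crude variance-based segmentation standing in for the trained CNN and the real segmentation step; the thresholds and function names are hypothetical and do not reproduce any specific published system.

```python
# Sketch of a patch-voting decision rule for fingerprint liveness detection.
import numpy as np

def extract_patches(img, size=32):
    """Split a 2-D grayscale fingerprint image into non-overlapping size x size patches."""
    h, w = img.shape
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def is_foreground(patch, var_threshold=50.0):
    # Crude segmentation stand-in: low-variance patches are treated as background.
    return patch.var() > var_threshold

def classify_patch(patch):
    # Placeholder for a trained CNN returning 1 (live) or 0 (fake); here a dummy rule.
    return int(patch.mean() > 127)

def liveness_by_voting(img):
    votes = [classify_patch(p) for p in extract_patches(img) if is_foreground(p)]
    if not votes:                              # nothing but background: refuse to decide
        return None
    return int(sum(votes) * 2 >= len(votes))   # majority vote over patch labels

label = liveness_by_voting(np.random.randint(0, 256, (256, 256)).astype(np.float32))
print({1: "live", 0: "fake", None: "undecided"}[label])
```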

4 Countermeasures for Face, Iris, and Fingerprint Presentation Attack Detection In this section, we discuss two approaches to deploy countermeasures based on data-driven models for detecting presentation attacks, which were successfully used to detect face-, iris-, and fingerprint-based presentation attacks in prior art.

4.1 Architecture and Filter Optimization Finding suitable architectures for a given application is a challenging and time-consuming task due to the high dimensionality of the hyper-parameter space. Moreover, the absence of enough data makes the hyper-parameter search harder for some applications. For this reason, several techniques have been proposed in the literature to mitigate these problems with heuristics that find suitable architectures faster. The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization approach [3], which can construct models to approximate the performance of hyper-parameters based on historical observations. In summary, the TPE algorithm models the probabilities P(x|y) and P(y), where x represents the hyper-parameters and y represents the loss value associated with x. This modeling is performed by using historical observations to estimate non-parametric density distributions for the hyper-parameters, which are used to predict good values for them [2]. Another approach to explore the hyper-parameter space toward finding suitable architectures for a given problem is the random search algorithm. In general, this approach is more efficient than manual search and the grid search algorithm [34]. The random search algorithm explores the hyper-parameter space by identifying valid hyper-parameter assignments, which define a valid configuration space for the problem. Then, hyper-parameters are randomly selected, considering a uniform distribution. The main advantage of this strategy is that it is simple to implement in both non-parallel and parallel versions. The search for good filter weights in a deep learning architecture is also a challenging task. The huge number of parameters and the small size of the datasets available for PAD in the literature may prevent optimization algorithms from finding an optimal or even reasonable solution. According to Menotti et al. [28], architecture and filter optimization strategies are effective toward deploying suitable deep learning models. Figure 2 shows a



Fig. 2 Filter and architecture optimization results for face, fingerprint and iris presentation attack detection considering different available datasets. (a) Fingerprint modality. (b) Iris modality. (c) Face modality

comparison study of both strategies. The architecture optimization was performed considering shallow architectures with up to three layers and filter kernel with random weights, while the filter optimization was performed considering predefined architectures to the problem and Stochastic Gradient Descent (SGD) for optimizing the filter weights, which were initialized randomly. As we can observe, the architecture optimization approach was able to find architectures with near-perfect classification results for face PAD problem (Fig. 2c), with a Half Total Error Rate of ≈0.0%. For the iris PAD problem, this approach also found an architecture with near-perfect classification results, whose accuracy was superior to the filter optimization strategy in two of the three datasets (i.e., Warsaw and Biosec) evaluated for this modality and competitive results for the MobBio fake dataset, as illustrated in Fig. 2b. In contrast, for the fingerprint PAD problem, the filter optimization presented better results than the architecture optimization approach, obtaining a near-perfect classification result for all datasets evaluated in this modality (Fig. 2a). For this reason, the interplay of these two approaches is


recommended in cases where architecture optimization alone is not enough to find good solutions. Combining the two options is especially recommended when there is not enough training data to represent the PAD problem of interest.
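The random-search strategy discussed in this subsection can be sketched in a few lines. The example below samples shallow-architecture hyper-parameters uniformly from a small, hypothetical configuration space and keeps the best-scoring candidate; the evaluation function is a placeholder for "train the candidate PAD network and report its validation error", and none of the names or value ranges come from the protocol of [28].

```python
# Minimal random-search sketch over a hypothetical shallow-architecture space.
import random

SEARCH_SPACE = {
    "n_layers":      [1, 2, 3],
    "filters":       [16, 32, 64, 128],
    "kernel_size":   [3, 5, 7, 9],
    "pooling":       ["max", "avg"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def sample_architecture(rng):
    """Draw one valid hyper-parameter assignment uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def evaluate(architecture):
    # Placeholder: train/validate the candidate and return an error rate (lower is better).
    return random.random()

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best_cfg, best_err = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_architecture(rng)
        err = evaluate(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err

print(random_search())
```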

4.2 Fine-Tuning of Existing Architectures A second trend in the literature consists of building models with transfer-learning techniques, adapting pre-existing deep learning solutions pre-trained on hundreds of thousands of images to the PAD problem. Depending on the modality, the transfer-learning process can take advantage of pre-trained architectures whose source problem is related to the target problem, for instance, an architecture pre-trained for face recognition (source) being optimized for the face PAD problem (target). With this in mind, Pinto et al. [36] adapted the VGG network architecture [41], originally proposed for object recognition by the Visual Geometry Group, by transferring the knowledge obtained from a training process conducted on a huge dataset, ImageNet [21]. The fine-tuned architectures were evaluated considering the original protocols of the datasets used by the authors, as well as a cross-dataset protocol, in which training and testing subsets come from different sources. Figure 3 shows results for these two evaluation protocols. For all modalities, the cross-dataset evaluation protocol presented poor results in comparison with the intra-dataset evaluation protocol, with the exception of the ATVS Iris dataset. In several cases, the performance of PAD solutions ranges from near-perfect classification results to worse than random performance, showing a clear need for new techniques that consider the cross-dataset setup as well as robustness to unseen attacks (open-set scenarios).
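A hedged sketch of this fine-tuning strategy is given below: an ImageNet-pretrained VGG16 has its last fully connected layer replaced by a two-class (bona fide versus attack) head and is then trained on PAD data. This is a generic illustration rather than the exact configuration used in [36]; it assumes torchvision 0.13 or later for the string-based weights argument, and the data loader is a dummy stand-in for a real PAD training set of normalized 224x224 images with binary labels.

```python
# Generic fine-tuning sketch: ImageNet-pretrained VGG16 adapted to a two-class PAD head.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1")        # pre-trained backbone
for p in model.features.parameters():                # optionally freeze the convolutional base
    p.requires_grad = False
model.classifier[-1] = nn.Linear(4096, 2)            # new PAD head: bona fide vs. attack

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)

def finetune_one_epoch(loader, device="cpu"):
    model.train().to(device)
    for images, labels in loader:                    # loader: assumed PAD training set
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()

# Dummy batch standing in for a real data loader.
finetune_one_epoch([(torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 0, 1]))])
```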

5 Challenges, Open Questions, and Outlook This section presents the main existing limitations of current methods and shed some light on aspects that further research paths could undertake in order to tackle PAD problems. First of all, one of the main aspects preventing the introduction of robust methods is the lack of representative public datasets. Oftentimes the existing datasets lack generalization features and are normally small when we consider the era of big data. This limitation leads to another one, which is the difficulty of performing crossdataset and domain-adaptation validations. In the absence of data, it is hard to learn the changing aspects from one dataset to another or from one condition to another. Although we can see some reasonable effort from researchers in designing techniques for iris and fingerprint presentation attack detection, the same is still not true for faces. Most works in this modality still rely on handcrafted features.



Fig. 3 Intra- and cross-data validation results considering different datasets and biometric modalities. Note how performance degrades when we move from the controlled setting of intra-dataset validation to the wild settings of cross-dataset validation. (a) Face modality. (b) Iris modality. (c) Fingerprint modality

Probably this fact is related to the lack of good and representative datasets, but we believe the most likely reason is that faces vary much more than irises and fingerprints (shadows, lighting, pose, scaling), and deep learning methods would need very robust functions to deal with such transformations without incurring overfitting. This is probably a very promising area of research. Existing work in iris PAD using deep learning seems to be ahead of the fingerprint and face modalities. A recent method proposed in [22] seems to represent an important step toward cross-dataset validation. The meta-fusion of different views of the input data has yielded the best results and represents a good way of handling the variations across different datasets. Despite this, faces and fingerprints are still far from a reasonable path toward solving the problem. For fingerprints, there is a noticeable change in performance when changing acquisition sensors. For faces, although acquisition sensors play an important role in rendering the problem difficult to solve, old problems known to researchers since the development of the first face


recognition methods are still present such as shadow and illumination changes. Even in the iris case, oftentimes existing methods fail under domain shift conditions caused by cross-sensors and also for detecting transparent contact lenses. Clear efforts from the community are needed in this regard as well. Recently, Li et al. [25] have shed some light on learning generalized features for face presentation attack detection. The authors proposed training their solution with augmented facial samples based on cross-entropy loss and further enhanced the training with a specifically designed generalization loss, which coherently serves as a regularization term. The training samples from different domains can seamlessly work together for learning the generalized feature representation by manipulating their feature distribution distances. When learning domain shift conditions, it seems that proposed robust loss functions, as well as ways of implicitly learning data variations, is a promising path to solve this hard problem. With respect to open-set validation conditions and robustness against unseen types of attacks, the literature is far behind. One of the first works considering this aspect was presented in the context of fingerprint PAD [38], but the authors show that much is still to be done in this regard. For faces and irises, the story is even worse in this regard. This is surely one of the hottest topics in PAD nowadays along with domain shift robustness as non-existing work shows reasonable performance for open-set conditions in any of face, fingerprint or iris modalities. Thus far, most of the existing prior art in PAD relying on deep learning methods have only scratched the surface of their potential. Virtually all methods in the prior art only adopt existing networks and tweak them somehow. Faces are a bit different in this regard, as some authors have investigated the effects of dictionary learning and stacking of different solutions. Nonetheless, few authors have endeavored to propose significant modeling changes to such solutions, and this surely represents an open avenue of opportunities. Finally, it is worth mentioning that although we surveyed some 60 papers in this chapter, the literature still presents a series of open problems in face presentation attack detection ranging from proposing representative datasets to robust methods under the cross-dataset and open-set validation setups to representative methods robust to domain shifts. Calling the attention of the community for such problems was the main motivation that led us to write this chapter, and we sincerely hope researchers will consider these aspects in their future investigations for their PAD problems. Acknowledgements The authors thank the financial support of the European Union through the Horizon 2020 Identity Project as well as the São Paulo Research Foundation—Fapesp, through the grant #2017/12646-3 (DéjàVu), and the Brazilian Coordination for the Improvement of Higher Education Personnel—Capes, through the DeepEyes grant.


Appendix 1: Datasets and Research Work See Table 4. Table 4 Datasets per modality and prior work relying on them Modality Face PAD

Iris PAD

Fingerprint PAD

Dataset Replay-Attack SMAD CASIA-FASD 3DMAD MSU MFSD MSU USSA NUAA UVAD Clarkson17 IIITD Iris Spoofing IIITD Delhi ND CLD 2015 Warsaw17 IIITD WVU CASIA Iris Fake LivDet 2009 LivDet 2011 LivDet 2013 PBSKD Bogus

References [6, 18, 26, 39, 52] [26] [6, 26, 28, 39, 55] [24, 26, 28] [6, 24] [6, 55] [26] [22] [20] [20, 37, 40] [15, 17, 22, 37, 40] [17, 22, 28, 30] [22] [17] [29, 31, 32, 49] [29, 31, 49–51] [9, 28, 29, 31, 42, 49–51] [45]


Appendix 2: List of Acronyms

BSIF: Binarized statistical image features
CNN: Convolutional neural network
DBM: Deep Boltzmann machine
DCNN: Deep convolutional neural network
DNN: Deep neural network
HTER: Half total error rate
LBP: Local binary pattern
MLP: Multilayer perceptron
PA: Presentation attack
PAD: Presentation attack detection
SAE: Sparse auto-encoder
SVM: Support vector machine

References 1. BBC News, Doctor ‘used silicone fingers’ to sign in for colleagues. BBC News. https://www. bbc.com/news/world-latin-america-21756709 (2013). Accessed 13 Jan 2019 2. J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in Advances in Neural Information Processing Systems (2011), pp. 2546–2554 3. J. Bergstra, D. Yamins, D.D. Cox, Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures, in International Conference on Machine Learning (2013), pp. I-115–I-123 4. S. Bhattacharjee, A. Mohammadi, S. Marcel, Spoofing deep face recognition with custom silicone masks, in IEEE International Conference on Biometrics: Theory, Applications and Systems (2018) 5. Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, A. Hadid, OULU-NPU: a mobile face presentation attack database with real-world variations, in IEEE International Conference on Automatic Face and Gesture Recognition (2017) 6. G. Cai, S. Su, C. Leng, J. Wu, Y. Wu, S. Li, Cover patches: a general feature extraction strategy for spoofing detection. Concurrency Comput. Pract. Exp. 31, e4641 (2018) 7. I. Chingovska, A. Anjos, S. Marcel, On the effectiveness of local binary patterns in face antispoofing, in International Conference of the Biometrics Special Interest Group (2012) 8. T. Chugh, K. Cao, A.K. Jain, Fingerprint spoof buster: use of minutiae-centered patches. IEEE Trans. Inf. Forensics Secur. 13(9), 2190–2202 (2018) 9. G.B. de Souza, D.F. da Silva Santos, R.G. Pires, A.N. Marana, J.P. Papa, Deep features extraction for robust fingerprint spoofing attack detection. J. Artif. Intell. Soft Comput. Res. 9(1), 41–49 (2019) 10. K. Delac, M. Grgic, A survey of biometric recognition methods, in International Symposium Electronics in Marine, vol. 46 (2004), pp. 16–18 11. J.S. Doyle, K.W. Bowyer, Robust detection of textured contact lenses in iris recognition using BSIF. IEEE Access 3, 1672–1683 (2015) 12. N. Erdogmus, S. Marcel, Spoofing in 2D face recognition with 3D masks and anti-spoofing with Kinect, in IEEE International Conference on Biometrics: Theory, Applications and Systems (2013)



Index

A
Active Appearance Models (AAMs), 38–40
Active Shape Model (ASM), 39
Adversarial autoencoder (AAE), 11, 12, 17
Age estimation
  CNNs, 3
  datasets, 3–4
  deep learning based model, 3
  evaluation metrics, 4
  large-scale noise-free datasets, 10
  models, 2
  multi-column CNN, 8
  predicted age, 3
  See also Deep learning based age estimation model
Age-invariant face recognition (AIFR)
  approaches, 16–17
  CACD-VS, 14
  deep learning methods, 15–16
  use, 2
Age-oriented face image analysis models, 2
Age synthesis
  ageing process, 11, 13
  deep learning based methods, 12–13
  description, 10–11
  evaluation methods, 12
  future research on age synthesis, 13–14
  generative model, 11
  models, 2, 11–13
  rejuvenating process, 11
AlexNet Training, 203–204, 259
Android phone, 150, 154, 1557
Anti-spoofing techniques
  CASIA-FASD database, 64, 65
  dynamic approaches, 52
  evaluation protocol, 59–60
    Replay Attack database, 58, 60, 61
    Replay Mobile database, 58–59, 61
  FASNet, 53
  feature level, 52
  IQA (see Image quality assessment (IQA))
  multimodal techniques, 53
  score level, 52
  sensor level, 52
  static methods, 52
Artificial neural networks (ANNs), 79–80
Asymmetrical faces, 38
Attribute-complementary Re-ID net (ACRN), 25, 27, 33–34
Attribute consistency, 25, 31
Attribute Convolutional Net (ACN), 196
Attributed Pedestrians in Surveillance (APiS) database, 194
Attribute-Person Recognition (APR) network, 26, 27, 33
Attributes
  ACRN, 25, 27, 33–34
  biometric recognition accuracy, 24
  biometrics, 1
  consistency, 22
  PETA, 22, 23, 25, 27, 33–35
  recognition accuracy, 33
Automatic person identification system, 21

B
Bag of features (BOF), 73–74, 86
  algorithm, 86
  and CNN, 88
  parameters, 77
  performance, 93
  recognition accuracy, 91
Baidu face recognition model, 124
Bedfordshire University, 84
Behavioral authentication, 142–143
Benchmark data sets, 83–84
Binarized statistical image features (BSIF), 300
Biometric attribute recognition accuracy, 24
Biometric blockchain (BBC), 249
  blockchain technology (see Blockchain technology)
  communication network, 249
  intelligent vehicles, 249
  internet of vehicles, 245
  ITS, 246–247
  IV data sharing, 249, 250
  network enabled connected devices, 250
  peer-to-peer V2I communication, 246
  privacy issues, 253–254
  security protocols, 246–247
  services and smart contracts, 246
  supported intelligent vehicles, 251–253
  VCC, 250
  V2I, 245, 246
  V2V, 245, 246
Biometric credit (BC) layer, 245, 253
Biometric face recognition
  BOF, 73–74
  CNN, 76
  data sets, 72
  ML, 71
  research and development, 71
  SURF, 74–75
Biometric modalities, 51, 53, 162, 299, 305
Biometrics-as-a-service (BaaS), 249, 254
Biometrics attributes, 1
Biometric trait, see Face recognition; Iris recognition; Palmprint recognition
Blockchain technology
  cryptography algorithm, 247
  integrity, 247
  peer-to-peer network, 247
  vehicle, 247–249
Brox and Malik (BM), 263

C
California Institute of Technology, 84
Carnegie Mellon University Hyperspectral Face Dataset (CMU-HSFD), 223
CASIA face anti-spoofing (CASIA-FASD) database, 64–66
CASIA Heterogeneous Face Biometrics Database (CASIA HFB Database), 224, 225
CASIA Multispectral Palmprint Database, 235
Centralized infrastructure, 249
Chinese Academy of Sciences’ Institute of Automation (CASIA), 235
CIFAR-10, 301
Classifier pool, 153–154
Close-set face identification, 132
Cloud-based scheme
  architecture, 148
  data collection, 149
  event-based method, 149–150
  intelligent touch behavioral authentication, 148
  time-based method, 149
Cloud environment, 148, 153, 155
CNN-based models, 2
Computational experiments, 84–85
Conditional random field (CRF), 264
Constellation model for ear recognition (COM-Ear), 161, 162
  global and local features, 169
  global processing path, 168
  local processing path, 168–169
  motivation, 167
Constrained Local Models (CLMs), 38–40
Convolutional neural networks (CNNs), 80–81, 167, 219–220, 225, 226, 259, 297
  algorithm, 87–88
  and BOF techniques, 88
  computational time, 88
  custom feature extractor, 88
  D-CNN, 53
  DeepID1, 114
  landmark localisation methods, 41
  orthogonal embedding, 16
  and pairwise methods, 89
  pre-trained CNN model
    CASIA-FASD database, 64–66
    LDA and sigmoid classifiers, 64
    pre-processing steps, 63
    proposed scheme, 62
    state-of-the-art methods, 65–66
    training procedure, 63
  structure, 113
Cosine margin-based loss function, 127–128
Cost-based intelligent mechanism, 144–145
  behavioral authentication, 146
  classifier pool, 154
  decision tree, 145
  initial expected cost, 145
  involved users, 155
  modern phones, 154
  relative expected cost, 146
  workload reduction, 154
Cross Sensor Iris and Periocular (CSIP) database, 231
Cumulative matching characteristics (CMC), 24
Cumulative match score curve (CMC), 172
Customised loss functions, 5–8, 16

D
Data-driven models
  biometric system, 289–290
  data protection methods, 289
  digital systems, 289
  disadvantages, 289
  fingerprints, 291
  manipulations, 289
Data layer, 251, 253
Data protection methods, 289
Deep Boltzmann Machines (DBMs), 301
Deep convolutional neural network (DCNN), 53, 195–196
Deep Dictionary Learning (DDL), 298
DeepFace architecture, 122, 264
Deep face recognition models
  experimental datasets, 132
  on LFW and YTF datasets, 133
DeepID
  and DeepFace, 125
  structure, 105, 114
DeepID1, 114, 116, 130, 137
  and DeepID2, 114
DeepID2, 115–116
DeepID2+, 120
  architecture, 116–117
DeepID3, 117
  architecture, 119
  VGGNet and GoogleNet, 126
Deep learning (DL)
  anti-spoofing (see Anti-spoofing techniques)
  approaches
    coarse-to-fine iterative refinement, 269
    data augmentation, 267
    FlowNet, 265, 268
    FlowNet 2.0, 268
    FlowNet 3, 269
    FlownetS and FlownetC, 265–268
    FlyingThings3D, 268
    indirect-supervised method, 270–273
    multi objective, 270
    networks, 273
    overview, 265, 266
    semi-supervised method, 270–273
  backpropagation algorithm, 259
  biometric applications, 258
  biometric approaches, 257
  biometric systems, 257
  DL progress, 258
  face recognition (see Face recognition)
  FCNs, 259
  gait recognition, 257
  GANs, 260
  hybrid methods, 258
  landmark localisation methods (see Landmark localisation)
  motion clues, 257
  optical flow estimation, 257–258
  person Re-ID (see Person re-identification (person Re-ID))
  physiological characteristics, 257
  siamese networks, 259–260
Deep learning based age estimation model
  cross-entropy loss, 5
  customised loss functions, 5–8
  manifold learning algorithms, 5
  multi-task learning, 8–10
  network architecture, modification, 8
Deep learning based methods
  DeepIrisNet, 233
  IMP database, 232–233
  LightCNN29, 233
  VGG-Net, 232
Deep learning based single attribute recognition model (DeepSAR), 196
Deep learning models
  comparative analysis, 102–103
  dataset curation, 131
  face recognition, 102–103
  research work, 103
  subject-dependent protocol, 131
Deep Neural Networks
Direct attacks, 51
DiscreteFlow, 264, 265
Distributed biometric credits, 253

E
Ear recognition
  ablation study, 174–175
  automated recognition systems, 162
  COM-Ear (see Constellation model for ear recognition (COM-Ear))
  constrained conditions, 163–165
  controlled imaging conditions, 162
  experimental datasets, 171–172
  implementation details, 170
  model training, 170
  performance metrics, 172–173
  qualitative evaluation, 181–186
  robustness to occlusions, 179–181
  state-of-the-art models (see State-of-the-art models)
  training details, 173–174
  unconstrained conditions, 162, 165–166
Electromagnetic spectrum, 215–216
EpicFlow, 264–265
Essex Grimace data set, 86
Euclidean distance loss function, 126–127
Extended Annotated Web Ears (AWEx), 162, 163, 171, 186

F
Face++, 111, 121, 122, 130, 132, 133
Face alignment, 38
  algorithms, 113
Face Alignment Network (FAN), 39, 41, 43–45, 47
Face anti-spoofing network (FASNet), 53, 64, 66
Face appearance analysis, 52
Face-VS-background motion (FBM), 54–55
Face biometric, 72, 224, 227
Face detection
  CNN cascade, 112
  and modelling, 41, 45
  test pipeline, 112
Face matching
  identification, 129
  processing, deep features, 129–130
  3D reconstruction, 130
  verification, 128–129
FaceNet, 122
  harmonic triplet loss, 123
  model structure, 123
  performance, 123–124
Face recognition
  algorithms, 134
  CASIA HFB Database, 224, 225
  CASIA NIR-VIS 2.0 Database, 225, 226
  CMU-HSFD dataset, 223
  CNN, 135–136, 225, 226
  components, 106–107
  concepts and terminologies, 106
  conventional pipeline, 99–100
  current literature, 103–104
  datasets
    deep learning networks, 109
    preparation process, 109
  DeepFace, 105
  deep feature extraction, 107–108
  deep learning architectures, 111–112
  face recognition (see Face recognition)
  facial thermograms, 222
  factor, 134
  HK PolyU-HSFD dataset, 223
  HK PolyU NIR Face Database, 224
  identification and authentication, 51
  machine learning algorithms, 222
  matching, 108
  near-infrared spectrum images, 134–135
  NIRFaceNet, 226–227
  processing, 107
  recognition systems, 222
  thermal imaging, 222
  3D training datasets, 135
  traditional, problem, 111
  training and testing phases, 108
  unique and universal, 221
  UWA-HSFD dataset, 224
Facial analysis
False acceptance rate (FAR), 151
False rejection rate (FRR), 151
Feature extraction techniques
  BOF, 76–77
  HOG, 78
  SURF, 77–78
FlowFields, 264
Freebase knowledge graph, 110
Fully convolutional networks (FCNs), 41, 196, 259

G
Gallery set images, 218
Gatech data sets, 89
Generative adversarial networks (GANs), 11, 260
Geometric approaches, 164
Georgia Technology Institute Face Database, 84
GoogleNet, 259

H
Handshake layer, 253
Histogram of oriented gradients (HOG), 75–76, 78–79, 263
The Hong Kong Polytechnic University Hyperspectral Face Database (HK PolyU-HSFD), 223
The Hong Kong Polytechnic University (PolyU) NIR Face Database, 224
Hybrid methods, 275–278
HyperFace, 41

I
Identity-Preserving Conditional GAN (IPCGAN), 13
IIITD Multispectral Periocular Database (IMP), 229
ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 259
Image quality assessment (IQA)
  FR- and NR-IQMs, 56
  frame extraction, 54–55
  selected IQM, 57–58
  stages, 54
Image quality measures (IQMs)
  computational cost, 66
  full-reference (FR), 53
  no-reference (NR), 53
  quality artifacts, 52
  reduced reference (RR), 53
Indirect-supervised method, 270–273
Intelligent mechanism
  behavioral authentication, smartphones, 146
  cost-based metric, 145
  evaluation, 152
Intelligent transport systems (ITS), 246–247
Intelligent Vehicle Biometric Credits (IV-BC), 247
Intelligent vehicles, 249
2017 International Joint Conference on Biometrics (IJCB 2017), 176–177
Intersection-over-Union (IoU), 208
Iris recognition
  CASIA-Iris-Thousand Database, 229
  cellular density and pigmentation, 227
  CSIP dataset, 231
  deep learning based methods (see Deep learning based methods)
  face and fingerprint, 228
  IIT Delhi Iris Database, 228
  IMP database, 229
  InfraRed (IR), 227
  University of Notre Dame Dataset, 229–231
  VISOB dataset, 231
Islamic Azad University (IAU), 177
IV data sharing, 245, 246, 249, 250

J
Japanese Female Facial Expression Database, 84
Joint Recurrent Learning (JRL) model, 197

K
K-Nearest Neighbors (KNN), 194–195

L
Landmark localisation
  application examples, 38
  CNN, 41
  evaluation, 45–47
  geometry-based features, 37
  task, 37
  texture features, 37
  traditional methods, 39–40
Liveness detection, 52, 55
Local binary patterns (LBP), 297–298
Long-Wave InfraRed (LWIR), 221
Lucas Kanade (LK) approach, 262

M
Machine Learning (ML), 71–72
Markov Networks, 40
MATLAB Deep Learning Toolbox, 88
MATLAB Parallel Computing Toolbox, 82–83
Maximum Mean Discrepancy (MMD), 31–33
Mean absolute error (MAE), 4
Megvii face recognition system, 122
Mid-Wave InfraRed (MWIR), 221
Mobile device, 141
  touchscreen, 143–144
Multi-attribute recognition (DeepMAR), 196
Multi-frame methods, 278
Multi-label convolutional neural network (MLCNN), 196
Multi-people tracking (MPT), 280
Multi-task Mid-level Feature Alignment (MMFA) network, 31–33

N
Near Infrared (NIR), 299
Network layer, 253
Neural networks, 232–233
Newton–Raphson technique, 262
Normalised Mean Error (NME), 45, 47

O
Open-end research questions, 136–137
Optical flow estimation
  assumptions, 260
  biometrics applications, 280–281
  data augmentation, 275
  datasets, 273–275
  definition, 260
  estimation ranking, 280
  flow estimation benchmarks, 279
  handcrafted methods, 261
  hybrid methods, 275–278
  motion velocity vector, 260
  multi-frame methods, 278
  patch based methods, 262–263
  performance assessment, 279
  training schedule, 275
  variational methods, 261, 262
  variational refinement, 263–265
Optimisation, 13, 29, 39, 94

P
Pairwise classification function, 81–82
Palmprint recognition
  CASIA Multispectral Palmprint Database, 235
  fingerprint, 234
  PCANet deep learning, 235–237
  PolyU Multispectral Palmprint Database, 234, 235
  spectral bands, 234
  spectral imaging, 234
Patch based methods, 261–263
Patch-relevant-information pooling (PRIpooling), 169
Pedestrian attributes recognition dataset (PETA), 22, 23, 25, 27, 33–35
Peer-to-peer communication, 248
Person re-identification (person Re-ID)
  attributes, 22
  CMC curve, 24
  evaluation metrics, 24
  real-world surveillance system, 29
  semi-supervised person Re-ID
    MMFA network, 31–33
    SSDAL, 27–28
  with soft biometrics annotations, 22–24
  TJ-AIDL, 29, 31, 32, 34
Physical layer, 251
Physiological attributes, 1
PolyU Multispectral Palmprint Database, 234, 235
Presentation attack, 290
Presentation attack detection (PAD), 51, 53, 61, 290
  architecture and filter optimization, 302–304
  benchmark, 292
  cross-dataset and domain-adaptation validations, 304, 305
  data-driven methods
    face, 297–298
    fingerprint, 300–302
    iris, 298–300
  datasets and research work, 307
  face, 292, 293
  faces and fingerprints, 305–306
  fine-tuning of existing architectures, 304
  fingerprint, 295
  iris, 292, 294, 295
Presentation layer, 253
Principal component analysis (PCA), 39, 40, 75
“Pristine image”, 53
Privacy-protected mechanism, 253
Probabilistic collaborative representation classifier (P-CRC), 53

R
Receiver operating characteristic (ROC), 173
Recognition, 1–2
Rectified Linear Units (ReLUs), 195
Recurrent Neural Network (RNN), 197
Region of Interest (RoI), 196
Region Proposal Network (RPN), 196
Robust Cascaded Pose Regression, 40

S
Scale invariant feature transform (SIFT), 262–263
Semi-supervised deep attribute learning (SSDAL), 27, 28, 33, 34
Semi-supervised method, 270–273
Service layer, 253
Siamese networks, 259–260
Smartphones, 141–142, 298
Soft biometrics
  advantages, 192–193
  age-related, 2
  attribute recognition, 196–198
  biometric characteristics, 191
  computer vision and deep learning, 194
  dataset overview, 204–206
  DCNN, 195–196
  descriptions, 1, 21
  evaluation metric, 206–208
  face recognition system, 191, 192
  fingerprint-based system, 194
  gender and cloth color, 193
  handcrafted features, 194–195
  Mask R-CNN, 193
  person retrieval system
    AlexNet Training, 199, 203–204
    camera calibration, 200–202
    gender classification, 203
    height estimation, 200–202
    linear filtering approach, 199
    Mask R-CNN, 199
    torso color detection, 202–203
  qualitative and quantitative results
    challenges, 210–212
    IoU, 208
    TPR, 208–210
  retrieval accuracy, 212
  semantic description, 192
  as semantic mid-level features, 24
  semi-supervised person Re-ID (see Semi-supervised deep attribute learning (SSDAL))
  supervised attribute assisted
    identification based person Re-ID, 26–27
    verification based person Re-ID, 24–26
  vs. traditional, 2
  traits, 2
Softmax Cross Entropy, 29, 32
Softmax loss, 125, 127, 128, 225
Sparse auto-encoder (SAE), 298
Spectral bands, 220–221
Spectral biometric systems
  banking sector, 215
  criteria, 217
  design perspective, 218–220
  electromagnetic spectrum, 215–216
  feature extraction methods, 215
  fusion, 216
  hyperspectral imaging, 216
  remote sensing, 216
  spectral bands, 216, 220–221
Spectral imaging
  electromagnetic system, 216
  infrared, 220
  iris, 227
  multimodal biometric system, 234
  multispectral and hyperspectral imaging, 216
  palmprint, 218
Spectrum-1 NNet, 232
Spectrum-2 NNet, 232
Speeded-Up Robust Features (SURF), 74–75, 77–78, 86, 88, 90–92
  median performance, 92
Spoofing attacks, 51
Stacked hourglass network
  depth network for 3D, 44
  FAN, 44, 45
  hourglass design, 42
  human pose estimation, 41
  with intermediate supervision, 42–43
State-of-the-art models
  comparative evaluation, 176–177
  UERC, 176–179
Stochastic gradient descent (SGD), 173, 303
Support vector machines (SVMs), 194
Surveillance
  near-infrared spectrum images, 134
  soft biometrics (see Soft biometrics)
  supervised learning frameworks, 29

T
Testing set, 218
Texture analysis-based techniques, 52
Touch behavioral authentication schemes, 144
  authentication signature, 147
  feature, 147
Touch dynamics, 142, 143
Touchscreen mobile devices, 143–144
Traditional biometrics vs. soft biometrics, 2
Traditional machine learning, 5, 10
Transferable joint attribute-identity deep learning (TJ-AIDL), 29, 31, 32, 34
Tree Shape Model (TSM), 40, 45, 47
Tree-structured Parzen Estimator (TPE), 302
True positive rate (TPR), 208–210

U
Unconstrained Ear Recognition Challenge (UERC), 165, 171, 176–180
User authentication mechanisms, 142
The UWA Hyperspectral Face Database (UWA-HSFD), 224

V
Variational methods, 261, 262
Vehicle-to-Infrastructure (V2I) data sharing, 245, 246
Vehicle-to-Vehicle (V2V) data sharing, 245, 246
Vehicular ad-hoc network (VANET), 247
Vehicular cloud computing (VCC), 250
View-sensitive pedestrian attribute approach (VeSPA), 197
Visible light mobile ocular biometric (VISOB), 231
Visual Geometry Group (VGG), 304