Fine-Grained Image Analysis: Modern Approaches (ISBN 3031313739, 9783031313738)

This book provides a comprehensive overview of fine-grained image analysis research and modern approaches based on deep learning.


English, 211 pages, 2023


Table of contents :
Foreword
Preface
Contents
1 Introduction
2 Background
2.1 Problem and Challenges
2.2 Recognition Versus Retrieval
2.3 Domain-Specific Applications Related to Fine-Grained Image Analysis
3 Benchmark Datasets
3.1 Introduction
3.2 Fine-Grained Recognition Datasets
3.2.1 CUB200-2011
3.2.2 Stanford Dogs
3.2.3 Stanford Cars
3.2.4 Oxford Flowers
3.2.5 iNaturalist
3.2.6 RPC
3.3 Fine-Grained Retrieval Datasets
3.3.1 DeepFashion
3.3.2 SBIR
3.3.3 QMUL Datasets
3.3.4 FG-Xmedia
4 Fine-Grained Image Recognition
4.1 Introduction
4.2 Recognition by Localization-Classification Subnetworks
4.2.1 Employing Detection or Segmentation Techniques
4.2.2 Utilizing Deep Filters
4.2.3 Leveraging Attention Mechanisms
4.2.4 Other Methods
4.3 Recognition by End-to-End Feature Encoding
4.3.1 Performing High-Order Feature Interactions
4.3.2 Designing Specific Loss Functions
4.3.3 Other Methods
4.4 Recognition with External Information
4.4.1 Noisy Web Data
4.4.2 Multi-Modal Data
4.4.3 Humans-in-the-Loop
4.5 Summary
5 Fine-Grained Image Retrieval
5.1 Introduction
5.2 Content-Based Fine-Grained Image Retrieval
5.2.1 Selective Convolutional Descriptor Aggregation
5.2.2 Centralized Ranking Loss
5.2.3 Category-Specific Nuance Exploration Network
5.3 Sketch-Based Fine-Grained Image Retrieval
5.3.1 ``Sketch Me That Shoe''
5.3.2 Generalizing Fine-Grained Sketch-Based Image Retrieval
5.3.3 Jigsaw Puzzle for Fine-Grained Sketch-Based Image Retrieval
5.4 Summary
6 Resources and Future Work
6.1 Deep Learning-Based Toolboxes
6.1.1 Hawkeye for Fine-Grained Recognition
6.1.2 PyRetri for Fine-Grained Retrieval
6.2 Conclusion Remarks and Future Directions
A Vector, Matrix and Their Basic Operations
A.1 Vector and Its Operations
A.1.1 Vector
A.1.2 Vector Norm
A.1.3 Vector Operation
A.2 Matrix and Its Operations
A.2.1 Matrix
A.2.2 Matrix Norm
A.2.3 Matrix Operation
B Stochastic Gradient Descent
C Chain Rule
D Convolutional Neural Networks
D.1 Development History
D.2 Basic Structure
D.3 Feed-Forward Operations
D.4 Feed-Back Operations
D.5 Basic Operations in CNNs
D.5.1 Convolution Layers
D.5.2 Pooling Layers
D.5.3 Activation Functions
D.5.4 Fully Connected Layers
D.5.5 Objective Functions
References
Index


Synthesis Lectures on Computer Vision

Xiu-Shen Wei

Fine-Grained Image Analysis: Modern Approaches

Synthesis Lectures on Computer Vision
Series Editors: Gerard Medioni, University of Southern California, Los Angeles, CA, USA; Sven Dickinson, Department of Computer Science, University of Toronto, Toronto, ON, Canada

This series publishes on topics pertaining to computer vision and pattern recognition. The scope follows the purview of premier computer science conferences, and includes the science of scene reconstruction, event detection, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, and image restoration. As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner. As a technological discipline, computer vision seeks to apply its theories and models for the construction of computer vision systems, such as those in self-driving cars/navigation systems, medical image analysis, and industrial robots.

Xiu-Shen Wei

Fine-Grained Image Analysis: Modern Approaches

Xiu-Shen Wei VIP Group Southeast University Nanjing, China

ISSN 2153-1056 ISSN 2153-1064 (electronic) Synthesis Lectures on Computer Vision ISBN 978-3-031-31373-8 ISBN 978-3-031-31374-5 (eBook) https://doi.org/10.1007/978-3-031-31374-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Try not to become a man of success. Rather become a man of value. —Albert Einstein

Foreword

In the context of Computer Vision, the term fine-grained generally refers to subordinate categories. For example, “Indigo Bunting” is a fine-grained designation, in contrast to “bird”, which is an entry-level category. Humans generally find it easy to name entry-level categories: you don’t need Wikipedia to tell you what you’re seeing is a bird. Fine-grained categories, on the other hand—in domains ranging from tissue pathology to remote sensing to modern art—require specialized expertise. Prior to the Deep Learning revolution, i.e., before 2012, such specialized expertise eluded Machine Learning models. Early fine-grained datasets such as Oxford Flowers 102 or Caltech-UCSD Birds 200 served as sobering reminders of the formidable gap between human and machine expertise. The baseline classification accuracy on the latter dataset, using 2010-era features like color histograms, vector-quantized SIFT descriptors, and a linear SVM, was a meager 17.3%. Fast forward to today, and modern methods based on transformers achieve accuracy above 92%. How did we get to this point? The chapters that follow will tell this story, viewed through the lens of advances in Deep Learning over the last decade. Xiu-Shen Wei and his team of collaborators have made their own pioneering contributions to fine-grained image analysis, and with this book, Prof. Wei distills a wide range of topics into a format suitable for an advanced undergraduate or beginning graduate student who wishes to make novel contributions of their own. Here you will find everything: from a review of the leading datasets and benchmarks, to the fundamentals of a sketch-based, fine-grained image retrieval system, to appendices covering the fundamentals of convnets. Such a balance of coverage is ideal for a reader who wishes to grasp the most promising directions in a field that—like so many other areas of AI—is rapidly evolving.

Copenhagen, Denmark
February 2023

Serge Belongie


Preface

As a tool technology to expand the capabilities of human beings, Artificial Intelligence (AI) has made great progress in recent decades. Among these fields of AI, Fine-Grained Image Analysis (FGIA), aiming to retrieve and recognize images belonging to multiple subordinate categories of a super-category, extends the capabilities of domain experts to AI systems, and has also made leaps and bounds in its technological progress associated with the development of deep learning. Nowadays, FGIA has become a fundamental and popular research area in Computer Vision and Pattern Recognition, attracting the attention of many researchers in universities and industrial research institutions. As a result, this field flourishes with a multitude of topics and techniques from various perspectives. Our goal in writing this book is to provide a comprehensive overview of the research studies of FGIA associated with modern deep learning techniques, which also sheds light on advanced application areas and futuristic research topics. This book aims to systematically sort out these related tasks and applications and to introduce the fundamentals of convolutional neural networks, to help further our understanding of the current state and future potential of FGIA research. As a professional reference and research monograph, this book covers multiple popular research topics and includes cross-domain knowledge, which will benefit readers at various levels of expertise in broad fields of science and engineering, including professional researchers, graduate students, university faculty, etc. This book will help readers to systematically study the related topics and gives an overview of this field. Specifically, there are a total of 6 chapters in this book. Chapter 1 gives a brief introduction to FGIA. Chapter 2 introduces the background of FGIA in detail, as well as the difference and connection between two important FGIA tasks, i.e., recognition and retrieval. In Chap. 3, multiple popular benchmark datasets in FGIA are presented. In Chaps. 4 and 5, we introduce the main technical paradigms, technological developments, and representative approaches of fine-grained image recognition and fine-grained image retrieval, respectively. Finally, related resources, conclusions, and some potential future directions of FGIA are presented in Chap. 6. Additionally, we provide fundamentals of convolutional neural networks in the appendices to make it easier for readers to understand the technical content of the book.

The completion of this book owes not only to the work of the authors but also to many other individuals and groups. Special thanks go first to all members of the Visual Intelligence and Perception (VIP) Group, especially Xin-Yang Zhao, Yang Shen, Hao Chen, Jian Jin, Shu-Lin Xu, Yu-Yan Xu, Yang-Zhun Zhou, He-Yang Xu, Xu-Hao Sun, Jia-Bei He, Hong-Tao Yu, and Zi-Jian Zhu. We also thank Susanne Filler and Dharaneeswaran Sundaramurthy in the publication team at Springer Nature for their assistance. Finally, sincere thanks to our beloved families for their consideration and encouragement. Research efforts summarized in this book were supported in part by the National Key R&D Program of China (2021YFA1001100), the National Natural Science Foundation of China under Grant 62272231, the Natural Science Foundation of Jiangsu Province of China under Grant BK20210340, and the CAAI-Huawei MindSpore Open Fund.

Nanjing, China
February 2023

Xiu-Shen Wei

Contents

1 Introduction ..... 1
References ..... 3
2 Background ..... 5
2.1 Problem and Challenges ..... 5
2.2 Recognition Versus Retrieval ..... 8
2.3 Domain-Specific Applications Related to Fine-Grained Image Analysis ..... 9
References ..... 9
3 Benchmark Datasets ..... 11
3.1 Introduction ..... 11
3.2 Fine-Grained Recognition Datasets ..... 12
3.2.1 CUB200-2011 ..... 12
3.2.2 Stanford Dogs ..... 16
3.2.3 Stanford Cars ..... 16
3.2.4 Oxford Flowers ..... 17
3.2.5 iNaturalist ..... 19
3.2.6 RPC ..... 22
3.3 Fine-Grained Retrieval Datasets ..... 25
3.3.1 DeepFashion ..... 25
3.3.2 SBIR ..... 26
3.3.3 QMUL Datasets ..... 26
3.3.4 FG-Xmedia ..... 28
References ..... 29
4 Fine-Grained Image Recognition ..... 33
4.1 Introduction ..... 33
4.2 Recognition by Localization-Classification Subnetworks ..... 34
4.2.1 Employing Detection or Segmentation Techniques ..... 35
4.2.2 Utilizing Deep Filters ..... 51
4.2.3 Leveraging Attention Mechanisms ..... 59
4.2.4 Other Methods ..... 68
4.3 Recognition by End-to-End Feature Encoding ..... 78
4.3.1 Performing High-Order Feature Interactions ..... 78
4.3.2 Designing Specific Loss Functions ..... 87
4.3.3 Other Methods ..... 99
4.4 Recognition with External Information ..... 107
4.4.1 Noisy Web Data ..... 107
4.4.2 Multi-Modal Data ..... 114
4.4.3 Humans-in-the-Loop ..... 121
4.5 Summary ..... 127
References ..... 130
5 Fine-Grained Image Retrieval ..... 141
5.1 Introduction ..... 141
5.2 Content-Based Fine-Grained Image Retrieval ..... 143
5.2.1 Selective Convolutional Descriptor Aggregation ..... 145
5.2.2 Centralized Ranking Loss ..... 147
5.2.3 Category-Specific Nuance Exploration Network ..... 149
5.3 Sketch-Based Fine-Grained Image Retrieval ..... 152
5.3.1 “Sketch Me That Shoe” ..... 154
5.3.2 Generalizing Fine-Grained Sketch-Based Image Retrieval ..... 156
5.3.3 Jigsaw Puzzle for Fine-Grained Sketch-Based Image Retrieval ..... 160
5.4 Summary ..... 162
References ..... 163
6 Resources and Future Work ..... 167
6.1 Deep Learning-Based Toolboxes ..... 167
6.1.1 Hawkeye for Fine-Grained Recognition ..... 167
6.1.2 PyRetri for Fine-Grained Retrieval ..... 169
6.2 Conclusion Remarks and Future Directions ..... 171
References ..... 174
Appendix A: Vector, Matrix and Their Basic Operations ..... 177
Appendix B: Stochastic Gradient Descent ..... 181
Appendix C: Chain Rule ..... 185
Appendix D: Convolutional Neural Networks ..... 187
Index ..... 205

1 Introduction

The human visual system is inherently capable of fine-grained image reasoning: we are not only able to tell a dog from a bird, but also know the difference between a Siberian Husky and an Alaskan Malamute (see Fig. 1.1). Fine-Grained Image Analysis (FGIA) was introduced to the academic community for the very same purpose, i.e., to teach machines to “see” in a fine-grained manner. As an interesting, fundamental, and challenging problem in computer vision and pattern recognition, FGIA has been an active area of research for several decades. The goal of FGIA is to retrieve and recognize images belonging to multiple subordinate categories of a super-category (aka a meta-category or basic-level category), e.g., different species of animals/plants, different models of cars, different kinds of retail products, etc. Such objects are similar in overall appearance but differ in texture and other subtle details; for example, closely related dog breeds look alike yet differ in fine details. Distinguishing fine-grained differences is also an important ability of human visual perception, since the granularity of objects in the real world is always fine-grained. In the real world, FGIA supports a wide range of vision systems and applications in both industry and research, such as automatic biodiversity monitoring [6–8], intelligent retail [2, 3, 9], intelligent transportation [4, 12], and many more. Vision systems equipped with strong FGIA capabilities can bring a positive impact [1, 11] on conservation, promote economic growth, and improve the efficiency of social operations. Recently, deep learning techniques [5] have emerged as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of FGIA. In particular, these deep learning techniques have not only provided significant improvements for FGIA in different academic tasks, but have also greatly facilitated the application of FGIA in diverse industry scenarios.


Fig. 1.1 Fine-grained image analysis versus generic image analysis (using image classification as an example)

In academia, rough yearly statistics indicate that, on average, more than ten papers on deep learning-based FGIA techniques are published at each of the premier AI and CV conferences, such as AAAI, IJCAI, CVPR, ICCV, ECCV, etc. Additionally, a number of influential academic competitions on FGIA are frequently held on Kaggle, an online community of data scientists and machine learners (https://www.kaggle.com/). Representative competitions include, to name a few, the series of iNaturalist competitions for large numbers of natural species (https://www.kaggle.com/c/inaturalist-2019-fgvc6/overview), the Nature Conservancy Fisheries Monitoring competition for fish species categorization (https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring), Humpback Whale Identification for whale identity categorization (https://www.kaggle.com/c/humpback-whale-identification), and so on. Each competition attracted more than 300 teams worldwide, and some exceeded 2,000 teams. Moreover, tutorials and workshops dedicated to FGIA have been organized at top-tier international conferences, e.g., CVPR (see the CVPR 2021 Tutorial on Fine-Grained Visual Analysis with Deep Learning: https://fgva-cvpr21.github.io/), ICME, ACCV, and PRICAI. Additionally, top-tier international journals, e.g., IEEE TPAMI, ACM TOMM, Pattern Recognition Letters, and Neurocomputing, have hosted special issues targeting FGIA. All of these activities show that FGIA with deep learning is of notable research interest.

Given this period of rapid evolution, the aim of this book (a preliminary version of which was published as a survey paper [10] in IEEE TPAMI, Copyright © 2022, IEEE) is to provide a comprehensive account of fine-grained image analysis research and modern approaches based on deep learning, spanning the full range of topics needed for designing operational fine-grained image systems. After a thorough introductory chapter, each of the following five chapters focuses on a specific topic, reviewing background information, publicly available benchmark datasets, and up-to-date techniques with recent results, as well as offering challenges and future directions. In the appendices, we present fundamental content on deep learning, especially deep convolutional neural networks. Overall, this book can be useful for a variety of readers, and we specifically target two kinds of audiences. The first is university (undergraduate or graduate) students who are learning or studying computer vision, pattern recognition, and machine learning. The second is industry practitioners, such as AI/CV engineers and data scientists, who are working on AI/CV-related products.

References
1. IEEE international conference on computer vision 2019 workshop on computer vision for wildlife conservation. https://openaccess.thecvf.com/ICCV2019_workshops/ICCV2019_CVWC
2. Jia M, Shi M, Sirotenko M, Cui Y, Cardie C, Hariharan B, Adam H, Belongie S (2020) Fashionpedia: ontology, segmentation, and an attribute localization dataset. In: Proceedings of the European conference on computer vision, pp 316–332
3. Karlinsky L, Shtok J, Tzur Y, Tzadok A (2017) Fine-grained recognition of thousands of object categories with single-example training. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4113–4122
4. Khan SD, Ullah H (2019) A survey of advances in vision-based vehicle re-identification. Comput Vis Image Underst 182:50–63
5. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
6. Van Horn G, Branson S, Farrell R, Haber S, Barry J, Ipeirotis P, Perona P, Belongie S (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 595–604
7. Van Horn G, Cole E, Beery S, Wilber K, Belongie S, Mac Aodha O (2021) Benchmarking representation learning for natural world image collections. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12884–12893
8. Van Horn G, Mac Aodha O, Song Y, Cui Y, Sun C, Shepard A, Adam H, Perona P, Belongie S (2017) The iNaturalist species classification and detection dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8769–8778
9. Wei XS, Cui Q, Yang L, Wang P, Liu L, Yang J (2022) RPC: a large-scale and fine-grained retail product checkout dataset. SCIENCE CHINA Inf Sci 65(9):197101
10. Wei XS, Song YZ, Aodha OM, Wu J, Peng Y, Tang J, Yang J, Belongie S (2022) Fine-grained image analysis with deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(12):8927–8948
11. Wei Y, Tran S, Xu S, Kang B, Springer M (2020) Deep learning for retail product recognition: challenges and techniques. Comput Intell Neurosci 128:1–23
12. Yin J, Wu A, Zheng WS (2020) Fine-grained person re-identification. Int J Comput Vis 128:1654–1672

2 Background

In this chapter, we introduce the background knowledge of fine-grained image analysis, including both fine-grained recognition and retrieval. We first formulate the definition of fine-grained image analysis, then describe its major challenges, followed by the connection between fine-grained recognition and fine-grained retrieval, and finally discuss domain-specific applications related to fine-grained image analysis.

2.1 Problem and Challenges

Fine-Grained Image Analysis (FGIA) focuses on dealing with objects belonging to multiple subordinate categories of the same meta-category (e.g., different species of birds or different models of cars), and generally involves two central tasks: fine-grained image recognition and fine-grained image retrieval. As illustrated in Fig. 2.1, fine-grained analysis lies in the continuum between basic-level category analysis (i.e., generic image analysis) and instance-level analysis (e.g., the identification of individuals). Specifically, what distinguishes FGIA from generic image analysis is that in generic image analysis, target objects belong to coarse-grained meta-categories (i.e., basic-level categories) and are thus visually quite different (e.g., determining if an image contains a bird, a fruit, or a dog). However, in FGIA, since objects typically come from sub-categories of the same meta-category, the fine-grained nature of the problem causes them to be visually similar. As an example of fine-grained recognition, in Fig. 1.1, the task is to classify different breeds of dogs. For accurate image recognition, it is necessary to capture the subtle visual differences (e.g., discriminative features such as ears, noses, or tails). Characterizing such features is also desirable for other FGIA tasks (e.g., retrieval). Furthermore, as noted earlier, the fine-grained nature of the problem is challenging because of the small inter-class variations caused by highly similar sub-categories, and the large intra-class variations in poses, scales, and rotations (see Fig. 2.2). This is the opposite of generic image analysis (i.e., small intra-class variations and large inter-class variations), and it is what makes FGIA a unique and challenging problem.

Fig. 2.1 An illustration of fine-grained image analysis, which lies in the continuum between basic-level category analysis (i.e., generic image analysis) and instance-level analysis (e.g., car identification)

Fig. 2.2 Key challenges of fine-grained image analysis, i.e., small inter-class variations and large intra-class variations. Here we present four different Tern species from the CUB200-2011 dataset [11], one species per row, with different instances in the columns


While instance-level analysis typically targets a specific instance of an object, not just object categories or even object sub-categories, if we move down the spectrum of granularity, in the extreme, individual identification (e.g., face identification) can be viewed as a special instance of fine-grained recognition, where the granularity is at the individual identity level. For instance, person/vehicle re-identification [2, 17] can be considered a fine-grained task, which aims to determine whether two images are taken of the same specific person/vehicle. In practice, these works solve the corresponding domain-specific problems using methods related to FGIA, e.g., by capturing the discriminative parts of objects (faces, people, and vehicles) [7, 9, 18], discovering coarse-to-fine structural information [16], developing attribute-based models [3, 4], and so on. Research in these instance-level problems is also very active. However, since such problems are not within the scope of classical FGIA (see Fig. 2.1), for more information we refer readers to survey papers on these specific topics, e.g., [2, 12, 17].

In the following, we start by formulating the definition of classical FGIA.

Formulation: In generic image recognition, we are given a training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, \ldots, N\}$ containing multiple images and associated class labels (i.e., $x$ and $y$), where $y \in \{1, \ldots, C\}$. Each instance $(x, y)$ belongs to the joint space of the image and label spaces (i.e., $\mathcal{X}$ and $\mathcal{Y}$, respectively), according to the distribution $p_r(x, y)$:

$$(x, y) \in \mathcal{X} \times \mathcal{Y}. \qquad (2.1)$$

In particular, the label space $\mathcal{Y}$ is the union of the $C$ subspaces corresponding to the $C$ categories, i.e., $\mathcal{Y} = \mathcal{Y}_1 \cup \mathcal{Y}_2 \cup \cdots \cup \mathcal{Y}_c \cup \cdots \cup \mathcal{Y}_C$. Then, we can train a predictive/recognition deep network $f(x; \theta)$, parameterized by $\theta$, for generic image recognition by minimizing the expected risk

$$\min_{\theta} \ \mathbb{E}_{(x, y) \sim p_r(x, y)} \big[ L\big(y, f(x; \theta)\big) \big], \qquad (2.2)$$

where $L(\cdot, \cdot)$ is a loss function that measures the match between the true labels and those predicted by $f(\cdot; \theta)$. In contrast, as aforementioned, fine-grained recognition aims to accurately classify instances of different subordinate categories of a certain meta-category, i.e.,

$$(x, y') \in \mathcal{X} \times \mathcal{Y}_c, \qquad (2.3)$$

where $y'$ denotes the fine-grained label and $\mathcal{Y}_c$ represents the label space of class $c$ as the meta-category. Therefore, the optimization objective of fine-grained recognition is

$$\min_{\theta} \ \mathbb{E}_{(x, y') \sim p_r(x, y')} \big[ L\big(y', f(x; \theta)\big) \big]. \qquad (2.4)$$
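To make the risk-minimization objectives of Eqs. (2.2) and (2.4) concrete, the following minimal PyTorch-style sketch minimizes the empirical counterpart of Eq. (2.4) with cross-entropy playing the role of the loss L; the backbone, data loader, and hyper-parameters are illustrative assumptions rather than a prescribed recipe.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_fine_grained_classifier(model: nn.Module, loader: DataLoader,
                                  num_epochs: int = 10, lr: float = 1e-3,
                                  device: str = "cuda") -> nn.Module:
    """Minimize the empirical risk of Eq. (2.4), i.e., E[L(y', f(x; theta))],
    with L chosen as cross-entropy over fine-grained sub-category labels."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                  # plays the role of L(., .)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    for _ in range(num_epochs):
        for images, fine_labels in loader:             # (x, y') pairs drawn from p_r(x, y')
            images, fine_labels = images.to(device), fine_labels.to(device)
            logits = model(images)                     # f(x; theta)
            loss = criterion(logits, fine_labels)      # L(y', f(x; theta))
            optimizer.zero_grad()
            loss.backward()                            # gradients w.r.t. theta
            optimizer.step()                           # update theta to lower the risk
    return model
```

Replacing the fine-grained labels y' with basic-level labels y recovers the generic objective of Eq. (2.2); the two settings differ only in the label space being discriminated.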

Compared with fine-grained recognition, in addition to getting the sub-category correct, fine-grained retrieval must also rank all the instances so that images belonging to the same sub-category are ranked highest, based on the fine-grained details in the query. Given an input query $x_q$, the goal of a fine-grained retrieval system is to rank all instances in a retrieval set $\Omega = \{x^{(i)}\}_{i=1}^{M}$ (whose labels $y' \in \mathcal{Y}_c$) based on their fine-grained relevance to the query. Let $S = \{s^{(i)}\}_{i=1}^{M}$ represent the similarities between $x_q$ and each $x^{(i)}$, measured via a pre-defined metric applied to the corresponding fine-grained representations, i.e., $h(x_q; \delta)$ and $h(x^{(i)}; \delta)$. Here, $\delta$ denotes the parameters of a retrieval model $h$. For the instances whose labels are consistent with the fine-grained category of $x_q$, we form them into a positive set $\mathcal{P}_q$ and obtain the corresponding $S_{\mathcal{P}}$. Then, the retrieval model $h(\cdot; \delta)$ can be trained by maximizing the ranking-based score

$$\max_{\delta} \ \frac{R(i, S_{\mathcal{P}})}{R(i, S)} \qquad (2.5)$$

w.r.t. all the query images, where $R(i, S_{\mathcal{P}})$ and $R(i, S)$ refer to the rankings of instance $i$ in $\mathcal{P}_q$ and $\Omega$, respectively.
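At test time, the retrieval model of Eq. (2.5) is used by embedding the query and every gallery image with h(.; delta) and sorting by similarity. The sketch below assumes cosine similarity as the pre-defined metric and an arbitrary embedding network; both are illustrative choices, not a prescribed setup.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_retrieval_set(embed_net, query, gallery, device: str = "cuda"):
    """Rank a retrieval set for a single query, following the notation of Eq. (2.5):
    similarities s_i are computed between h(x_q; delta) and each h(x_i; delta)."""
    embed_net = embed_net.to(device).eval()

    q = F.normalize(embed_net(query.unsqueeze(0).to(device)), dim=1)  # h(x_q; delta), shape (1, d)
    g = F.normalize(embed_net(gallery.to(device)), dim=1)             # h(x_i; delta), shape (M, d)

    similarities = (g @ q.t()).squeeze(1)        # S = {s_i}: one cosine score per gallery image
    ranked = torch.argsort(similarities, descending=True)
    return ranked, similarities                  # gallery indices, most relevant first
```

Images sharing the query's fine-grained label (the positive set P_q) should then occupy the top ranks; ranking-based scores such as the ratio in Eq. (2.5) are computed from exactly these orderings.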

2.2 Recognition Versus Retrieval

In this book, we cover two fundamental areas of fine-grained image analysis (i.e., recognition and retrieval) in order to give a comprehensive review of modern FGIA techniques.

Fine-Grained Recognition: We organize the different families of fine-grained recognition approaches into three paradigms, i.e., (1) recognition by localization-classification subnetworks, (2) recognition by end-to-end feature encoding, and (3) recognition with external information. Fine-grained recognition is the most studied area in FGIA, since recognition is a fundamental ability of most visual systems and is thus worthy of long-term continuous research.

Fine-Grained Retrieval: Based on the type of query image, we separate fine-grained retrieval methods into two groups, i.e., (1) content-based fine-grained image retrieval and (2) sketch-based fine-grained image retrieval. Compared with fine-grained recognition, fine-grained retrieval is an emerging area of FGIA in recent years, one that is attracting more and more attention from both academia and industry.

Recognition and Retrieval Differences: Both fine-grained recognition and retrieval aim to identify the discriminative, but subtle, differences between different fine-grained objects. However, fine-grained recognition is a closed-world task with a fixed number of subordinate categories. In contrast, fine-grained retrieval extends the problem to an open-world setting with unlimited sub-categories. Furthermore, fine-grained retrieval also aims to rank all the instances so that images depicting the concept of interest (e.g., the same sub-category label) are ranked highest based on the fine-grained details in the query.

Recognition and Retrieval Synergies: Advances in fine-grained recognition and retrieval have commonalities and can benefit each other. Many common techniques are shared by both fine-grained recognition and retrieval, e.g., deep metric learning methods [10, 19], multi-modal matching methods [5, 6], and the basic idea of selecting useful deep descriptors [14, 15]; a minimal sketch of one such shared technique is given below. Detailed discussions are elaborated in Sect. 5.4. Furthermore, in real-world applications, fine-grained recognition and retrieval also complement each other, e.g., retrieval techniques are able to support novel sub-category recognition by utilizing learned representations from a fine-grained recognition model [1, 13].
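As one minimal sketch of the shared toolbox mentioned above, deep metric learning with a triplet loss serves both recognition and retrieval by shaping a common embedding space. The sketch relies on PyTorch's built-in TripletMarginLoss; the margin value and the anchor/positive/negative sampling strategy are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

triplet_loss = nn.TripletMarginLoss(margin=0.2)   # margin is an illustrative choice

def metric_learning_step(embed_net, anchor, positive, negative, optimizer):
    """One optimization step of deep metric learning: pull same-sub-category pairs
    together and push different-sub-category pairs apart in the embedding space."""
    za = F.normalize(embed_net(anchor), dim=1)     # anchor: an image of sub-category c
    zp = F.normalize(embed_net(positive), dim=1)   # positive: another image of c
    zn = F.normalize(embed_net(negative), dim=1)   # negative: an image of a different sub-category

    loss = triplet_loss(za, zp, zn)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a recognition pipeline such an embedding is typically followed by a classifier, whereas in retrieval it is used directly for nearest-neighbor ranking, which is precisely why the two tasks can share it.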

2.3 Domain-Specific Applications Related to Fine-Grained Image Analysis

In the real world, deep learning-based fine-grained image analysis techniques have also been adopted in diverse domain-specific applications and show strong performance, such as clothes/shoes retrieval [8] in recommendation systems, fashion image recognition [4] on e-commerce platforms, and product recognition [13] in intelligent retail. These applications are highly related to both the retrieval and the recognition sides of FGIA. Additionally, if we move down the spectrum of granularity, in the extreme, face identification can be viewed as an instance of fine-grained recognition where the granularity is at the identity level. Moreover, person/vehicle re-identification is another fine-grained related task, which aims at determining whether two images are taken of the same specific person/vehicle; re-identification tasks thus also operate at identity granularity. In practice, these works solve the corresponding domain-specific tasks by following the motivations of FGIA, including capturing the discriminative parts of objects (faces, persons, and vehicles) [9, 18], discovering coarse-to-fine structural information [16], developing attribute-based models [3, 4], and so on. Research in these areas is also very active. We refer interested readers to the corresponding survey papers on these specific topics, e.g., [2, 17].

References
1. Ge Y, Zhang R, Wu L, Wang X, Tang X, Luo P (2019) A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5337–5345
2. Khan SD, Ullah H (2019) A survey of advances in vision-based vehicle re-identification. Comput Vis Image Underst 182:50–63
3. Liu X, Wang J, Wen S, Ding E, Lin Y (2017) Localizing by describing: attribute-guided attention localization for fine-grained recognition. In: Proceedings of the AAAI conference, pp 4190–4196
4. Liu Z, Luo P, Qiu S, Wang X, Tang X (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104
5. Mafla A, Dey S, Biten AF, Gomez L, Karatzas D (2020) Fine-grained image classification and retrieval by combining visual and locally pooled textual features. In: Proceedings of the winter conference on applications of computer vision, pp 2950–2959
6. Mafla A, Dey S, Biten AF, Gomez L, Karatzas D (2020) Multi-modal reasoning graph for scene-text based fine-grained image classification and retrieval. In: Proceedings of the winter conference on applications of computer vision, pp 4023–4033
7. Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 815–823
8. Song J, Yu Q, Song YZ, Xiang T, Hospedales TM (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In: Proceedings of the IEEE conference on computer vision, pp 5551–5560
9. Suh Y, Wang J, Tang S, Mei T, Lee KM (2018) Part-aligned bilinear representations for person re-identification. In: Proceedings of the European conference on computer vision, pp 402–419
10. Sun M, Yuan Y, Zhou F, Ding E (2018) Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European conference on computer vision, pp 834–850
11. Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD birds-200-2011 dataset. Technical report CNS-TR-2011-001
12. Wang M, Deng W (2021) Deep face recognition: a survey. Neurocomputing 429:215–244
13. Wei XS, Cui Q, Yang L, Wang P, Liu L, Yang J (2022) RPC: a large-scale and fine-grained retail product checkout dataset. SCIENCE CHINA Inf Sci 65(9):197101
14. Wei XS, Luo JH, Wu J, Zhou ZH (2017) Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Trans Image Process 26(6):2868–2881
15. Wei XS, Xie CW, Wu J, Shen C (2018) Mask-CNN: localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recogn 76:704–714
16. Wei XS, Zhang CL, Liu L, Shen C, Wu J (2018) Coarse-to-fine: a RNN-based hierarchical attention model for vehicle re-identification. In: Proceedings of the Asian conference on computer vision, pp 575–591
17. Ye M, Shen J, Lin G, Xiang T, Shao L, Hoi SCH (2020) Deep learning for person re-identification: a survey and outlook. IEEE Trans Pattern Anal Mach Intell, p 2872
18. Yin J, Wu A, Zheng WS (2020) Fine-grained person re-identification. Int J Comput Vis 128:1654–1672
19. Zheng X, Ji R, Sun X, Zhang B, Wu Y, Huang F (2019) Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In: Proceedings of the AAAI conference, pp 9291–9298

3 Benchmark Datasets

This chapter introduces benchmark datasets in the fields of fine-grained image recognition and fine-grained image retrieval by summarizing their characteristics, presenting their detailed protocols, and showing example images of these fine-grained image datasets.

3.1 Introduction

In recent years, the vision community has released many fine-grained benchmark datasets covering diverse domains, e.g., birds [2, 27, 30], dogs [12, 26], cars [14], aircraft [19], flowers [20], vegetables [9], fruits [9], foods [3], fashion [6, 11, 17], retail products [1, 32], etc. Additionally, it is worth noting that even the most popular large-scale image classification dataset, i.e., ImageNet [23], also contains fine-grained classes covering many dog and bird sub-categories. Representative images from some of these fine-grained benchmark datasets can be found in Fig. 3.1. In Table 3.1, we summarize the most commonly used image datasets and indicate their meta-category, the number of images, the number of categories, their main task, and additional available supervision, e.g., bounding boxes, part annotations, hierarchical labels, attribute labels, and text descriptions (cf. Fig. 3.2). These datasets have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing the performance of competing approaches, but also by pushing this field towards increasingly complex, practical, and challenging problems. The fine-grained bird classification dataset CUB200-2011 [30] is one of the most popular fine-grained datasets. The majority of FGIA approaches choose it for comparison with the state-of-the-art. Moreover, continuous contributions are made upon CUB200-2011 for advanced tasks, e.g., collecting text descriptions of the fine-grained images for multi-modal analysis, cf. [7, 22] and Sect. 4.4.2.


Fig. 3.1 Examples of fine-grained images belonging to different species of flowers [20]/vegetables [9], different models of cars [14] and aircraft [19] and different kinds of retail products [32]. Accurate identification of these fine-grained objects requires the extraction of discriminative, but subtle, object parts or image regions. (Best viewed in color and zoomed in)

In recent years, more challenging and practical fine-grained datasets have been proposed, e.g., iNat2017, containing different species of plants and animals [28], and RPC for retail products [32]. Novel properties of these datasets include the fact that they are large-scale, have a hierarchical structure, exhibit a domain gap, and form a long-tailed distribution. These challenges illustrate the practical requirements of FGIA in the real world and motivate new interesting research challenges (cf. Sect. 6.2). Beyond that, a series of fine-grained sketch-based image retrieval datasets, e.g., QMUL-Shoe [33], QMUL-Chair [33], QMUL-Handbag [25], SBIR2014 [16], SBIR2017 [15], Sketchy [24], and QMUL-Shoe-V2 [21], were constructed to further advance the development of fine-grained retrieval, cf. Sect. 5.3. Furthermore, some novel datasets and benchmarks, such as FG-Xmedia [8], were constructed to expand fine-grained image retrieval to fine-grained cross-media retrieval.
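To make the supervision types summarized in Table 3.1 and Fig. 3.2 concrete, the sketch below defines one possible record for a fine-grained sample; the field names and example values are invented for illustration and do not follow the file format of any particular dataset.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FineGrainedSample:
    """One annotated image, covering the supervision types listed in Table 3.1."""
    image_path: str
    label: int                                      # fine-grained sub-category index
    bbox: Optional[tuple] = None                    # "BBox": (x, y, width, height)
    part_locs: dict = field(default_factory=dict)   # "Part anno.": part name -> (x, y, visible)
    hierarchy: list = field(default_factory=list)   # "HRCHY": coarse-to-fine label path
    attributes: dict = field(default_factory=dict)  # "ATR": attribute name -> present?
    caption: Optional[str] = None                   # "Texts": natural-language description

# Example in the spirit of CUB200-2011 (all values made up for illustration)
sample = FineGrainedSample(
    image_path="images/059.California_Gull/0001.jpg",
    label=58,
    bbox=(60, 27, 325, 304),
    part_locs={"beak": (150.0, 80.0, 1)},
    attributes={"has_bill_shape::hooked": True},
    caption="A white gull with grey wings and a yellow bill.",
)
```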

3.2 Fine-Grained Recognition Datasets

In this section, we elaborate on some representative fine-grained recognition datasets, including CUB200-2011 [30], Stanford Dogs [12], Stanford Cars [14], Oxford Flowers [20], iNaturalist2017 [28], and RPC [32].

3.2.1 CUB200-2011

The CUB200-2011 dataset [30] is a fine-grained bird species dataset proposed by Caltech in 2011, and it is also the most commonly used dataset in this field. The dataset has 200 bird subordinate categories and 11,788 images (including 5,994 training images and 5,794 test images). Each image is provided with image-level labels, object bounding boxes, part key points, and bird attribute information. Example images of the dataset are shown in Fig. 3.3.

Table 3.1 Summary of popular fine-grained image datasets, organized by their major applicable topic and sorted by release time. "# images" is the total number of images in each dataset. "BBox" indicates whether a dataset provides object bounding box supervision, "Part anno." whether key part annotations are provided, "HRCHY" hierarchical labels, "ATR" attribute labels (e.g., wing color, male, female, etc.), and "Texts" whether fine-grained text descriptions of images are supplied. Several datasets are listed twice since they are commonly used in both recognition and retrieval tasks.

Recognition:
Dataset name | Year | Meta-class | # images | # categories
Oxford Flowers [20] | 2008 | Flowers | 8,189 | 102
CUB200-2011 [30] | 2011 | Birds | 11,788 | 200
Stanford Dogs [12] | 2011 | Dogs | 20,580 | 120
Stanford Cars [14] | 2013 | Cars | 16,185 | 196
FGVC Aircraft [19] | 2013 | Aircrafts | 10,000 | 100
Birdsnap [2] | 2014 | Birds | 49,829 | 500
Food101 [3] | 2014 | Food dishes | 101,000 | 101
NABirds [27] | 2015 | Birds | 48,562 | 555
DeepFashion [17] | 2016 | Clothes | 800,000 | 1,050
Food-975 | 2016 | Foods | 37,885 | 975
Fru92 [9] | 2017 | Fruits | 69,614 | 92
Veg200 [9] | 2017 | Vegetable | 91,117 | 200
iNat2017 [28] | 2017 | Plants & Animals | 857,877 | 5,089
Dogs-in-the-Wild [26] | 2018 | Dogs | 299,458 | 362
RPC [32] | 2019 | Retail products | 83,739 | 200
Products-10K [1] | 2020 | Retail products | 150,000 | 10,000
iNat2021 [29] | 2021 | Plants & Animals | 3,286,843 | 10,000

Retrieval:
Dataset name | Year | Meta-class | # images | # categories
Oxford Flowers [20] | 2008 | Flowers | 8,189 | 102
CUB200-2011 [30] | 2011 | Birds | 11,788 | 200
Stanford Cars [14] | 2013 | Cars | 16,185 | 196
SBIR2014∗ [16] | 2014 | Multiple | 1,120/7,267 | 14
DeepFashion [17] | 2016 | Clothes | 800,000 | 1,050
QMUL-Shoe∗ [33] | 2016 | Shoes | 419/419 | 1
QMUL-Chair∗ [33] | 2016 | Chairs | 297/297 | 1
Sketchy∗ [24] | 2016 | Multiple | 75,471/12,500 | 125
QMUL-Handbag∗ [25] | 2017 | Handbags | 568/568 | 1
SBIR2017∗ [15] | 2017 | Shoes | 912/304 | 1
QMUL-Shoe-V2∗ [21] | 2019 | Shoes | 6,730/2,000 | 1
FG-Xmedia† [8] | 2019 | Birds | 11,788 | 200

∗ These fine-grained sketch-based image retrieval datasets contain sketch-and-image pairs (i.e., not only images), so the numbers of sketches and images are reported separately (the number of sketches first). For "# categories", the number of meta-categories is reported.
† Besides text descriptions, FG-Xmedia also contains multiple other modalities, e.g., videos and audios.


Fig. 3.2 An example image from CUB200-2011 [30] with multiple different types of annotations, e.g., category label, part annotations (aka key point locations), object bounding box shown in green, attribute labels (i.e., "ATR"), and a text description


Fig. 3.3 Example images of the CUB200-2011 [30] dataset. We show five subordinate classes, i.e., Downy Woodpecker, Red Headed Woodpecker, Red Cockaded Woodpecker, Red Bellied Woodpecker, and Pileated Woodpecker, for comparisons

In addition, Birdsnap [2] and NABirds [27] are two fine-grained bird species datasets proposed later. Specifically, Birdsnap [2] was proposed by researchers at Columbia University in 2014, and consists of a total of 500 bird sub-categories and 49,829 images (including 47,386 training images and 2,443 test images). Similar to CUB200-2011, each image provides image-level labels, object bounding boxes, part keypoints, and bird attribute information. NABirds [27], in turn, was collected by citizen scientists and domain experts; it mainly contains 48,562 images of North American birds covering 555 sub-categories, with part annotations and bounding boxes.
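As a concrete illustration of how the official CUB200-2011 split is usually consumed, the sketch below parses the plain-text index files shipped with the standard CUB_200_2011 release (images.txt, image_class_labels.txt, train_test_split.txt); treat the exact file names and layout as an assumption to verify against your own copy.

```python
from pathlib import Path

def load_cub_split(root: str):
    """Read CUB200-2011 index files and return (train, test) lists of
    (image_path, label) pairs, where label is a 0-based class index."""
    root = Path(root)  # e.g., ".../CUB_200_2011"

    def read_pairs(name):
        # Each non-empty line looks like "<image_id> <value>"
        with open(root / name) as f:
            return dict(line.strip().split(maxsplit=1) for line in f if line.strip())

    paths = read_pairs("images.txt")               # image_id -> relative image path
    labels = read_pairs("image_class_labels.txt")  # image_id -> class id (1..200)
    is_train = read_pairs("train_test_split.txt")  # image_id -> "1" (train) / "0" (test)

    train, test = [], []
    for img_id, rel_path in paths.items():
        item = (str(root / "images" / rel_path), int(labels[img_id]) - 1)
        (train if is_train[img_id] == "1" else test).append(item)
    return train, test

# train, test = load_cub_split("/data/CUB_200_2011")
# With the official split, len(train) == 5994 and len(test) == 5794.
```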

3.2.2 Stanford Dogs

The Stanford Dogs dataset [13] is a fine-grained dog species dataset proposed by Stanford University in 2011. The dataset contains images of 120 breeds of dogs from around the world. This dataset has been built using images and annotations from ImageNet [23] for the task of fine-grained image categorization. It consists of 20,580 images, out of which 12,000 are used for training and 8,580 for test. Fine-grained categorical labels and bounding box annotations are provided for all 12,000 images. Example images of the dataset are shown in Fig. 3.4.

3.2.3 Stanford Cars

The Stanford Cars dataset [14] is a fine-grained car model dataset proposed by Stanford University in 2013. The dataset contains 16,185 images of 196 classes of cars. Each image is provided with an image-level label and an object bounding box. The data is split into 8,144 training images and 8,041 test images, where each class has been split roughly 50-50. Classes are typically at the level of "Make", "Model", and "Year". Example images of the Stanford Cars dataset are shown in Fig. 3.5. Also featuring fine-grained rigid objects, FGVC Aircraft [19] is a fine-grained aircraft dataset proposed by the University of Oxford in 2013.


Fig. 3.4 Example images of the Stanford Dogs [13] dataset. We show five subordinate classes, i.e., Blenheim Spaniel, Maltese Dog, Norfolk Terrier, Norwegian Elkhound, and Weimaraner, for comparisons



Fig. 3.5 Example images of the Stanford Cars [14] dataset. We show five subordinate classes, i.e., Ferrari 458 Italia Coupe 2012, FIAT 500 Convertible 2012, Ford Edge SUV 2012, Jaguar XK XKR 2012, and Rolls-Royce Ghost Sedan 2012, for comparisons

The dataset contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. The aircraft in each image is annotated with a tight bounding box and a hierarchical airplane model label. Aircraft models are organized in a four-level hierarchy. The four levels, from finer to coarser, are "Model", "Variant", "Family", and "Manufacturer". The data is divided into three equally sized training, validation, and test subsets. Example images of FGVC Aircraft are shown in Fig. 3.6.
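Recent torchvision releases ship ready-made loaders for several of these benchmarks; the sketch below shows how Stanford Cars and FGVC Aircraft might be instantiated, assuming a torchvision version that provides StanfordCars and FGVCAircraft and that the data has already been downloaded to the given root (the split names and annotation_level options follow torchvision's documentation, but verify them for your installed version).

```python
import torchvision.transforms as T
from torchvision import datasets

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
])

# 196 car classes at the "Make, Model, Year" level (e.g., "FIAT 500 Convertible 2012")
cars_train = datasets.StanfordCars(root="data", split="train", transform=transform)

# FGVC Aircraft: annotation_level picks a level of the label hierarchy,
# one of "variant", "family", or "manufacturer"
aircraft_train = datasets.FGVCAircraft(root="data", split="trainval",
                                       annotation_level="variant",
                                       transform=transform)

print(len(cars_train), len(aircraft_train))   # number of images in each loaded split
```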

3.2.4 Oxford Flowers

The Oxford Flowers dataset [20] is a fine-grained flower species dataset proposed by the University of Oxford in 2008. The dataset includes 8,189 images of 102 classes of flowers, and each class consists of between 40 and 258 images. The images of flowers have pose and lighting variations. In addition, there are categories that have large variations within the category, as well as several very similar categories. Example images of the dataset are shown in Fig. 3.7. As a similar dataset containing non-rigid fine-grained objects, Food101 [3] is a food dataset proposed by ETH Zürich in 2014. The dataset contains 101 food categories with 101,000 images. For each category, 250 manually reviewed test images are provided, as well as 750 training images.



Fig. 3.6 Example images of the FGVC Aircraft [19] dataset. We show five subordinate classes, i.e., A320, Boeing 707, Tu-154, Yak-42, and Cessna Citation, for comparisons


Fig. 3.7 Example images of the Oxford Flowers [20] dataset. We show five subordinate classes, i.e., Cyclamen, Frangipani, Morning Glory, Rose, and Water Lily, for comparisons



Fig. 3.8 Example images of the Food101 [3] dataset. We show five subordinate classes, i.e., Red Velvet Cake, Chicken Wings, Club Sandwich, Dumplings, and Hamburger, for comparisons

On purpose, the training images of Food101 are not cleaned, and thus still contain some amount of noise, mostly in the form of intense colors and occasionally wrong labels. Example images of the Food101 dataset are shown in Fig. 3.8. In addition, VegFru [9], another relevant fine-grained dataset, was proposed by the University of Science and Technology of China in 2017; it consists of vegetables and fruits closely associated with everyday life. In particular, all the images of VegFru are labeled hierarchically. The dataset covers vegetables and fruits of 25 upper-level categories and 292 subordinate classes, and contains more than 160,000 images in total. Among them, "vegetable" has 15 upper-level categories and 200 subordinate classes, with 91,117 images, while "fruit" has 10 upper-level categories and 92 subordinate classes, containing 69,614 images. The number of images for each sub-class varies from 200 to 2,000. Example images of VegFru are shown in Fig. 3.9.
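Hierarchically labeled datasets such as VegFru pair every subordinate class with an upper-level category; a common use is to derive coarse-level predictions (and accuracy) from a fine-grained classifier, as in the hypothetical sketch below (the class-name mapping is invented for illustration).

```python
# Hypothetical two-level label mapping in the spirit of VegFru:
# subordinate class name -> upper-level category name.
FINE_TO_COARSE = {
    "blood_orange": "citrus",
    "juicy_peach": "drupe",
    "pumpkin": "gourd",
    "balsam_pear": "gourd",
}

def coarse_accuracy(fine_predictions, fine_ground_truth):
    """Accuracy at the upper level: a prediction counts as correct whenever its
    upper-level category matches, even if the subordinate class is wrong."""
    correct = sum(
        FINE_TO_COARSE[pred] == FINE_TO_COARSE[true]
        for pred, true in zip(fine_predictions, fine_ground_truth)
    )
    return correct / max(len(fine_ground_truth), 1)

# Misclassifying pumpkin as balsam pear is still correct at the "gourd" level:
print(coarse_accuracy(["balsam_pear"], ["pumpkin"]))   # 1.0
```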

3.2.5 iNaturalist

The series of iNaturalist image datasets [10, 28, 29] are famous large-scale fine-grained datasets in the field. They feature visually similar species, captured in a wide variety of situations from all over the world. Images were collected with different camera types, have varying image quality, feature a large class imbalance, and have been verified by multiple citizen scientists. As another highlight, the iNaturalist datasets exhibit significant class imbalance, since the natural world is heavily imbalanced: some species are more abundant and easier to photograph than others.



Fig. 3.9 Example images of the VegFru [9] dataset. We show five subordinate classes, i.e., Blood Orange, Grape White, Juicy Peach, Pumpkin, and Balsam Pear, for comparisons

Therefore, they are frequently employed as test beds in the field of long-tailed recognition [18, 31, 34]. More specifically, the iNaturalist2017 dataset [28] is the earliest released iNaturalist dataset, targeting natural-world visual classification and detection. Its images were uploaded by users of the citizen science application iNaturalist (www.inaturalist.org) in 2017. It contains 13 super-classes (i.e., Plantae, Insecta, Aves, Reptilia, Mammalia, Fungi, Amphibia, Mollusca, Animalia, Arachnida, Actinopterygii, Chromista, and Protozoa) and 5,089 species, with a combined training and validation set of 675,000 images, 183,000 test images, and over 560,000 manually created bounding boxes. Example images of five super-classes of the iNaturalist2017 dataset are shown in Fig. 3.10, and we also present example images of five subordinate classes under "Amphibia" in Fig. 3.11. Regarding its class imbalance, please see Fig. 3.12 for the distribution of training images per class. Unlike traditional, researcher-collected datasets, the dataset has the opportunity to grow with the iNaturalist community (https://www.inaturalist.org/). The iNaturalist2018 dataset [10] was proposed for the "iNat Challenge" 2018 large-scale species classification competition held as part of the FGVC5 workshop at CVPR 2018 (https://sites.google.com/view/fgvc5/home). There are a total of 8,142 species in the dataset, with 437,513 training images, 24,426 validation images, and 149,394 test images.



Fig. 3.10 Example images of five super-classes, i.e., Actinopterygii, Animalia, Fungi, Plantae, and Mammalia, of the iNaturalist2017 [28] dataset


Fig. 3.11 Example images of five subordinate classes under "Amphibia", i.e., Acris Blanchardi, Ambystoma Laterale, Anaxyrus Americanus, Dendrobates Auratus, and Taricha Granulosa, for comparisons



Fig. 3.12 The distribution of training images of the iNaturalist2017 [28] dataset, which exhibits a significant class imbalance, i.e., a few classes occupy most of the data, while most classes have only a few samples

The 2017 dataset categories contained mostly species, but also had a few additional taxonomic ranks (e.g., "Genus", "Subspecies", and "Variety"); the 2018 categories are all species. The 2018 dataset contains "Kingdom", "Phylum", "Class", "Order", "Family", and "Genus" taxonomic information for all species. In 2021, iNaturalist2021 was released, a large-scale image dataset collected and annotated by community scientists. It consists of 2.7M training images, 100K validation images, and 500K test images, and represents 10K species. In addition to its overall scale, the main distinguishing feature of iNaturalist2021 is that it contains at least 152 training images for each species, while the corresponding minimum numbers of images per class in iNaturalist2017 and iNaturalist2018 are merely 9 and 2, respectively.
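The long-tailed distribution in Fig. 3.12 is easy to reproduce for any dataset that exposes per-image labels; the sketch below counts training images per class and reports a simple imbalance factor (largest class divided by smallest), under the assumption that labels are available as an iterable of integers.

```python
from collections import Counter

def class_imbalance_profile(labels):
    """Summarize how long-tailed a labeled training set is.

    `labels` is any iterable of per-image class indices, e.g. the label column
    of an annotation file or `[y for _, y in dataset]`.
    """
    counts = Counter(labels)
    sizes = sorted(counts.values(), reverse=True)   # sorted species, as in Fig. 3.12
    return {
        "num_classes": len(sizes),
        "largest_class": sizes[0],
        "smallest_class": sizes[-1],
        "imbalance_factor": sizes[0] / sizes[-1],
    }

# Toy example; for iNaturalist2017 the smallest class has only 9 training
# images, so the imbalance factor is very large.
print(class_imbalance_profile([0, 0, 0, 0, 1, 1, 2]))
```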

3.2.6 RPC

Over recent years, there has been emerging interest in integrating fine-grained recognition technologies into the retail industry. Automatic checkout (ACO), cf. Fig. 3.13, is one of the critical problems in this area, which aims to automatically generate a shopping list from images of the products to be purchased. The main challenge of this problem comes from the large-scale and fine-grained nature of the product categories, as well as the difficulty of collecting training images that reflect realistic checkout scenarios due to the continuous update of the products. Because of its significant practical and research value, Megvii Research Nanjing constructed a retail product checkout dataset (RPC) [32] in 2019 to facilitate research on ACO. RPC is designed to mimic real-world ACO scenarios.


Fig. 3.13 Illustration of the automatic checkout (ACO) application scenario. When a customer puts his/her collected products on the checkout counter, the system will automatically recognize each product and return a complete shopping list with the total price

In the context of the ACO problem, ideally the training images should be collected at the checkout counter, which would capture random combinations of multiple product instances. However, due to the large number of product categories as well as the continuous update of the stock list, it is infeasible to learn the recognition model by enumerating all product combinations. In fact, it is even impractical to assume that the checkout images cover every single product on the stock list. A more economical solution is to train the recognition system using images of each isolated product taken in a controlled environment. Once taken, those images can be reused and distributed to different deployment scenarios. Therefore, this dataset provides images of two different types. One type is taken in a controlled environment and only contains a single product, cf. Fig. 3.14. This corresponds to product images on advertisement websites and is regarded as training data. The other type represents images of user-purchased products; these images usually include multiple products with arbitrary placement, cf. Fig. 3.15, and are treated as test data. For the second type of images, RPC also provides different levels of annotations, clutter degrees, and illuminations. Furthermore, RPC is a large-scale dataset in terms of both product image quantity and product categories. It collects 200 retail stock-keeping units (SKUs) that can be divided into 17 meta-categories, and comprises 83,739 images, including 53,739 single-product example images as the training set and 30,000 checkout images as the val/test set. In addition, the checkout images are provided with three different types of annotations, representing weak to strong supervision, as shown in Fig. 3.16. The weakest level of annotation is the shopping list, which records the SKU category and count of each product instance in the checkout image. The middle level is point-level annotation, which provides the central position and the SKU category of each product in the checkout image. The strongest level of annotation is bounding boxes, which provide bounding boxes and SKU categories for these products. Compared with existing datasets, RPC is closer to the realistic setting w.r.t. ACO and can derive a variety of research problems [32], e.g., multi-category counting, few-shot/weakly-supervised/fully-supervised object detection, online learning, etc.
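To illustrate the three supervision levels, the snippet below shows what one hypothetical checkout-image annotation might look like when loaded into Python dictionaries; the field names and values are illustrative assumptions, not the official RPC annotation schema.

```python
# Hypothetical annotation record for a single RPC-style checkout image,
# shown at the three supervision levels described above.
checkout_image = "checkout_00001.jpg"  # illustrative file name

# Weakest level: shopping list (SKU category -> instance count).
shopping_list = {"instant_noodles_A": 2, "canned_fish_B": 1}

# Middle level: point-level annotation (one center point + SKU per instance).
points = [
    {"sku": "instant_noodles_A", "x": 312, "y": 208},
    {"sku": "instant_noodles_A", "x": 540, "y": 430},
    {"sku": "canned_fish_B", "x": 115, "y": 367},
]

# Strongest level: bounding boxes (x, y, width, height) + SKU per instance.
boxes = [
    {"sku": "instant_noodles_A", "bbox": [270, 160, 96, 110]},
    {"sku": "instant_noodles_A", "bbox": [498, 380, 92, 104]},
    {"sku": "canned_fish_B", "bbox": [78, 320, 80, 95]},
]

# The weaker annotations can be derived from the stronger ones:
derived_list = {}
for b in boxes:
    derived_list[b["sku"]] = derived_list.get(b["sku"], 0) + 1
assert derived_list == shopping_list
```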


Fig. 3.14 Example images of training data in the RPC [32] dataset, covering Canned Food, Dessert, Alcohol, Puffed Food, and Drink

Fig. 3.15 Sampled checkout images (as test data) of three clutter levels, i.e., easy, medium, and hard modes, in the RPC [32] dataset


Fig. 3.16 Weak to strong supervision of the RPC dataset [32]: from shopping list, points, to bounding boxes. All checkout images in RPC are labeled with these three levels of annotations

3.3 Fine-Grained Retrieval Datasets

In this section, we elaborate on some representative fine-grained retrieval datasets, including DeepFashion [17], SBIR [15, 16], the QMUL datasets [21, 25, 33], and FG-Xmedia [8]. In addition, CUB200-2011 [30], Stanford Cars [14], and Oxford Flowers [20] are also commonly employed for retrieval tasks.

3.3.1 DeepFashion

The DeepFashion dataset [17] is a large-scale clothes dataset proposed by the Chinese University of Hong Kong in 2016. DeepFashion contains over 800,000 diverse fashion images, ranging from well-posed shop images to unconstrained consumer photos, constituting a large-scale visual fashion analysis dataset. It is annotated with rich information on clothing items: each image in this dataset is labeled according to 50 categories and 1,000 descriptive attributes, as well as with bounding boxes and clothing landmarks. DeepFashion also contains over 300,000 cross-pose/cross-domain image pairs, which serve as a benchmark for fine-grained image retrieval. Additionally, four benchmarks are developed based on the DeepFashion dataset, including Attribute Prediction, Consumer-to-Shop Clothes Retrieval, In-shop Clothes Retrieval, and Landmark Detection. The data and annotations of these benchmarks can also be employed as the training and test sets for computer vision tasks such as clothes detection, clothes recognition, and image retrieval. Examples of the Consumer-to-Shop Clothes Retrieval benchmark are shown in Fig. 3.17.


Fig. 3.17 Example image-pairs of the Consumer-to-Shop Clothes Retrieval benchmark of the DeepFashion [17] dataset. We here present three subordinate classes, i.e., Sleeveless Dress, Blouseid, Shirt, for comparisons

3.3.2 SBIR

Sketches are intuitive and descriptive. As a form of query, they convey more detailed visual cues than pure text. SBIR2014 [16] is the first fine-grained sketch-based image retrieval dataset, proposed by Queen Mary University of London (QMUL) in 2014. It sampled sketches from the 20,000-sketch dataset of [4] and images from the corresponding categories in the PASCAL VOC dataset [5]. SBIR2014 has 14 categories and consists of 1,120 sketches and 7,267 images. The entire dataset is split into training and test sets of equal size. Examples of the dataset are shown in Fig. 3.18. To enable quantitative evaluation, 6 sketches and 60 images in each category are sampled from the full test set. Sketch-image pairs (14 × 6 × 60 = 5,040 pairs in total) are manually annotated for their similarity in terms of four independent criteria: viewpoint, zoom, configuration, and body feature. Furthermore, following SBIR2014, QMUL proposed a fine-grained SBIR shoe dataset (SBIR2017) [15] with 912 sketches and 304 images in 2017. Each image has three sketches corresponding to various drawing styles. Each sketch and image is annotated with its semantic parts and associated part-level attributes. Examples of the SBIR2017 dataset are shown in Fig. 3.19.

3.3.3 QMUL Datasets

The QMUL datasets [21, 25, 33] are a series of fine-grained instance-level SBIR datasets, covering fine-grained objects in terms of “shoes”, “chairs”, and “handbags”, proposed by Queen Mary University of London (QMUL).

Fig. 3.18 Example image-pairs of the SBIR2014 [16] dataset, including categories such as Horse, Bus, Standing Bird, Sheep, Train, Airplane, Motorbike, Cow, and Cat

Fig. 3.19 Examples of sketch-image pairs in the SBIR2017 [15] dataset

Among them, QMUL-Shoe/Chair [33] comprises two datasets, one for shoes and the other for chairs. There are 1,432 sketches and photos in total (i.e., 716 sketch-photo pairs); the Shoe dataset has 419 sketch-photo pairs and the Chair dataset 297 pairs. Examples of the QMUL-Shoe/Chair dataset are shown in Fig. 3.20. QMUL-Handbag [25] is a dataset collected by QMUL following a protocol similar to that of the aforementioned datasets (photos from online catalogs and crowd-sourced sketches), resulting in 568 sketch-photo pairs. Handbags were specifically chosen to make the sketch-photo retrieval task more challenging, since handbags exhibit more complex visual patterns and have more deformable bodies than shoes and chairs. Among the 568 pairs, 400 are used for training and the rest for testing. Beyond that, the Shoe-V2 and Chair-V2 datasets [21] are extended versions of QMUL-Shoe/Chair. There are 2,000 photos and 6,648 sketches in Shoe-V2, and 400 photos and 1,275 sketches in Chair-V2 (Fig. 3.21).


Fig. 3.20 Example sketch-photo pairs of the QMUL-Shoe/Chair [33] dataset

Fig. 3.21 Example sketch-photo pairs of the QMUL-Handbag [25] dataset

3.3.4 FG-Xmedia

In the era of big data, multimedia data has become the main form through which humans perceive the world. Cross-media retrieval is an effective retrieval paradigm in which users can obtain results of various media types by submitting a query of any media type. The FG-Xmedia dataset [8] is a fine-grained cross-media retrieval dataset proposed by Peking University in 2019. The dataset consists of four media types, including image, text, video, and audio, and contains 200 fine-grained subcategories that belong to the meta-category “Bird”. Regarding the multi-media data of FG-Xmedia, CUB200-2011 [30] and YouTube Birds [36] constitute the corresponding image and video modalities of the dataset, respectively.


Fig. 3.22 Examples of four modalities (image, text, video and audio) of the FG-Xmedia [8] dataset. We present four subordinate classes, i.e., Glaucous-winged Gull, Slaty-backed Gull, California Gull, and Herring Gull for comparisons

The text and audio modalities are collected from professional websites, including Wikipedia (http://www.wikipedia.org/), xeno-canto (http://www.xeno-canto.org/), and Bird-sounds (http://www.bird-sounds.net/). The scale of each media type in this benchmark is large, i.e., 11,788 images, 8,000 texts, 18,350 videos, and 2,000 audio clips. For text, there are 40 instances of each subcategory, and for audio, there are 60 instances of each subcategory. Examples of the dataset are shown in Fig. 3.22.

References
1. Bai Y, Chen Y, Yu W, Wang L, Zhang W (2020) Products-10K: a large-scale product recognition dataset. arXiv:2008.10545
2. Berg T, Liu J, Lee SW, Alexander ML, Jacobs DW, Belhumeur PN (2014) Birdsnap: large-scale fine-grained visual categorization of birds. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2019–2026
3. Bossard L, Guillaumin M, Gool LV (2014) Food-101 – mining discriminative components with random forests. In: Proceedings of European conference on computer vision, pp 446–461
4. Eitz M, Hays J, Alexa M (2012) How do humans sketch objects? ACM Trans Graph (TOG) 31(4):1–10
5. Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The PASCAL visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338
6. Ge Y, Zhang R, Wu L, Wang X, Tang X, Luo P (2019) A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5337–5345


7. He X, Peng Y (2017) Fine-grained image classification via combining vision and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5994–6002
8. He X, Peng Y, Liu X (2019) A new benchmark and approach for fine-grained cross-media retrieval. In: Proceedings of the ACM international conference multimedia, pp 1740–1748
9. Hou S, Feng Y, Wang Z (2017) VegFru: a domain-specific dataset for fine-grained visual categorization. In: Proceedings of the IEEE international conference on computer vision, pp 541–549
10. iNaturalist Competition (2018). https://sites.google.com/view/fgvc5/competitions/inaturalist
11. Jia M, Shi M, Sirotenko M, Cui Y, Cardie C, Hariharan B, Adam H, Belongie S (2020) Fashionpedia: ontology, segmentation, and an attribute localization dataset. In: Proceedings of the European conference on computer vision, pp 316–332
12. Khosla A, Jayadevaprakash N, Yao B, Fei-Fei L (2011) Novel dataset for fine-grained image categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Workshop on fine-grained visual categorization, pp 806–813
13. Khosla A, Jayadevaprakash N, Yao B, Fei-Fei L (2011) Novel dataset for fine-grained image categorization: stanford dogs. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Workshop on fine-grained visual categorization, vol 2. Citeseer
14. Krause J, Stark M, Deng J, Fei-Fei L (2013) 3D object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision. Workshop on 3D representation and recognition
15. Li K, Pang K, Song YZ, Hospedales TM, Xiang T, Zhang H (2017) Synergistic instance-level subspace alignment for fine-grained sketch-based image retrieval. IEEE Trans Image Process 26(12):5908–5921
16. Li Y, Hospedales TM, Song YZ, Gong S (2014) Fine-grained sketch-based image retrieval by matching deformable part models. In: Proceedings of British machine vision conference, pp 1–12
17. Liu Z, Luo P, Qiu S, Wang X, Tang X (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104
18. Liu Z, Miao Z, Zhan X, Wang J, Gong B, Yu SX (2019) Large-scale long-tailed recognition in an open world. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2537–2546
19. Maji S, Rahtu E, Kannala J, Blaschko M, Vedaldi A (2013) Fine-grained visual classification of aircraft. arXiv:1306.5151
20. Nilsback ME, Zisserman A (2008) Automated flower classification over a large number of classes. In: Indian conference on computer vision graphics and image processing, pp 722–729
21. Pang K, Li K, Yang Y, Zhang H, Hospedales TM, Xiang T, Song YZ (2019) Generalising fine-grained sketch-based image retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 677–686
22. Reed S, Akata Z, Lee H, Schiele B (2016) Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 49–58
23. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115:211–252
24. Sangkloy P, Burnell N, Ham C, Hays J (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans Graph 35(4):1–12
25. Song J, Yu Q, Song YZ, Xiang T, Hospedales TM (2017) Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In: Proceedings of the IEEE international conference on computer vision, pp 5551–5560


26. Sun M, Yuan Y, Zhou F, Ding E (2018) Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European conference on computer vision, pp 834–850
27. Van Horn G, Branson S, Farrell R, Haber S, Barry J, Ipeirotis P, Perona P, Belongie S (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 595–604
28. Van Horn G, Mac Aodha O, Song Y, Cui Y, Sun C, Shepard A, Adam H, Perona P, Belongie S (2017) The iNaturalist species classification and detection dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8769–8778
29. Van Horn G, Cole E, Beery S, Wilber K, Belongie S, Mac Aodha O (2021) Benchmarking representation learning for natural world image collections. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12884–12893
30. Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD birds-200-2011 dataset. Technical report CNS-TR-2011-001
31. Wang X, Lian L, Miao Z, Liu Z, Yu SX (2021) Long-tailed recognition by routing diverse distribution-aware experts. In: Proceedings of international conference on learning representations, pp 1–15
32. Wei XS, Cui Q, Yang L, Wang P, Liu L, Yang J (2022) RPC: a large-scale and fine-grained retail product checkout dataset. SCIENCE CHINA Inf Sci 65(9):197101
33. Yu Q, Liu F, Song YZ, Xiang T, Hospedales TM, Loy CC (2016) Sketch me that shoe. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 799–807
34. Zhou B, Cui Q, Wei XS, Chen ZM (2020) BBN: bilateral-branch network with cumulative learning for long-tailed visual recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9719–9728
35. Zhou F, Lin Y (2016) Fine-grained image classification by exploring bipartite-graph labels. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1124–1133
36. Zhu C, Tan X, Zhou F, Liu X, Yue K, Ding E, Ma Y (2018) Fine-grained video categorization with redundancy reduction attention. In: Proceedings of the European conference on computer vision, pp 136–152

4 Fine-Grained Image Recognition

In this chapter, we systematically sort out deep learning based approaches to fine-grained image recognition under different learning paradigms, including recognition by localization-classification subnetworks, recognition by end-to-end feature encoding, and recognition with external information. Moreover, we elaborate on several representative works of each research branch. Finally, a summary discussion of fine-grained image recognition is presented.

4.1 Introduction

Fine-grained image recognition has been by far the most active research area of FGIA in the past decade. Fine-grained recognition aims to discriminate numerous visually similar subordinate categories that belong to the same basic category, such as fine distinctions among animal species [136], cars [73], fruits [57], aircraft models [100], and so on. It has been frequently applied in real-world tasks, e.g., ecosystem conservation (recognizing biological species) [1], intelligent retail systems [152, 156], etc. Recognizing fine-grained categories is difficult due to the challenges of discriminative region localization and fine-grained feature learning. Researchers have attempted to deal with these challenges from diverse perspectives. In this chapter, we review the main fine-grained recognition approaches since the advent of deep learning.



Fig. 4.1 Chronological overview of representative deep learning based fine-grained recognition methods which are categorized by different learning approaches. (Best viewed in color)

Broadly, existing fine-grained recognition approaches can be organized into the following three main paradigms:

• Recognition by localization-classification subnetworks;
• Recognition by end-to-end feature encoding;
• Recognition with external information.

Among them, the first two paradigms restrict themselves to only utilizing the supervision associated with fine-grained images, such as image labels, bounding boxes, part annotations, etc. To further resolve ambiguous fine-grained problems, there is a body of work that uses additional information, such as where and when the image was taken [15, 99], web images [106, 129], or text descriptions [46, 114]. In order to present these representative deep learning based fine-grained recognition methods intuitively, we show a chronological overview in Fig. 4.1 by organizing them into the three aforementioned paradigms. For performance evaluation, when the test set is balanced (i.e., there is a similar number of test examples from each class), the most commonly used metric in fine-grained recognition is classification accuracy across all subordinate categories of the datasets. It is defined as

$$\text{Accuracy} = \frac{|I_{\mathrm{correct}}|}{|I_{\mathrm{total}}|}\,, \tag{4.1}$$

where $|I_{\mathrm{total}}|$ represents the number of images across all sub-categories in the test set and $|I_{\mathrm{correct}}|$ represents the number of images that are correctly categorized by the model.
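As a quick illustration of Eq. (4.1), the following minimal Python sketch computes this accuracy from lists of predicted and ground-truth labels; the variable names and example classes are only illustrative.

```python
def top1_accuracy(predictions, labels):
    """Classification accuracy as in Eq. (4.1): |I_correct| / |I_total|."""
    assert len(predictions) == len(labels) and len(labels) > 0
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Illustrative usage with made-up subordinate-category predictions.
preds = ["Herring Gull", "California Gull", "Herring Gull", "Slaty-backed Gull"]
gts   = ["Herring Gull", "California Gull", "Glaucous-winged Gull", "Slaty-backed Gull"]
print(top1_accuracy(preds, gts))  # 0.75
```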

4.2 Recognition by Localization-Classification Subnetworks

Researchers have attempted to create models that capture the discriminative semantic parts of fine-grained objects and then construct a mid-level representation corresponding to these parts for the final classification, cf. Fig. 4.2. More specifically, a localization subnetwork is designed for locating key parts, and then the corresponding part-level (local) feature vectors are obtained.


Fig. 4.2 Illustration of the high-level pipeline of the fine-grained recognition by localization-classification subnetworks paradigm

This is usually combined with object-level (global) image features for representing fine-grained objects, and is followed by a classification subnetwork that performs recognition. The framework formed by these two collaborative subnetworks constitutes the first paradigm, i.e., fine-grained recognition with localization-classification subnetworks. The motivation for these models is to first find the corresponding parts and then compare their appearance. Concretely, it is desirable to capture semantic parts (e.g., heads and torsos) that are shared across fine-grained categories, and to discover the subtle differences between these part representations. Existing methods in this paradigm can be divided into four broad types: (1) employing detection or segmentation techniques, (2) utilizing deep filters, (3) leveraging attention mechanisms, and (4) other methods.
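To make the two-subnetwork pipeline concrete, the following PyTorch-style sketch (an illustrative assumption, not the architecture of any particular paper) shows how a localization branch that predicts part boxes can be combined with part-level and object-level features before a final classifier.

```python
import torch
import torch.nn as nn
import torchvision

class LocalizationClassificationNet(nn.Module):
    """Minimal sketch of the localization-classification paradigm: a
    localization branch proposes part boxes, part-level features are pooled
    from them, concatenated with a global feature, and classified."""

    def __init__(self, num_classes, num_parts=2):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # 512-channel map
        self.num_parts = num_parts
        # Localization subnetwork: predicts (cx, cy, w, h) in [0, 1] per part.
        self.localizer = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(512, 4 * num_parts), nn.Sigmoid())
        # Classification subnetwork over concatenated global + part features.
        self.classifier = nn.Linear(512 * (1 + num_parts), num_classes)

    def forward(self, x):
        fmap = self.features(x)                  # (B, 512, h, w)
        global_feat = fmap.mean(dim=(2, 3))      # object-level (global) feature
        boxes = self.localizer(fmap).view(-1, self.num_parts, 4)
        B, _, h, w = fmap.shape
        part_feats = []
        for p in range(self.num_parts):
            cx, cy, bw, bh = boxes[:, p].unbind(dim=1)
            feats = []
            for b in range(B):
                # Crop the predicted part region from the feature map (per sample).
                x0 = int((cx[b] - bw[b] / 2).clamp(0, 1) * (w - 1))
                x1 = int((cx[b] + bw[b] / 2).clamp(0, 1) * (w - 1)) + 1
                y0 = int((cy[b] - bh[b] / 2).clamp(0, 1) * (h - 1))
                y1 = int((cy[b] + bh[b] / 2).clamp(0, 1) * (h - 1)) + 1
                feats.append(fmap[b, :, y0:y1, x0:x1].mean(dim=(1, 2)))  # part feature
            part_feats.append(torch.stack(feats))
        joint = torch.cat([global_feat] + part_feats, dim=1)
        return self.classifier(joint)

# Illustrative usage on a dummy batch.
model = LocalizationClassificationNet(num_classes=200)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 200])
```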

4.2.1 Employing Detection or Segmentation Techniques

It is straightforward to employ detection or segmentation techniques [37, 95, 115] to locate key image regions corresponding to fine-grained object parts, e.g., bird heads, bird tails, car lights, dog ears, dog torsos, etc. Thanks to the localization information, i.e., part-level bounding boxes or segmentation masks, the model can obtain more discriminative mid-level (part-level) representations w.r.t. these parts. This further enhances the learning capability of the classification subnetwork and thus significantly boosts the final recognition accuracy. Earlier works in this paradigm made use of additional dense part annotations (aka key point localization, cf. Fig. 3.2 on the left) to locate semantic key parts of objects. For example, Branson et al. [6] proposed to use groups of detected part keypoints to compute multiple warped image regions and further obtained the corresponding part-level features by pose normalization.


In the same period, Zhang et al. [178] first generated part-level bounding boxes based on ground-truth part annotations, and then trained an R-CNN [37] model to perform part detection. Di et al. [84] further proposed a Valve Linkage Function, which not only connects all subnetworks, but also refines localization according to the part alignment results. In order to integrate both semantic part detection and abstraction, SPDA-CNN [176] designed a top-down method to generate part-level proposals by inheriting prior geometric constraints and then used a Faster R-CNN [115] to return part localization predictions. Other approaches made use of segmentation information. PS-CNN [59] and Mask-CNN [154] employed segmentation models to obtain part/object masks to aid part/object localization. Compared with detection techniques, segmentation can result in more accurate part localization [154], as segmentation focuses on finer pixel-level targets instead of just coarse bounding boxes. However, employing traditional detectors or segmentation models requires dense part annotations for training, which is labor-intensive and limits both the scalability and practicality of real-world fine-grained applications. Therefore, it is desirable to accurately locate fine-grained parts by only using image-level labels [36, 47, 92, 149, 181]. This set of approaches is referred to as “weakly supervised”, as they only use image-level labels. It is interesting to note that since 2016 there has been an apparent trend of developing fine-grained methods in this weakly-supervised setting, rather than the strongly-supervised setting (i.e., using part annotations and bounding boxes), cf. Table 4.1. Recognition methods in the weakly-supervised localization based classification setting often rely on unsupervised approaches to obtain semantic groups which correspond to object parts. Specifically, Zhang et al. [181] adopted the spatial pyramid strategy [77] to generate part proposals from object proposals. Then, by using a clustering approach, they generated part proposal prototype clusters and further selected useful clusters to obtain discriminative part-level features. Co-segmentation [42] based methods are also commonly used in this weakly-supervised case. One approach is to use co-segmentation to obtain object masks without supervision, and then perform heuristic strategies, e.g., part constraints [47] or part alignment [71], to locate fine-grained parts. It is worth noting that the majority of previous works overlook the internal semantic correlation among discriminative part-level features. Concretely, the aforementioned methods pick out the discriminative regions independently and utilize their features directly, while neglecting the fact that an object’s features are mutually semantically correlated and that region groups can be more discriminative. Therefore, very recently, some methods attempt to jointly learn the interdependencies among part-level features to obtain more universal and powerful fine-grained image representations. By performing different feature fusion strategies (e.g., LSTMs [36, 74], graphs [149], or knowledge distilling [92]), these joint part feature learning methods yield significantly higher recognition accuracy than previous independent part feature learning methods. In the following, we elaborate on three representative methods in this research direction.

Table 4.1 Comparative fine-grained recognition results of two learning paradigms (cf. Sects. 4.2 and 4.3) on the fine-grained benchmark datasets, i.e., Birds (CUB200-2011 [140]), Dogs (Stanford Dogs [67]), Cars (Stanford Cars [73]), and Aircrafts (FGVC Aircraft [100]). Note that “Train anno.” and “Test anno.” indicate which supervision signals are used in the training and test phases, respectively, and the symbol “–” means the results are unavailable. For each method, the table reports the venue it was published in, the annotations it requires for training and testing, its backbone network, its input image resolution, and its classification accuracy on the four datasets. The methods of the localization-classification paradigm are grouped into employing detection or segmentation techniques, utilizing deep filters, leveraging attention mechanisms, and others, while the methods of the end-to-end feature encoding paradigm are grouped into high-order feature interactions, specific loss functions, and others.


4.2.1.1 Mask-CNN
Mask-CNN [154] proved that selecting useful deep descriptors contributes substantially to fine-grained image recognition, and it is the first model that selects deep convolutional descriptors for object recognition, especially for fine-grained image recognition. Mask-CNN [154] employs a fully convolutional network (FCN) [95] without fully connected layers in order to both locate the discriminative parts (e.g., head and torso) and, more importantly, generate weighted object/part masks for selecting useful and meaningful convolutional descriptors. Based on that, a three-stream Mask-CNN model is built for aggregating the selected object- and part-level descriptors simultaneously. The three streams correspond to the whole image, the head, and the torso image patches, respectively. The framework of Mask-CNN is shown in Fig. 4.3.

Fig. 4.3 Overall framework of the Mask-CNN method [154]: (a) inputs, (b) CNN without fully connected layers, (c) convolutional activation tensor, (d) descriptor selection, (e) weighted aggregation and concatenation, and (f) classification

Fig. 4.4 Demonstration of the mask learning procedure by fully convolutional network (FCN) [95]: (a) the input, (b) FCN, and (c) the mask ground truth

predicted masks for both part localization and descriptor selection. Mask-CNN also combines the two masks to form a mask for the whole object, which is called the object mask. After obtaining the object and part masks, Mask-CNN builds the three-stream model for joint training. The overall architecture of the model is presented in Fig. 4.3. In the following, we take the whole image stream as an example to illustrate the pipeline of each stream of Mask-CNN. In concretely, the inputs of the whole image stream are the original images resized to h × h. The input images are fed into a traditional convolutional neural network, but the fully connected layers are discarded. In its implementation, Mask-CNN uses VGG-16 [122] as the baseline model, and the layers before pool5 are kept (including pool5 ). A 7 × 7 × 512 activation tensor in pool5 is obtained if the input image is 224 × 224. Therefore, 49 deep convolutional descriptors of 512-d are obtained, which also correspond to 7 × 7 spatial positions in the input images. Then, the learned object mask is firstly resized to 7 × 7 by the bilinear interpolation and then used for selecting useful and meaningful deep descriptors. As illustrated in Fig. 4.3c, d, the descriptor should be kept by weights when it locates in the object region. If it locates in the background region, that descriptor will be discarded. In Mask-CNN, the mask contains the learned part/object segmentation scores, which is a real matrix whose elements are in the range of [0, 1]. Correspondingly, 1 stands for absolutely keeping and 0 is for absolutely discarding. Mask-CNN implements the selection process as an element-wise product operation between the convolutional activation tensor and the mask matrix. Therefore, the descriptors located in the object region will remain by weights, while the other descriptors will become zero vectors. Concretely, if the pixels are predicted as head/torso by FCN, the real values of the mask are kept. Otherwise, if the pixels indicate the regions are background, the value of these background regions in the mask is reset to zero value. Then, the processed masks are used for selecting descriptors and the rest processing. For these selected descriptors, in the end-to-end Mask-CNN learning process, MaskCNN aggregates them by both average- and max-pooling into two 512-d feature vectors, respectively. Then, the 2 -normalization is followed for each of them. After that, MaskCNN concatenates them into a 1024-d feature as the final representation of the whole image stream. The streams for the head and torso have similar processing steps as the whole image one. However, different from the inputs of the whole image stream, Mask-CNN generates the


input images of the head and torso streams as follows. After obtaining the two part masks, Mask-CNN uses the part masks as the part detectors to localize the head part and torso part in the input images. For each part, the smallest rectangle bounding box which contains the part mask regions are returned. Based on the rectangle bounding box, Mask-CNN crops the image patch which acts as the inputs of the part stream. The last two streams of Fig. 4.3 show the head and torso streams in Mask-CNN. In the classification step shown in Fig. 4.3f, the final 3,072-d image representation is the concatenation of the whole image, the head and the torso features. The three-stream MaskCNN is learned end-to-end, with the parameters of three CNNs learned simultaneously. During training Mask-CNN, the parameters of the learned FCN segmentation network are fixed. During inference, when faced with a test image, the learned FCN model first returns the corresponding mask predictions for both the head and torso. Then, based on the masks, Mask-CNN uses them as the part detectors to localize the head part and torso part in the input images. The extracted head and torso image patches are regarded as the inputs for the head and torso streams in Mask-CNN. After obtaining the convolutional descriptors through the convolution layers of three-stream Mask-CNN, the predicted masks are employed again. While, at this time, the masks are utilized for selecting descriptors. At last, the selected descriptors are aggregated following the strategy in the training stage, and then the predicted label based on the 3,072-d final image representation is generated.

4.2.1.2 Fine-Grained Recognition Without Part Annotations Krause et al. [71] proposed a method for fine-grained recognition that uses no part annotations, based on generating parts using co-segmentation [139] and alignment. At the core of this method for generating parts is the concept of alignment by segmentation, the process of aligning images is via aligning their figure-ground segmentation masks. The key insight is that, even for complicated and deformable objects such as birds, a figure-ground segmentation (cf. Fig. 4.5b) is often sufficient in order to determine an object’s pose and localize its parts. The work of [71] decomposes the process of aligning all images as aligning pairs of images with similar poses, which they represent in a graph (Fig. 4.5c), producing a global

Fig. 4.5 An overview of the method proposed by Krause et al. [71] to generate parts used for fine-grained recognition: (a) input, (b) co-segmentation, (c) pose graph (k-MSTs with CNN distance), (d) alignment (aligning edges of the pose graph with segmentations and propagating the alignment along the pose graph), and (e) output parts (points expanded to regions tight around the segmentation)

More specifically, co-segmentation is first used to split each image into foreground and background. Let $\theta_f^i$ be a foreground color model for an image $I$, represented as a Gaussian mixture model; $\theta_b^i$ be a similar background model; and $\theta_f^c$ be a shared foreground color model for class $c$. The binary assignment of pixel $p$ in image $I$ to either foreground or background is denoted $x_p^i$, its corresponding RGB value is $z_p^i$, the set of segmentation assignments across all images is $\mathcal{X}$, and $p_f$ is a pixel-wise foreground prior. The co-segmentation objective is

$$\max_{\mathcal{X},\,\theta} \sum_i \left( \sum_p E\left(x_p^i, \theta^i, \theta_f^c; p_f^i\right) + \sum_{p,q} E\left(x_p^i, x_q^i\right) \right), \tag{4.2}$$

where

$$E\left(x_p^i, \theta^i, \theta_f^c; p_f^i\right) = \left(1 - x_p^i\right)\log p\left(z_p^i; \theta_b^i\right) + \frac{x_p^i}{2}\left[\log p\left(z_p^i; \theta_f^i\right) + \log p\left(z_p^i; \theta_f^c\right)\right] + E\left(x_p^i; p_f\right), \tag{4.3}$$

$$E\left(x_p^i; p_f\right) = \begin{cases} \log p_f & \text{if } x_p^i = 1 \\ \log\left(1 - p_f\right) & \text{if } x_p^i = 0 \end{cases}, \tag{4.4}$$

and $E\left(x_p^i, x_q^i\right)$ is the standard pairwise term between pixels $p$ and $q$ for a GrabCut [116] segmentation model, enforcing consistency between neighboring pixels with respect to their assigned binary foreground/background values. If $p_f = 0.5$ and $\theta_f^i = \theta_f^c$, then this is equivalent to GrabCut, and if only $p_f = 0.5$, then it reduces to the “image + class” model of [42] without the learned per-term weights. Then, optimization is performed separately for each fine-grained class $c$, and proceeds by iteratively updating the appearance models $\theta_f^i$, $\theta_b^i$, $\theta_f^c$ and optimizing the foreground/background masks $x^i$. As is standard in a GrabCut formulation, [71] initializes the appearance models using the provided bounding boxes, with the pixels inside each bounding box marked as foreground and the rest as background. This initial background remains fixed as background throughout the optimization. In order to obtain better co-segmentation results, [71] uses a foreground refinement. Concretely, they constrain the ratio of a foreground segmentation to the total area of its bounding box to be between $\omega_1$ and $\omega_2$, and require it to span at least $\rho$ of the box's width and height. To satisfy these constraints, [71] performs a binary search over the pixel-wise foreground prior $p_f$, cf. Eq. (4.4), on a per-image basis after the initial segmentation until the constraints are satisfied, initializing with $p_f = 0.5$. In the experiments, they set $\omega_1 = 10\%$, $\omega_2 = 90\%$, and $\rho = 50\%$ [71]. After co-segmentation, the next step is choosing images to align. However, it is a difficult problem to align two objects with arbitrary poses, so [71] chooses images with similar poses to align, decomposing the global task of aligning all training images into many smaller, simpler tasks of aligning images containing objects of similar poses. [71] formalizes this


requirement as building a connected graph $G$ of images $\{I_i\}_{i=1}^{n}$, where each edge $(I_i, I_j)$ is between two images containing objects of similar poses to be aligned. The similarity of two images is measured by the cosine distance between features around each bounding box. To reduce the variance in alignment, they furthermore require that each image $I_i \in G$ be connected to at least $k$ other images, aggregating all image-to-image alignments from the neighbors of $I_i$ to increase robustness. Because $G$ represents a graph of pose similarity, they refer to it as a pose graph, cf. Fig. 4.5c. Using this distance metric, a graph $G$ satisfying the constraints is constructed by iteratively computing disjoint minimum spanning trees of the images and merging the trees into a single graph. Concretely, [71] decomposes the pose graph as $G = \bigcup_{i=1}^{k} M_i$, where $M_1$ is the minimum spanning tree of the dense graph $G_D$ on all $n$ images with edge weights given by cosine distance, and $M_j$ is the minimum spanning tree of $G_D \setminus \bigcup_{i=1}^{j-1} M_i$, which can be computed by setting the weights of all edges used in $M_1, \ldots, M_{j-1}$ to infinity. Since minimum spanning trees are connected, $G$ is connected, and since $G$ is composed of $k$ disjoint minimum spanning trees, each node in $G$ is connected to at least $k$ other vertices, satisfying the constraints. Given a pose graph connecting images with similar poses, [71] first samples a large set of points in one image, representing the overall shape of an object, and then propagates these points to all images using the structure of the graph. Concretely, [71] samples a set of points of size $k_1$ on the segmented foreground of a single image $I_r$, which they choose to be the image with the highest degree in $G$. Then, while there is still at least one image that the points have not been propagated to, they propagate to the image $I_j$ adjacent to the largest number of images in $G$ which have already been propagated to. Let $\tau_{i,j}: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ be a dense alignment function mapping a point in image $I_i$ to its corresponding point in image $I_j$, which is learned based on the segmentations for $I_i$ and $I_j$. Then, to propagate each of the $k_1$ points, [71] uses $\tau_{i,j}$ to propagate the corresponding point from each image $I_i$ adjacent to $I_j$ in $G$ and aggregates these separate propagated points via an aggregation function $\alpha$, which they take to be the median of the points propagated from each adjacent image. After globally aligning all images, the problem remains of using the alignment to generate parts for use in recognition. Specifically, [71] selects a subset of the propagated points of size $k_2$ to be expanded into parts for recognition. [71] does this by clustering the trajectories of the $k_1$ points across all images, i.e., they represent each point by its $2 \times n$-dimensional trajectory across all images, then cluster each of these trajectories via k-means into $k_2$ clusters, providing a good spread of points across the foreground of each image, cf. Fig. 4.5d. After that, [71] generates a single part from each of these $k_2$ points by taking an area around each point with a fixed size with respect to the object’s bounding box, then shrinking the region until it is tight around the estimated segmentation, cf. Fig. 4.5e, yielding a tight bounding box around each generated part in each training image.
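The pose-graph construction from k edge-disjoint minimum spanning trees can be sketched as follows; this is a small NumPy illustration (Kruskal's algorithm with union-find) under the assumption that pairwise cosine distances between image features are already available, and it is not the authors' code.

```python
import numpy as np

def k_disjoint_mst_pose_graph(dist, k):
    """Build a pose graph as the union of k edge-disjoint minimum spanning
    trees, in the spirit of Krause et al. [71].
    dist: symmetric (n, n) matrix of cosine distances between image features.
    Returns a set of undirected edges (i, j) with i < j."""
    n = dist.shape[0]
    used = set()    # edges already assigned to a previous MST
    graph = set()   # union of all k MSTs

    def mst(excluded):
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        edges = sorted(((dist[i, j], i, j) for i in range(n)
                        for j in range(i + 1, n) if (i, j) not in excluded),
                       key=lambda e: e[0])
        tree = set()
        for _, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj:              # adding this edge does not create a cycle
                parent[ri] = rj
                tree.add((i, j))
                if len(tree) == n - 1:
                    break
        return tree

    for _ in range(k):
        tree = mst(used)
        used |= tree
        graph |= tree
    return graph

# Illustrative usage with random, unit-normalized features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
cos_dist = 1.0 - feats @ feats.T
edges = k_disjoint_mst_pose_graph(cos_dist, k=2)
print(len(edges))  # usually 2 * (8 - 1) = 14 for this toy example
```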


Through the steps introduced above, a set of parts is generated. Since not all parts are equally useful for recognition (in bird recognition, for example, some parts, such as the legs, are only rarely useful, while others, like the head, contain most of the information useful for discrimination), it is important to make use of them delicately [71]. Concretely, let $f_p^i$ be the features for image $I$ at part $p$, and let $w_{p,c}$ be the classification weights for part $p$ and class $c$, learned for each part independently. The goal is to learn a vector of $(k_2 + 1)$-dimensional weights $v$ satisfying

$$\min_{v} \sum_{i=1}^{n} \sum_{c \neq c_i} \max\left(0,\, 1 - v^{\top} u_{c_i, c}^{i}\right)^{2} + \lambda \|v\|_1\,, \tag{4.5}$$

where the $p$-th element of $u_{c_i, c}^{i}$ is the difference in decision values between the correct class $c_i$ and an incorrect class $c$:

$$u_{c_i, c}^{i}(p) = \left(w_{p, c_i} - w_{p, c}\right)^{\top} f_p^i\,. \tag{4.6}$$

Intuitively, this optimization tries to select a sparse weighting of classifiers such that, combined, the decision value for the correct class is always larger than the decision value for every other class by some margin, forming a discriminative combination of parts. Decision values for each $u^i$ can be calculated via cross-validation while training the independent classifiers at each part. The final classification is given by

$$\arg\max_{c} \sum_{p=1}^{k_2} v_p\, w_{p,c}^{\top} f_p\,. \tag{4.7}$$
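As a small worked illustration of Eq. (4.7), the NumPy sketch below combines per-part linear classifiers with learned sparse weights v to produce a final prediction; the array shapes and values are made up for illustration.

```python
import numpy as np

def combined_part_prediction(v, W, F):
    """Final classification of Eq. (4.7).
    v: (P,) learned per-part combination weights.
    W: (P, C, D) per-part linear classifier weights w_{p,c}.
    F: (P, D) part features f_p of one test image.
    Returns the predicted class index."""
    # decision[p, c] = w_{p,c}^T f_p for every part p and class c.
    decision = np.einsum("pcd,pd->pc", W, F)
    scores = (v[:, None] * decision).sum(axis=0)  # weighted sum over parts
    return int(np.argmax(scores))

# Illustrative usage with 3 parts, 4 classes, and 5-d part features.
rng = np.random.default_rng(1)
v = np.array([0.7, 0.0, 0.3])            # sparse: the second part is ignored
W = rng.normal(size=(3, 4, 5))
F = rng.normal(size=(3, 5))
print(combined_part_prediction(v, W, F))
```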

The other main challenge in using automatically generated parts is finding them in novel, completely unannotated test images. Concretely, [71] first performs detection with an R-CNN [38] trained on the whole bounding box, then uses this predicted bounding box in the segmentation framework above, removing the foreground class appearance term since the class label is unknown at test time. The nearest neighbors of the test image in the training set are calculated using features, and alignment from those images is done exactly as described above.

4.2.1.3 GCL
Graph-propagation based Correlation Learning (GCL) [149] was proposed to fully mine and exploit the discriminative potential of region correlations for Weakly Supervised Fine-Grained Image Classification (WFGIC). Specifically, GCL consists of two graph propagation sub-networks, as shown in Fig. 4.6. In the discriminative region localization phase, a Criss-cross Graph Propagation (CGP) sub-network is proposed to learn region correlations; it establishes correlations between regions and then enhances each region by aggregating the other regions in a weighted, criss-cross way. By this means, each region’s representation encodes the global image-level context and the local spatial context simultaneously, thus guiding the network to implicitly discover more powerful discriminative region groups for WFGIC.


Fig. 4.6 Framework of the Graph-propagation based Correlation Learning (GCL) model [149]

In the discriminative feature representation phase, a Correlation Feature Strengthening (CFS) sub-network is proposed to explore the internal semantic correlation among the discriminative patches’ feature vectors, improving their discriminative power by iteratively enhancing informative elements while suppressing useless ones. One graph propagation process of the CGP module includes the following two stages: in the first stage, CGP learns correlation weight coefficients between every two regions (i.e., adjacent matrix computing); in the second stage, the model combines the information of each region’s criss-cross neighbor regions through a weighted sum operation to seek the truly discriminative regions (i.e., graph updating). Concretely, the global image-level context is integrated into CGP by calculating correlations between every two regions in the whole image, and the local spatial context information is encoded through the iterative criss-cross aggregation operations. Formally, given an input feature map $M_O \in \mathbb{R}^{C \times H \times W}$, where $W$ and $H$ represent the width and height of the feature map and $C$ is the number of channels, it is fed into the CGP module $F(\cdot)$ by

$$M_S = F(M_O)\,, \tag{4.8}$$

where $F(\cdot)$ is composed of node representation, adjacent matrix computing, and graph updating, and $M_S \in \mathbb{R}^{C \times H \times W}$ is the output feature map. The node representation generation is achieved by a simple convolution operation $f(\cdot)$ as

$$M_G = f\left(W_T \cdot M_O + b_T\right)\,, \tag{4.9}$$

where $W_T \in \mathbb{R}^{C \times 1 \times 1 \times C}$ and $b_T$ are the learned weight parameters and bias vector of a convolution layer, respectively, and $M_G \in \mathbb{R}^{C \times H \times W}$ denotes the node feature map. In detail, GCL regards a 1 × 1 convolutional filter as a small region detector. Each vector $V_T \in \mathbb{R}^{C \times 1 \times 1}$ across channels at a fixed spatial location of $M_G$ represents a small region at the corresponding location of the image. GCL then uses the generated small region as a node representation.


Note that $W_T$ is randomly initialized, and three initial node feature maps are obtained by three different $f$ calculations: $M_G^1$, $M_G^2$, and $M_G^3$. After obtaining the $W \times H$ nodes with $C$-dimensional vectors in the feature maps $M_G^1$ and $M_G^2$, GCL constructs a correlation graph to calculate the semantic correlations between nodes. Each element in the adjacent matrix of the correlation graph indicates the correlation intensity between nodes. Concretely, the adjacent matrix is obtained by performing node vector inner products between the two feature maps $M_G^1 \in \mathbb{R}^{C \times H \times W}$ and $M_G^2 \in \mathbb{R}^{C \times H \times W}$. The correlation of two positions $p_1$ in $M_G^1$ and $p_2$ in $M_G^2$ is defined as

$$c(p_1, p_2) = V_1^{p_1} \cdot V_2^{p_2}\,, \tag{4.10}$$

where $V_1^{p_1}$ and $V_2^{p_2}$ denote the node representation vectors of $p_1$ and $p_2$, respectively. Note that $p_1$ and $p_2$ must meet a specific spatial constraint: $p_2$ can only be on the same row or column (i.e., criss-cross positions) as $p_1$. As a result, $W + H - 1$ correlation values are obtained for each node in $M_G^1$. GCL organizes the relative displacements in channels and obtains an output correlation matrix $M_C \in \mathbb{R}^{K \times H \times W}$, where $K = W + H - 1$. Then, $M_C$ is passed through a softmax layer to generate the adjacent matrix $R \in \mathbb{R}^{K \times H \times W}$ by

$$R^{ijk} = \frac{e^{M_C^{ijk}}}{\sum_{k=1}^{K} e^{M_C^{ijk}}}\,, \tag{4.11}$$

where $R^{ijk}$ is the correlation weight coefficient of the $i$-th row, the $j$-th column, and the $k$-th channel. In the forward pass, the more discriminative the regions are, the greater their correlations are. In the backward pass, GCL computes the derivatives with respect to each blob of node vectors: when the probability of correct classification is low, the penalty is back-propagated to lower the correlation weight of the two nodes, and the node vectors calculated through the node representation generation operation are updated at the same time. Then, GCL feeds $M_G^3 \in \mathbb{R}^{C \times H \times W}$, which is generated by the node representation generation phase, together with the adjacent matrix $R$, into the updating operation by

$$M_U^{ij} = \sum_{k=1}^{W+H-1} V_3^{wh} \cdot R^{ijk}\,, \tag{4.12}$$

where $V_3^{wh}$ is the node in the $w$-th row and the $h$-th column of $M_G^3$, and $(w, h)$ belongs to the set $[(i, 1), \ldots, (i, H), (1, j), \ldots, (W, j)]$. The node $M_U^{ij}$ is thus updated by combining the nodes in its horizontal and vertical directions with the corresponding correlation weight coefficients $R^{ijk}$. Residual learning is also adopted in GCL by

$$M_S = \alpha \cdot M_U + M_O\,, \tag{4.13}$$


where $\alpha$ is a self-adaptive weight parameter which gradually learns to assign more weight to the discriminative correlation features. It ranges over $[0, 1]$ and is initialized to a value approximating 0. In this way, $M_S$ aggregates the correlation features and the original input features to pick out more discriminative patches. Then, GCL feeds $M_S$ as the new input into the next iteration of CGP. After multiple graph propagations, each node can aggregate all regions with different frequencies, which indirectly learns the global correlations; moreover, the closer a region is to the aggregating region, the higher its aggregation frequency during graph propagation, which encodes the local spatial context information. After obtaining the residual feature map $M_S$, which aggregates the correlation features and the original input features, GCL feeds it into a discriminative response layer. Concretely, it introduces a $1 \times 1 \times N$ convolution layer and a sigmoid function $\sigma$ to learn discriminative probability maps $S \in \mathbb{R}^{N \times H \times W}$, which indicate the impact of discriminative regions on the final classification, where $N$ is the number of default patches at a given location in the feature maps. Afterwards, each default patch $p_{ijk}$ is assigned the corresponding discriminative probability value. The formulaic representation is

$$p_{ijk} = \left[t_x, t_y, t_w, t_h, s_{ijk}\right]\,, \tag{4.14}$$

where $(t_x, t_y, t_w, t_h)$ are the default coordinates of each patch and $s_{ijk}$ denotes the discriminative probability value of the $i$-th row, the $j$-th column, and the $k$-th channel. Finally, the network picks the top $M$ patches according to the probability value, where $M$ is a hyper-parameter. GCL also proposes a CFS sub-network to explore the internal semantic correlation between region feature vectors to obtain better discriminative ability. The details of CFS are as follows. To construct the graph for mining correlations among the selected patches, GCL extracts $M$ nodes with $D$-dimensional feature vectors from the $M$ selected patches as the input of a Graph Convolutional Network (GCN) [14]. After detecting the $M$ nodes, the adjacent matrix of correlation coefficients is computed, which indicates the correlation intensity between nodes. Each element of the adjacent matrix can be calculated as

$$R_{ij} = c_{ij} \cdot \langle n_i, n_j \rangle\,, \tag{4.15}$$

where $R_{ij}$ denotes the correlation coefficient between two nodes $(n_i, n_j)$, and $c_{ij}$ is the correlation weight coefficient in the weighted matrix $C \in \mathbb{R}^{M \times M}$, which can be learned to adjust the correlation coefficient $R_{ij}$ through back-propagation. Then, GCL performs normalization on each row of the adjacent matrix to ensure that the sum of all the edges connected to one node equals 1. The normalization of the adjacent matrix $A \in \mathbb{R}^{M \times M}$ is realized by the softmax function as

$$A_{ij} = \frac{\exp\left(R\left(n_i, n_j\right)\right)}{\sum_{j=1}^{N} \exp\left(R\left(n_i, n_j\right)\right)}\,. \tag{4.16}$$


As a result, the constructed correlation graph measures the relationship intensity between the selected patches. After obtaining the adjacency matrix, GCL takes both the feature representations N ∈ R^{M×D} of the M nodes and the corresponding adjacency matrix A ∈ R^{M×M} as inputs, and updates the node features as N' ∈ R^{M×D'}. Formally, one layer of GCL can be represented as

N' = f(N, A) = h(A N W),    (4.17)

where W ∈ R^{D×D'} contains the learned weight parameters and h(·) is a non-linear function (i.e., ReLU). After multiple propagations, the discriminative information in the selected patches can interact more widely, which yields better discriminative ability.
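A one-layer update of the CFS module (Eqs. (4.15)–(4.17)) can be sketched as below; the learnable weighting matrix C, the row-wise softmax normalization and the ReLU non-linearity follow the text, while the module name, initialization and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFSLayer(nn.Module):
    """Sketch of one CFS graph-convolution layer, cf. Eqs. (4.15)-(4.17)."""

    def __init__(self, num_nodes, in_dim, out_dim):
        super().__init__()
        self.c = nn.Parameter(torch.ones(num_nodes, num_nodes))      # learnable c_ij
        self.w = nn.Parameter(torch.randn(in_dim, out_dim) * 0.01)   # W in Eq. (4.17)

    def forward(self, n):              # n: (M, D) features of the selected patches
        r = self.c * (n @ n.t())       # R_ij = c_ij * <n_i, n_j>    (Eq. 4.15)
        a = F.softmax(r, dim=1)        # row-wise normalization      (Eq. 4.16)
        return F.relu(a @ n @ self.w)  # N' = h(A N W)               (Eq. 4.17)

# usage sketch: CFSLayer(num_nodes=4, in_dim=2048, out_dim=2048)(torch.randn(4, 2048))
```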

GCL proposes an end-to-end model which incorporates CGP and CFS into a unified framework. The two modules are trained together under the supervision of a multi-task loss L, which consists of a basic fine-grained classification loss L_{cls}, a guided loss L_{gud}, a rank loss L_{rank} and a feature strengthening loss L_{fea}:

L = L_{cls} + \lambda_1 \cdot L_{gud} + \lambda_2 \cdot L_{rank} + \lambda_3 \cdot L_{fea},    (4.18)

where λ_1, λ_2, λ_3 are balancing hyper-parameters among these losses. More concretely, let X represent the original image, and denote the selected discriminative patches with and without the CFS module as P = {P_1, P_2, ..., P_N} and P' = {P'_1, P'_2, ..., P'_N}, respectively. C is the confidence function which reflects the probability of classification into the correct category, and S = {S_1, S_2, ..., S_N} denotes the discriminative probability scores. Then, the guided loss, rank loss and feature strengthening loss are defined as

L_{gud}(X, P) = \sum_{i}^{N} \max\{0, \log C(X) - \log C(P_i)\},    (4.19)

L_{rank}(S, P) = \sum_{(i,j):\, C(P_i) < C(P_j)} \max\{0, S_i - S_j\}.    (4.20)
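The guided and rank losses of Eqs. (4.19) and (4.20) can be written down directly from the patch confidences and discriminative scores. The pairing rule below (penalize score pairs whose order disagrees with the confidence order) follows the reconstruction above, so treat this as an illustrative sketch rather than the authors' implementation.

```python
import torch

def guided_loss(log_conf_image, log_conf_patches):
    """Eq. (4.19): every selected patch should be at least as confident as the image."""
    return torch.clamp(log_conf_image - log_conf_patches, min=0).sum()

def rank_loss(scores, conf_patches):
    """Eq. (4.20): discriminative scores should follow the confidence order (assumed pairing)."""
    loss = scores.new_zeros(())
    n = scores.numel()
    for i in range(n):
        for j in range(n):
            if conf_patches[i] < conf_patches[j]:
                loss = loss + torch.clamp(scores[i] - scores[j], min=0)
    return loss
```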

Here, \hat{M} ∈ R^{5×H×W} denotes the class response maps corresponding to \hat{Prob} in Eq. (4.32), and δ is a threshold. The response map R is then mapped into [0, 1] by Min-Max normalization, i.e., R = \frac{R - \min(R)}{\max(R) - \min(R)}. All local maxima within a window of size r in R are found, and their locations are denoted as T = {(x_1, y_1), (x_2, y_2), ..., (x_{N_t}, y_{N_t})}, where N_t is the number of detected peaks. For each peak (x, y) ∈ T detected by the above procedure, a random number ζ_{(x,y)} is generated from the uniform distribution between 0 and 1. Then, peaks are partitioned into two sets, T_d and T_c, according to their response values as

T_d = \{(x, y) \mid (x, y) \in T, \ \text{if } R_{x,y} \ge \zeta_{(x,y)}\}, \quad T_c = \{(x, y) \mid (x, y) \in T, \ \text{if } R_{x,y} < \zeta_{(x,y)}\}.    (4.33)

Peaks of high response value, which localize discriminative evidence (e.g., unique patterns for the fine-grained categories), are more likely to be partitioned into T_d, while peaks of low response value, which localize complementary evidence (e.g., supporting patterns), are more likely to be partitioned into T_c. Finally, a Gaussian kernel is utilized to compute a set of sparse attentions A ∈ R^{N_t×H×W} attending to each peak as

A_{i,x,y} = \begin{cases} R_{x_i,y_i}\, e^{-\frac{(x-x_i)^2 + (y-y_i)^2}{R_{x_i,y_i}\,\beta_1^2}}, & \text{if } (x_i, y_i) \in T_d \\ \frac{1}{R_{x_i,y_i}}\, e^{-\frac{(x-x_i)^2 + (y-y_i)^2}{R_{x_i,y_i}\,\beta_2^2}}, & \text{if } (x_i, y_i) \in T_c \end{cases}    (4.34)

where β_1 and β_2 are learnable parameters and R_{x_i,y_i} is the peak value of the i-th peak in T. With the sparse attention defined in Eq. (4.34), image re-sampling is performed to highlight fine-grained details from informative local regions while preserving surrounding context information. Two sampling maps Q_d and Q_c for the discriminative branch and the complementary branch of feature extraction are constructed as

Q_d = \sum_{i} A_i \ \text{ if } (x_i, y_i) \in T_d, \qquad Q_c = \sum_{i} A_i \ \text{ if } (x_i, y_i) \in T_c.    (4.35)
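The peak partition and Gaussian sparse attention of Eqs. (4.33)–(4.35) can be sketched as follows. Peak detection itself (finding local maxima of R) is assumed to have been done already, β_1 and β_2 are plain floats here instead of learnable parameters, and the (row, col) indexing convention is an assumption.

```python
import torch

def sparse_sampling_maps(resp, peaks, beta1=0.1, beta2=0.3):
    """resp: (H, W) normalized response map R; peaks: list of (row, col) peak locations.

    Returns the sampling maps Q_d and Q_c of Eq. (4.35).
    """
    H, W = resp.shape
    rows, cols = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    q_d = torch.zeros_like(resp)
    q_c = torch.zeros_like(resp)
    for (x, y) in peaks:
        r = resp[x, y].clamp_min(1e-6)                       # peak value R_{x_i, y_i}
        zeta = torch.rand(())                                # random threshold, Eq. (4.33)
        dist2 = ((rows - x) ** 2 + (cols - y) ** 2).float()
        if r >= zeta:                                        # discriminative peak (T_d)
            q_d += r * torch.exp(-dist2 / (r * beta1 ** 2))  # Eq. (4.34), first case
        else:                                                # complementary peak (T_c)
            q_c += (1.0 / r) * torch.exp(-dist2 / (r * beta2 ** 2))
    return q_d, q_c
```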


Denote an input image I as a mesh grid with vertices V, where V = [v_0, v_1, ..., v_{end}] and v_i = (v_x^i, v_y^i) ∈ R^2. The sampling procedure aims to explore a new mesh geometry V' = [v'_0, v'_1, ..., v'_{end}], where regions of higher significance enjoy uniform scaling and those of lower significance are allowed to be suppressed to a large extent. This problem can be converted into finding a mapping between the re-sampled image and the input image; such a mapping can be written as two functions, f(v) and g(v), so that X_{new}(v) = X(f(v), g(v)), where X_{new} denotes the re-sampled image. The goal of designing f and g is to map pixels proportionally to the normalized weight assigned to them by the sampling map. An exact approximation to this problem is that f and g satisfy the condition

\int_{0}^{f(v)} \int_{0}^{g(v)} Q(v')\, dv_x\, dv_y = v_x v_y.

Following the method in [113], the solution can be described as

f(v) = \frac{\sum_{v'} Q(v')\, k(v', v)\, v'_x}{\sum_{v'} Q(v')\, k(v', v)},    (4.36)

g(v) = \frac{\sum_{v'} Q(v')\, k(v', v)\, v'_y}{\sum_{v'} Q(v')\, k(v', v)},    (4.37)

where k(v', v) is a Gaussian distance kernel that acts as a regularizer and avoids extreme cases, such as all the pixels converging to the same value. By substituting Q in Eq. (4.36) and in Eq. (4.37) with the Q_d and Q_c computed in Eq. (4.35), two re-sampled images can be obtained, named the discriminative branch image and the complementary branch image, respectively. With the sparse attention and selective sampling procedure defined above, the feature learning procedure is implemented in an end-to-end manner. During the process, an image I is first fed to S3Ns and generates two re-sampled images of the same size as the input image. They amplify a dynamic number of informative regions corresponding to discriminative and complementary features. The two re-sampled images are then taken as inputs by S3Ns for extracting fine-grained features. Therefore, the feature representation for each image can be defined as F_J = {F_O, F_D, F_C}, where F_O, F_D, F_C denote the features extracted from the original image, the discriminative branch image, and the complementary branch image, respectively. These features are concatenated and fed to a fully-connected fusion layer with a softmax function for the final classification. During learning, the whole model is optimized by classification losses defined as

L(X) = \sum_{i \in I} L_{cls}(Y^i, y) + L_{cls}(Y^j, y),    (4.38)

where Lcls denotes the cross-entropy loss. I is {O, D, C}. Y i is the predicted label vector from original and re-sampling images based on features F O , F D , and F C . Y j is the predicted label vector using joint features F J and y is the ground-truth label vector.
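The non-uniform re-sampling of Eqs. (4.36)–(4.37) boils down to computing, for every target grid vertex, a kernel-weighted average of source coordinates in which the weights are the sampling map Q. Below is a deliberately naive dense sketch; the Gaussian kernel width is an assumed free parameter, and torch.nn.functional.grid_sample is used afterwards to produce the re-sampled image.

```python
import torch
import torch.nn.functional as F

def attention_resample_grid(q, sigma=0.3):
    """Build the mapping (f(v), g(v)) of Eqs. (4.36)-(4.37) from a sampling map Q.

    q: (H, W) non-negative sampling map. Returns a grid of shape (1, H, W, 2)
    in [-1, 1] coordinates, directly usable with F.grid_sample.
    """
    H, W = q.shape
    gy, gx = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    src = torch.stack([gx, gy], dim=-1).reshape(-1, 2)       # source coordinates v'
    tgt = src.clone()                                        # target vertices v
    qv = q.reshape(-1)                                       # Q(v')

    # Gaussian distance kernel k(v', v) acting as a regularizer
    d2 = ((tgt[:, None, :] - src[None, :, :]) ** 2).sum(-1)  # (H*W, H*W)
    w = torch.exp(-d2 / (2 * sigma ** 2)) * qv[None, :]      # Q(v') k(v', v)
    denom = w.sum(dim=1, keepdim=True).clamp_min(1e-8)
    fg = (w @ src) / denom                                   # Eqs. (4.36)-(4.37)
    return fg.reshape(1, H, W, 2)

# usage sketch: grid = attention_resample_grid(q_d)
#               x_new = F.grid_sample(image, grid, align_corners=True)
```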

4.2.3 Leveraging Attention Mechanisms

Even though the previous localization-classification fine-grained methods have shown strong classification performance, one of their major drawbacks is that they require meaningful definitions of the object parts. In many applications, however, it may be hard to represent or even define common parts of some object classes, e.g., non-structured objects like food dishes [5] or flowers with repeating parts [105]. Compared to these localization-classification methods, a more natural way of finding parts is to leverage attention mechanisms [62] as sub-modules. This enables CNNs to attend to loosely defined regions of fine-grained objects, and such methods have consequently emerged as a promising direction. It is common knowledge that attention plays an important role in human perception [19, 62]. Humans exploit a sequence of partial glimpses and selectively focus on salient parts of an object or a scene in order to better capture visual structure [76]. Inspired by this, Fu et al. and Zheng et al. [32, 182] were the first to incorporate attention processing to improve the fine-grained recognition accuracy of CNNs. Specifically, RA-CNN [32] uses a recurrent visual attention model to select a sequence of attention regions (corresponding to object “parts”; here “parts” refers to the loosely defined attention regions for fine-grained objects, which is different from the clearly defined object parts obtained from manual annotations, cf. Sect. 4.2.1). RA-CNN iteratively generates region attention maps in a coarse-to-fine fashion by taking previous predictions as a reference. MA-CNN [182] is equipped with a multi-attention CNN and can return multiple region attentions in parallel. Subsequently, Peng et al. [109] and Zheng et al. [185] proposed multi-level attention models to obtain hierarchical attention information (i.e., both object- and part-level). He et al. [49] applied multi-level attention to localize multiple discriminative regions simultaneously for each image via an n-pathway end-to-end discriminative localization network that simultaneously localizes discriminative regions and encodes their features. Such multi-level attention can yield diverse and complementary information compared to the aforementioned single-level attention methods. Sun et al. [128] incorporated channel attention [58] and metric learning [7] to enforce correlations among different attended regions. Zheng et al. [184] developed a trilinear attention sampling network to learn fine-grained details from hundreds of part proposals and efficiently distill the learned features into a single CNN. Recently, Ji et al. [64] presented an attention-based convolutional binary neural tree, which incorporates attention mechanisms within a tree structure to facilitate coarse-to-fine hierarchical fine-grained feature learning. Although the attention mechanism achieves strong accuracy in fine-grained recognition, it tends to overfit in the case of small-scale data. In the following, we elaborate on several representative methods in this research direction.

4.2.3.1 RA-CNN

Recurrent Attention Convolutional Neural Network (RA-CNN) [32] was proposed to recursively learn discriminative region attention and region-based feature representation at

multiple scales in a mutually reinforced way. The learning at each scale consists of a classification sub-network and an attention proposal sub-network (APN). The APN starts from full images, and iteratively generates region attention from coarse to fine by taking previous predictions as a reference, while a finer scale network takes as input an amplified attended region from previous scales in a recurrent way. RA-CNN is optimized by an intra-scale classification loss and an inter-scale ranking loss, to mutually learn accurate region attention and fine-grained representation. In other words, RA-CNN does not need bounding box/part annotations and can be trained end-to-end. Given an input image I, RA-CNN first extracts region-based deep features by feeding the images into pre-trained convolution layers. The extracted deep representations are denoted as X = f (I; ) ∈ R H ×W ×C , where f (·) denotes a set of operations of convolution, pooling and activation, and  denotes the overall parameters. RA-CNN then further models the network at each scale as a multi-task formulation with two outputs using APN. The first task is designed to generate a probability distribution p over fine-grained categories, shown as p(I) = g1 (X),

(4.39)

where g1 (·) represents fully-connected layers to map convolutional features to a feature vector that could be matched with the category entries, as well as including a softmax layer to further transform the feature vector to probabilities. The second task is proposed to predict a set of box coordinates of an attended region for the next finer scale. By approximating the attended region as a square with three parameters, the representation is given by [tx , t y , tl ] = g2 (X),

(4.40)

where tx , t y denotes the square’s center coordinates in terms of x and y axis, respectively, and tl denotes the half of the square’s side length. The specific form of g2 (·) can be represented by two-stacked fully-connected layers with three outputs which are the parameters of the attended regions. Once the location of an attended region is hypothesized, RA-CNN crops and zooms in the attended region to a finer scale with higher resolution to extract more fine-grained features. Assume the top-left corner in original images as the origin of a pixel coordinate system, whose x-axis and y-axis are defined from left-to-right and top-to-bottom, respectively. RACNN adopts the parameterization of the top-left (denoted as “tl”) and bottom-right (denoted as “br ”) points from the attended region as tx(tl) = tx − tl , t y(tl) = t y − tl , tx(br ) = tx + tl , t y(br ) = t y + tl .

(4.41)

Based on the above representations, the cropping operation can be implemented by an element-wise multiplication between the original image at coarser scales and an attention mask, which can be computed as


I^{att} = I ⊙ M(t_x, t_y, t_l),    (4.42)

where ⊙ represents element-wise multiplication, I^{att} denotes the cropped attended region, and M(·) acts as an attention mask, with the specific form

M(·) = [h(x − t_x^{(tl)}) − h(x − t_x^{(br)})] · [h(y − t_y^{(tl)}) − h(y − t_y^{(br)})],    (4.43)

and h(·) is a logistic function with index k:

h(x) = 1 / \{1 + \exp(−kx)\}.    (4.44)
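Equations (4.42)–(4.45) amount to multiplying the image by a smooth box mask built from shifted sigmoids and then zooming into the attended square; the sketch below follows that recipe, using torch's bilinear interpolation in place of the explicit interpolation of Eq. (4.45). The index k of the logistic function and the output size are hand-picked constants here, not values taken from [32].

```python
import torch
import torch.nn.functional as F

def attention_crop_zoom(image, tx, ty, tl, out_size=224, k=10.0):
    """image: (1, 3, H, W); (tx, ty, tl): square centre and half side length in pixels."""
    _, _, H, W = image.shape
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, 1, W)
    ys = torch.arange(H, dtype=torch.float32).view(1, 1, H, 1)

    def h(z):                                   # logistic function of Eq. (4.44)
        return torch.sigmoid(k * z)

    # boxcar-like attention mask M(t_x, t_y, t_l) from Eq. (4.43)
    mask = (h(xs - (tx - tl)) - h(xs - (tx + tl))) * \
           (h(ys - (ty - tl)) - h(ys - (ty + tl)))
    attended = image * mask                     # Eq. (4.42)

    # crop the (soft) square region and zoom in, standing in for Eq. (4.45)
    x0, x1 = int(max(tx - tl, 0)), int(min(tx + tl, W))
    y0, y1 = int(max(ty - tl, 0)), int(min(ty + tl, H))
    region = attended[:, :, y0:y1, x0:x1]
    return F.interpolate(region, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
```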

Then, RA-CNN uses bilinear interpolation to compute the amplified output I^{amp} from the nearest four inputs in I^{att} by a linear map, which is given by

I^{amp}_{(i,j)} = \sum_{\alpha,\beta=0}^{1} |1 - \alpha - \{i/\lambda\}|\, |1 - \beta - \{j/\lambda\}|\, I^{att}_{(m,n)},    (4.45)

where m = [i/\lambda] + \alpha, n = [j/\lambda] + \beta, and \lambda is an upsampling factor, which equals the enlarged size divided by t_l. [\cdot] and \{\cdot\} denote the integral and fractional parts, respectively. Finally, RA-CNN minimizes an objective function following a multi-task loss. The loss function for an image sample is defined as

\mathcal{L}(I) = \sum_{s=1}^{3} \mathcal{L}_{cls}(Y^{(s)}, Y^*) + \sum_{s=1}^{2} \mathcal{L}_{rank}(p_t^{(s)}, p_t^{(s+1)}),    (4.46)

Fig. 4.10 Overall framework of the RA-CNN model [32]

where s denotes each scale, and Y^{(s)} and Y^* denote the predicted label vector from a specific scale and the ground-truth label vector, respectively. \mathcal{L}_{cls} represents the classification loss, which predominantly optimizes the parameters of the convolution and classification layers in Fig. 4.10 (b1 to b3 and c1 to c3) to ensure adequate discrimination ability at each scale. The training is implemented by fitting category labels on the overall training samples via a softmax function. Besides, p_t^{(s)} in the pairwise ranking loss \mathcal{L}_{rank} denotes the prediction probability on the correct category label t. Specifically, the ranking loss is given by

\mathcal{L}_{rank}(p_t^{(s)}, p_t^{(s+1)}) = \max\{0, p_t^{(s)} - p_t^{(s+1)} + \text{margin}\},    (4.47)

which enforces p_t^{(s+1)} > p_t^{(s)} + margin in training.
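The multi-task objective of Eqs. (4.46)–(4.47) combines per-scale classification losses with pairwise ranking between adjacent scales. A sketch is given below under the assumption that the per-scale logits are already available; the margin value is illustrative.

```python
import torch
import torch.nn.functional as F

def ra_cnn_loss(logits_per_scale, target, margin=0.05):
    """logits_per_scale: list of three (1, num_classes) tensors; target: int class index."""
    target_tensor = torch.tensor([target])
    cls = sum(F.cross_entropy(logit, target_tensor) for logit in logits_per_scale)  # Eq. (4.46), first term
    # probability of the correct class at each scale, p_t^{(s)}
    probs = [F.softmax(logit, dim=1)[0, target] for logit in logits_per_scale]
    rank = sum(torch.clamp(probs[s] - probs[s + 1] + margin, min=0)                 # Eq. (4.47)
               for s in range(len(probs) - 1))
    return cls + rank
```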

4.2.3.2 MA-CNN

Multi-Attention Convolutional Neural Network (MA-CNN) [182] was proposed as a part-learning approach, where part generation and feature learning can reinforce each other in a parallel fashion. MA-CNN consists of convolution, channel grouping, and part classification sub-networks, as shown in Fig. 4.11. The channel grouping network takes input feature channels from convolution layers and generates multiple parts by clustering, weighting, and pooling from spatially-correlated channels. The part classification network further classifies an image by each individual part, through which more discriminative fine-grained features can be learned. Two losses are proposed to guide the multi-task learning of channel grouping and part classification, which encourages MA-CNN to generate more discriminative parts from feature channels and learn better fine-grained features from parts in a mutually reinforced way.

Fig. 4.11 Overall framework of the MA-CNN model [182]: (a) input image, (b) conv layers, (c) feature channels, (d) channel grouping layers, (e) part attentions, (f) part representations, (g) classification layers

Concretely, given an input image I, MA-CNN first extracts region-based deep features by feeding the image into pre-trained convolution layers. The extracted deep representations are denoted as X = f(I; Θ) ∈ R^{H×W×C}, where f(·) denotes a set of convolution, pooling and activation operations, Θ denotes the overall parameters, and H, W, C indicate the height, width and number of feature channels. Since a convolutional feature channel can correspond to a certain type of visual pattern (e.g., stripe) [121, 179], MA-CNN represents

each feature channel as a position vector whose elements are the coordinates of the peak responses over all training image instances, which is given by

[t_x^1, t_y^1, t_x^2, t_y^2, \ldots, t_x^\Omega, t_y^\Omega],    (4.48)

where (t_x^i, t_y^i) are the coordinates of the peak response of the i-th image in the training set, and Ω is the number of training images. MA-CNN then treats the position vectors as features and clusters the channels into N groups, which act as N part detectors. The resultant i-th group is represented by an indicator function over all feature channels, given by

[\mathbb{1}\{1\}, \mathbb{1}\{2\}, \ldots, \mathbb{1}\{C\}],    (4.49)

where 1{j} equals one if the j-th channel belongs to the i-th cluster and zero otherwise. To generate N parts, MA-CNN defines a group of FC layers G(·) = [g_1(·), g_2(·), ..., g_N(·)]. Each g_i(·) takes convolutional features as input and produces a weight vector d_i over the different channels (from 1 to C), which is given by

d_i(I) = g_i(X),    (4.50)

where d_i = [d_1, d_2, ..., d_C] (the subscript i of each d_c is omitted for simplicity). Based on the learned weights over feature channels, MA-CNN further obtains the part attention map for the i-th part as

M_i(I) = \mathrm{sigmoid}\Big(\sum_{j=1}^{C} d_j [X]_j\Big),    (4.51)

where [X]_j denotes the j-th feature channel of the convolutional features X, and the operation between d_j and [X]_j denotes multiplication between a scalar and a matrix. The resultant M_i(I) is further normalized by the sum of its elements, yielding one part attention map; later we denote M_i(I) as M_i for simplicity. Furthermore, the final convolutional feature representation for the i-th part is calculated via spatial pooling on each channel, which is given by

p_i(I) = \sum_{j=1}^{C} ([X]_j \cdot M_i),    (4.52)

where the dot product denotes element-wise multiplication between [X]_j and M_i.
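Equations (4.50)–(4.52) turn a channel-weight vector into a spatial part attention and a pooled part feature. A sketch of one part branch follows; the layer sizes are illustrative, and the spatial pooling at the end is one reasonable reading of Eq. (4.52), with the channel weights produced by a single FC layer standing in for g_i(·).

```python
import torch
import torch.nn as nn

class PartAttention(nn.Module):
    """One MA-CNN-style part branch: channel weights -> attention map -> part feature."""

    def __init__(self, channels=512, spatial=14 * 14):
        super().__init__()
        self.g = nn.Linear(channels * spatial, channels)    # g_i(.) of Eq. (4.50)

    def forward(self, x):               # x: (C, H, W) convolutional features
        c, h, w = x.shape               # assumes c == channels and h * w == spatial
        d = self.g(x.reshape(1, -1))                        # channel weights d_i, (1, C)
        m = torch.sigmoid((d.view(c, 1, 1) * x).sum(0))     # Eq. (4.51), (H, W)
        m = m / m.sum().clamp_min(1e-8)                     # normalize by its sum
        p = (x * m).sum(dim=(1, 2))                         # Eq. (4.52)-style pooling, (C,)
        return m, p

# usage sketch: attn_map, part_feat = PartAttention()(torch.randn(512, 14, 14))
```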

In the following, we introduce the multi-task mechanism. In MA-CNN, the loss function for an image I is defined as

\mathcal{L}(I) = \sum_{i=1}^{N} \mathcal{L}_{cls}(Y^{(i)}, Y^*) + \mathcal{L}_{cng}(M_1, \ldots, M_N),    (4.53)

where \mathcal{L}_{cls} and \mathcal{L}_{cng} represent the classification loss on each of the N parts and the channel grouping loss, respectively. Y^{(i)} denotes the predicted label vector from the i-th part, obtained by using the part-based feature p_i(I), and Y^* is the ground-truth label vector. The training is implemented by fitting category labels via a softmax function. The channel grouping loss for compact and diverse part learning is given by

\mathcal{L}_{cng}(M_i) = \mathrm{Dis}(M_i) + \lambda\, \mathrm{Div}(M_i),    (4.54)

where Dis(·) and Div(·) are a distance and a diversity function, with λ the weight of the latter. Dis(·) encourages a compact distribution, and its concrete form is designed as

\mathrm{Dis}(M_i) = \sum_{(x,y) \in M_i} m_i(x, y)\big[\|x - t_x\|^2 + \|y - t_y\|^2\big],    (4.55)

where m_i(x, y) takes as input the coordinates (x, y) from M_i and produces the amplitude of the response. Div(·) is designed to favor a diverse attention distribution across the different part attention maps, i.e., M_1 to M_N. Its concrete form is formulated as

\mathrm{Div}(M_i) = \sum_{(x,y) \in M_i} m_i(x, y)\big[\max_{k \neq i} m_k(x, y) - \mathrm{mrg}\big],    (4.56)

where i, k index different part attention maps, and "mrg" represents a margin, which makes the loss less sensitive to noise and thus improves robustness.
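The compactness and diversity terms of Eqs. (4.55)–(4.56) can be computed directly from the attention maps. In the sketch below the attention peak of each map is taken as its (t_x, t_y) target coordinate, which is one reasonable reading of the formulation rather than the authors' exact choice.

```python
import torch

def channel_grouping_loss(attn_maps, lam=1.0, margin=0.02):
    """attn_maps: (N, H, W) part attention maps m_1..m_N (non-negative)."""
    n, h, w = attn_maps.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    dis, div = 0.0, 0.0
    for i in range(n):
        m = attn_maps[i]
        # the peak location of m_i acts as (t_x, t_y) -- an assumption
        flat_idx = torch.argmax(m)
        ty = torch.div(flat_idx, w, rounding_mode="floor")
        tx = flat_idx % w
        dis = dis + (m * ((xs - tx) ** 2 + (ys - ty) ** 2)).sum()     # Eq. (4.55)
        others = attn_maps[torch.arange(n) != i].amax(dim=0)          # max_{k != i} m_k
        div = div + (m * (others - margin)).sum()                     # Eq. (4.56)
    return dis + lam * div
```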

4.2.3.3 Attention Binary Neural Tree

Attention Convolutional binary Neural tree (ACNet) [64] was proposed to leverage both the attention mechanism and a hierarchical binary tree structure for fine-grained image recognition. Specifically, ACNet incorporates convolutional operations along the edges of the tree structure and uses the routing functions in each node to determine the root-to-leaf computational paths within the tree. The deep convolutional operations learn to capture the representations of objects, and the tree structure characterizes the coarse-to-fine hierarchical feature learning process. The final decision is computed as the summation of the predictions from the leaf nodes. In addition, ACNet uses an attention transformer module to enforce the network to capture discriminative features. ACNet consists of four modules, i.e., the backbone network, the branch routing, the attention transformer, and the label prediction modules, as shown in Fig. 4.12.

Fig. 4.12 Overall framework of the ACNet model [64]

Concretely, a sample I is first fed into the backbone network f(·; Θ) to obtain the corresponding CNN feature maps X = f(I; Θ) ∈ R^{H×W×C}. The branch routing module then determines which child (i.e., the left or right child) the sample should be sent to. Specifically, as shown in Fig. 4.12b, the i-th routing module R_i^k(·) at the k-th layer uses one convolution layer with kernel size 1 × 1, followed by a global context block [9]. After that, ACNet uses global average pooling (GAP) [85], element-wise square-root and ℓ2-normalization [88], and a fully connected (FC) layer with a sigmoid activation function to produce a scalar value in [0, 1] indicating the probability of


samples being sent to the left or right sub-branches. Let φ_i^k(I) denote the output probability of the sample I being sent to the right sub-branch, produced by the branch routing module R_i^k(X), where φ_i^k(I) ∈ [0, 1], (i = 1, ..., 2^{k−1}). Thus, the probability of the sample I being sent to the left sub-branch is 1 − φ_i^k(I). If the probability φ_i^k(I) is larger than 0.5, ACNet prefers the left path; otherwise, the right branch dominates the final decision. The attention transformer module is used to enforce the network to capture discriminative features. Specifically, as shown in Fig. 4.12c, ACNet integrates the Atrous Spatial Pyramid Pooling (ASPP) module [11] into the attention transformer, which provides feature maps characterized by different scales/receptive fields, followed by an attention module. Multi-scale feature maps are generated by four parallel dilated convolutions with different dilation rates, i.e., 1, 6, 12, 18. Following the parallel dilated convolution layers, the concatenated feature maps are fused by one convolution layer with kernel 1 × 1 and stride 1. Following the ASPP module, ACNet inserts an attention module, which generates a channel attention map of size R^{C×1×1} using a batch normalization (BN) layer [61], a GAP layer, an FC layer with the ReLU activation function, and an FC layer with the sigmoid function. Finally, for each leaf node, ACNet uses the label prediction module P_i (i = 1, ..., 2^{h−1}) to predict the subordinate category of the object I; it is formed by a BN layer, a convolution layer with kernel size 1 × 1, a max-pooling layer, a sqrt and ℓ2-normalization layer, and an FC layer, see Fig. 4.12d. Let r_i^k(I) be the accumulated probability of the object I passing from the root node to the i-th node at the k-th layer. For example, if the path from the root to the node R_1^k(·) on the tree is R_1^1, R_1^2, ..., R_1^k, i.e., the object I is always sent to the left child, then r_1^k(I) = \prod_{i=1}^{k} φ_1^i(I). Then, the final prediction C(I) of the object I is computed as the summation of all leaf predictions multiplied by the accumulated

probability generated by the passing branch routing modules, i.e., C(I) = \sum_{i=1}^{2^{h-1}} P_i(I)\, r_i^h(I). Notably, ACNet emphasizes that ‖C(I)‖_1 = 1, i.e., the summation of the confidences of I belonging to all subordinate classes equals 1, which is formulated as

\|C(I)\|_1 = \Big\|\sum_{i=1}^{2^{h-1}} P_i(I)\, r_i^h(I)\Big\|_1 = 1,    (4.57)

where r_i^h(I) is the accumulated probability of the i-th node at the leaf layer.
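The accumulated root-to-leaf probabilities and the aggregation of Eq. (4.57) can be sketched for a small binary tree as follows; the routing probabilities and leaf class distributions are placeholders for the outputs of the routing and label prediction modules, and treating each routing value as the probability of taking the left child is an assumption.

```python
import torch

def acnet_predict(routing_probs, leaf_preds):
    """routing_probs: list over layers; layer k holds a tensor of shape (2**k,) giving,
    for each node, the probability of taking the LEFT child (an assumed convention).
    leaf_preds: (num_leaves, num_classes), each row summing to 1.
    Returns C(I) as in Eq. (4.57); it sums to 1 when the leaf rows do."""
    num_leaves, _ = leaf_preds.shape
    acc = torch.ones(num_leaves)                        # r_i^h(I) per leaf
    for leaf in range(num_leaves):
        node = leaf
        for k in reversed(range(len(routing_probs))):   # walk from the leaf level up to the root
            parent, is_right = node // 2, node % 2
            p_left = routing_probs[k][parent]
            acc[leaf] = acc[leaf] * (1 - p_left if is_right else p_left)
            node = parent
    return (acc.unsqueeze(1) * leaf_preds).sum(dim=0)   # weighted sum of leaf predictions

# usage sketch (depth-3 tree, 4 leaves):
# acnet_predict([torch.tensor([0.6]), torch.tensor([0.7, 0.2])],
#               torch.softmax(torch.randn(4, 200), dim=1))
```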

4.2.3.4 CAL

Counterfactual Attention Learning (CAL) [112] was proposed to learn more effective attention based on causal inference [108] for fine-grained image recognition. Unlike most existing methods that learn visual attention based on conventional likelihood, CAL proposes to learn attention with counterfactual causality, which provides a tool to measure attention quality and a powerful supervisory signal to guide the learning process. In other words, it analyzes the effect of the learned visual attention on the network prediction through counterfactual intervention and maximizes this effect to encourage the network to learn more useful attention for fine-grained image recognition. Concretely, given an image I and the corresponding CNN feature maps X = f(I; Θ) ∈ R^{H×W×C} derived from a CNN model f(·; Θ), visual-spatial attention aims to discover the discriminative regions of the image and improve the CNN feature maps X by explicitly incorporating structural knowledge of objects. The attention model M is designed to learn the spatial distributions of an object's parts, which can be represented as attention maps A ∈ R^{H×W×M}, where M is the number of attentions. Using the attention model M, the attention maps can be computed by A = {A_1, A_2, . . . , A_M} = M(X),

(4.58)

where A_i ∈ R^{H×W} is the attention map covering a certain part, such as the wing of a bird or the clothes of a person. In CAL [112], the attention model M is implemented using a 2D convolution layer followed by ReLU activation. The attention maps are then used to softly weight the feature maps, which are aggregated by a global average pooling operation φ:

h_i = \phi(X * A_i) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} X^{h,w} A_i^{h,w},    (4.59)

where ∗ denotes element-wise multiplication for two tensors. Also, CAL summarizes the representation of different parts to form the global representation h as h = normalize([h1 , h2 , . . . , h M ]) ,

(4.60)

where CAL concatenates these representations and normalizes the summarized representation to control its scale. The final representation h can be fed into a classifier (e.g., fully

connected layer) for classification. The overall framework of the baseline attention model of CAL is presented in Fig. 4.13.

Fig. 4.13 Schematic illustration of the CAL method [112], as well as its causal graph (at the bottom left)

In the following, we introduce the causal learning mechanism. In CAL, it is realized by a structural causal model, i.e., a causal graph, which is a directed acyclic graph G = {N, E}. Each variable in the model has a corresponding node in N, while the causal links E describe how these variables interact with each other. As illustrated in Fig. 4.13, CAL uses nodes in the causal graph to represent the variables of the attention model, including the CNN feature maps (or the input image) X, the learned attention maps A, and the final prediction Y. The link X → A represents that the attention model takes the CNN feature maps as input and produces the corresponding attention maps, while (X, A) → Y indicates that the feature maps and attention maps jointly determine the final prediction. Based on the causal graph, we can analyze causalities by directly manipulating the values of several variables and observing the effect. Formally, this operation is termed intervention in the causal inference literature [108] and is denoted as do(·). CAL adopts counterfactual intervention [107, 138] to investigate the effects of the learned visual attention. The counterfactual intervention is achieved by an imaginary intervention altering the state of the variables assumed to be different [108, 138]. In the CAL case, it conducts the counterfactual intervention do(A = Ā) by imagining non-existent attention maps Ā that replace the learned attention maps, while keeping the feature maps X unchanged. It can then obtain the final prediction Y after the intervention A = Ā according to Eqs. (4.59) and (4.60) by

Y(do(A = Ā), X = X) = C([φ(X ∗ Ā_1), . . . , φ(X ∗ Ā_M)]),    (4.61)

where C is the classifier. In practice, CAL uses different options of attention, e.g., random attention, uniform attention, or reversed attention, as the counterfactuals. After that, following [107, 138], the actual effect of the learned attention on the prediction can be represented by the difference between the observed prediction Y(A = A, X = X) and its counterfactual alternative Y(do(A = Ā), X = X):

Y_{effect} = E_{Ā∼γ}\big[Y(A = A, X = X) − Y(do(A = Ā), X = X)\big],    (4.62)


where we denote the effect on the prediction as Yeffect and γ is the distribution of counterfactual attentions. Intuitively, the effectiveness of attention can be interpreted as how the attention improves the final prediction compared to wrong attention. Thus, CAL can use Yeffect to measure the quality of learned attention. Furthermore, CAL can use the metric of attention quality as a supervision signal to explicitly guide the attention learning process. The new objective can be formulated as L = Lce (Yeffect , y) + Lothers ,

(4.63)

where y is the classification label, Lce is the cross-entropy loss, and Lothers represents the original objective, e.g., standard classification loss.
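The counterfactual objective of Eqs. (4.61)–(4.63) only requires evaluating the same classifier head on features pooled with the learned attentions and with substituted (e.g., random) attentions. A minimal sketch follows, where `classifier` is any module mapping the concatenated part features to logits; the normalization step of Eq. (4.60) is omitted, and using a plain classification loss as L_others is an assumption.

```python
import torch
import torch.nn.functional as F

def cal_loss(x, attn, classifier, target, num_counterfactuals=1):
    """x: (C, H, W) feature maps; attn: (M, H, W) learned attention maps; target: (1,) label."""

    def pooled_logits(a):
        # attention-pooled part features, cf. Eq. (4.59), concatenated as in Eq. (4.60)
        parts = [(x * a[m]).mean(dim=(1, 2)) for m in range(a.shape[0])]
        return classifier(torch.cat(parts).unsqueeze(0))

    y_factual = pooled_logits(attn)
    y_effect = y_factual.clone()
    for _ in range(num_counterfactuals):                 # expectation in Eq. (4.62), random counterfactuals
        y_effect = y_effect - pooled_logits(torch.rand_like(attn)) / num_counterfactuals
    # Eq. (4.63): effect term plus an "other" objective (standard classification loss here)
    return F.cross_entropy(y_effect, target) + F.cross_entropy(y_factual, target)
```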

4.2.4 Other Methods

Many other approaches in the localization-classification paradigm have also been proposed for fine-grained recognition. Spatial Transformer Networks (STN) [63] were originally introduced to explicitly perform spatial transformations in an end-to-end learnable way. They can also be equipped with multiple transformers in parallel to conduct fine-grained recognition. Each transformer in an STN can correspond to a part detector with spatial transformation capabilities. Later, Wang et al. [147] developed a triplet of patches with geometric constraints as a template to automatically mine discriminative triplets and then generated midlevel representations for classification with the mined triplets. In addition, other methods have achieved better accuracy by introducing feedback mechanisms. Specifically, NTSNet [168] employs a multi-agent cooperative learning scheme to address the core problem of fine-grained recognition, i.e., accurately identifying informative regions in an image. M2DRL [48, 50] was the first to utilize deep reinforcement learning [66] at both the objectand part-level to capture multi-granularity discriminative localization and multi-scale representations using their tailored reward functions. Inspired by low-rank mechanisms in natural language processing [104], Wang et al. [150] proposed the DF-GMM framework to alleviate the region diffusion problem in high-level feature maps for fine-grained part localization. DF-GMM first selects discriminative regions from the high-level feature maps by constructing low-rank bases, and then applies spatial information of the low-rank bases to reconstruct low-rank feature maps. Part correlations can also be modeled by reorganization processing, which brings accuracy improvements. In the following, we introduce several representative works in detail.

4.2.4.1 STN

Fig. 4.14 Architecture of a spatial transformer module [63]

Convolutional Neural Networks are limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter-efficient manner. The Spatial Transformer Network (STN) [63] introduces a learnable module, i.e., the spatial transformer, which can be


used for fine-grained image recognition. The spatial transformer explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimization process. The use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping. More specifically, the spatial transformer mechanism is split into three parts, i.e., localization network, parameterized sampling grid, and differentiable image sampling, as shown in Fig. 4.14. In order of computation, first a localization network takes the input feature map, and through a number of hidden layers outputs the parameters of the spatial transformation that should be applied to the feature map, which gives a transformation conditional on the input. Then, the predicted transformation parameters are used to create a sampling grid, which is a set of points where the input map should be sampled to produce the transformed output. This is done by the grid generator. Finally, the feature map and the sampling grid are taken as inputs to the sampler, producing the output map sampled from the input at the grid points. These three components will be elaborately described as follows. The localization network takes the input feature map U ∈ R H ×W ×C with width W , height H and C channels and outputs θ , the parameters of the transformation T θ to be applied to the feature map: θ = f loc (U). The size of θ can vary depending on the transformation type that is parameterized. The localization network function f loc (·) can take any form, and include a final regression layer to produce the transformation parameters θ . To perform a warping of the input feature map, each output pixel is computed by applying a sampling kernel centered at a particular location in the input feature map. The output pixels are defined to lie on a regular grid G = {G i } of pixels G i = (xit , yit ), forming an output   feature map V ∈ R H ×W ×C , where H  and W  are the height and width of the grid, and C is the number of channels, which is the same in the input and output. Assume for the moment that T θ is a 2D affine transformation Aθ , the pointwise transformation is

\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix},    (4.64)

=

W H   n

c Unm k(xis − m; x )k(yis − n;  y ) ∀i ∈ [1 . . . H  W  ] ∀c ∈ [1 . . . C] ,

m

(4.65) where x and  y are the parameters of a generic sampling kernel k(·) which defines the c is the value at location (n, m) in channel c of the input, and V c is image interpolation, Unm i the output value for pixel i at location (xit , yit ) in channel c. The sampling is done identically for each channel of the input, so every channel is transformed in an identical way (this preserves spatial consistency between channels). In theory, any sampling kernel can be used, as long as (sub-)gradients can be defined with respect to xis and yis . For example, using the integer sampling kernel reduces Eq. (4.65) to Vic

V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c\, \delta(\lfloor x_i^s + 0.5 \rfloor - m)\, \delta(\lfloor y_i^s + 0.5 \rfloor - n),    (4.66)

where ⌊x + 0.5⌋ rounds x to the nearest integer and δ(·) is the Kronecker delta function. This sampling kernel equates to just copying the value at the nearest pixel to (x_i^s, y_i^s) to the output location (x_i^t, y_i^t). Alternatively, a bilinear sampling kernel can be used, giving

V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c\, \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|).    (4.67)
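In practice, the affine transform of Eq. (4.64) and the bilinear kernel of Eq. (4.67) are exactly what torch.nn.functional.affine_grid and grid_sample compute, so a spatial transformer can be sketched in a few lines. The localization network below is an arbitrary small CNN for illustration, not the architecture used in [63].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal STN sketch: localization net -> affine grid (Eq. 4.64) -> bilinear sampling (Eq. 4.67)."""

    def __init__(self, channels=3):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 8, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 6),
        )
        # initialize the regression layer to the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))

    def forward(self, u):                       # u: (B, C, H, W)
        theta = self.loc(u).view(-1, 2, 3)      # A_theta of Eq. (4.64)
        grid = F.affine_grid(theta, u.size(), align_corners=False)
        return F.grid_sample(u, grid, mode="bilinear", align_corners=False)
```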

To allow back-propagation of the loss through this sampling mechanism, the gradients with respect to U and G must be defined. For bilinear sampling, the partial derivatives are

\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_{n}^{H} \sum_{m}^{W} \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|),    (4.68)

\frac{\partial V_i^c}{\partial x_i^s} = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c\, \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases},    (4.69)

and similarly to Eq. (4.69) for \partial V_i^c / \partial y_i^s. The combination of the localization network, grid generator, and sampler forms a spatial transformer. This is a self-contained module which can be dropped into a CNN architecture at any point, and in any number, giving rise to STN. Placing spatial transformers within a CNN allows the network to learn how to actively transform the feature maps to help minimize the overall cost function of the network during training. It is also possible to use spatial transformers to downsample or oversample a feature map, as one can define the output dimensions H' and W' to be different from the input dimensions H and W. One can also have multiple spatial transformers in a CNN. Placing multiple spatial transformers at increasing depths of a network allows transformations of increasingly abstract representations, and also gives the localization networks potentially more informative representations to base the predicted transformation parameters on.

4.2.4.2 NTS-Net

Navigator-Teacher-Scrutinizer Network (NTS-Net) [168] was proposed as a self-supervision mechanism to localize informative regions without the need for bounding box or part annotations; it consists of a Navigator agent, a Teacher agent and a Scrutinizer agent. In consideration of the intrinsic consistency between the informativeness of the regions and their probability of belonging to the ground-truth class, NTS-Net proposes a training paradigm which enables the Navigator to detect the most informative regions under the guidance of the Teacher. After that, the Scrutinizer scrutinizes the proposed regions from the Navigator and makes predictions. The model can be viewed as a multi-agent cooperation, where agents benefit from each other and make progress together. NTS-Net can be trained end-to-end, while providing accurate fine-grained classification predictions as well as highly informative regions during inference. Concretely, all regions are assumed to be rectangles, and A denotes the set of all such regions in the given image. The information function I : A → (−∞, ∞) is defined to evaluate how informative a region R ∈ A is. The confidence function C : A → [0, 1] is defined as


a classifier to evaluate the confidence that the region belongs to ground-truth classes. More informative regions should have higher confidence, so the following condition should hold: for any R1 , R2 ∈ A, if C(R1 ) > C(R2 ), I(R1 ) > I(R2 ). Navigator network is used to approximate information function I and Teacher network to approximate confidence function C. They choose M regions A M in the region space A. For each region Ri ∈ A M , the Navigator network evaluates its informativeness I(Ri ), and the Teacher network evaluates its confidence C(Ri ). In order to satisfy previous condition, {I(R1 ), I(R2 ), . . . , I(R M )} and {C(R1 ), C(R2 ), . . . , C(R M )} have the same order by optimize Navigator network. Every proposed region is used to optimize Teacher by minimizing the cross-entropy loss between ground-truth class and the predicted confidence. As the Navigator network improves in accordance with the Teacher network, it will produce more informative regions to help the Scrutinizer network make better fine-grained classification result. Inspired by the idea of anchors, the Navigator network takes an image as input, and produces a bunch of rectangle regions {R1 , R2 , . . . R A }, each with a score denoting the informativeness of the region. For an input image X of size 448, choose anchors to have scales of {48, 96, 192} and ratios {1:1, 3:2, 2:3}, Navigator network will produce a list denoting the informativeness of all anchors. Then, the information list is sorted as in Eq. (4.70), where A is the number of anchors, I(Ri ) is the ith element in the sorted information list: I(R1 ) ≥ I(R2 ) ≥ · · · ≥ I(R A ) .

(4.70)
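The navigator step described around Eq. (4.70) is essentially "score all anchors, sort, apply NMS, keep the top M". A sketch using torchvision's NMS is given below; anchor generation itself is omitted, and the score head is assumed to output one informativeness value per anchor.

```python
import torch
from torchvision.ops import nms

def select_informative_regions(boxes, informativeness, iou_threshold=0.25, top_m=6):
    """boxes: (A, 4) anchor boxes as (x1, y1, x2, y2); informativeness: (A,) scores I(R_i)."""
    order = torch.argsort(informativeness, descending=True)      # ordering of Eq. (4.70)
    keep = nms(boxes[order], informativeness[order], iou_threshold)
    keep = order[keep][:top_m]                                    # top-M informative regions
    return boxes[keep], informativeness[keep]
```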

Non-maximum suppression (NMS) on the regions based on their informativeness reduces region redundancy. The top M informative regions {R_1, R_2, ..., R_M} are fed into the Teacher network to get the confidences {C(R_1), C(R_2), ..., C(R_M)}. The Navigator network is then optimized so that {I(R_1), I(R_2), ..., I(R_M)} and {C(R_1), C(R_2), ..., C(R_M)} have the same order. Every proposed region is used to optimize the Teacher by minimizing the cross-entropy loss between the ground-truth class and the predicted confidence. As the Navigator network gradually converges, it will produce informative, object-characteristic regions to help the Scrutinizer network make decisions. The Scrutinizer network is trained using the top K informative regions combined with the full image as input; those K regions are used to facilitate fine-grained recognition. Figure 4.15 demonstrates this process with K = 3.

Fig. 4.15 Inference process of the model [168]

In order to obtain correspondence between region proposals and feature vectors in the feature map, a fully-convolutional network is used as the feature extractor, without fully-connected layers. Specifically, the CNN feature extractor is a ResNet-50 [45] pre-trained on ImageNet [117], and the Navigator, Scrutinizer and Teacher networks all share parameters in the feature extractor. The feature extractor parameters are denoted as W. For an input image X, the extracted deep representations are denoted as X ⊗ W, where ⊗ denotes the combination of convolution, pooling, and activation operations. The network architecture consists of three parts. The first is the Navigator network. Inspired by the design of Feature Pyramid Networks (FPN) [87], it performs multi-scale region detection with a top-down architecture with lateral connections. The feature hierarchy is computed

layer by layer using convolution layers, followed by ReLU and max-pooling. Then, a series of feature maps with different spatial resolutions are obtained. The anchors in larger feature maps correspond to smaller regions. The Navigator network in Fig. 4.15 shows the sketch of the design. Using multi-scale feature maps from different layers can generate informativeness of regions among different scales and ratios. The parameters in the Navigator network are denoted as I (including shared parameters in the feature extractor). The second is the Teacher network approximates the mapping C : A → [0, 1] which denotes the confidence of each region. After receiving M scale-normalized (224 × 224) informative regions {R1 , R2 , . . . , R M } from the Navigator network, the Teacher network outputs confidence as teaching signals to help the Navigator network learn. In addition to the shared layers in the feature extractor, the Teaching network has a fully connected layer which has 2048 neurons. The parameters in the Teacher network are denoted as C for convenience. The third is the scrutinizer network. After receiving top K informative regions from the Navigator network, the K regions are resized to the pre-defined size and are fed into the feature extractor to generate those K regions’ feature vectors, each with length 2048. Then K features are concatenated with the features of the input image and fed into a fully-connected layer. The function S represents the combination of these transformations. The parameters in the Scrutinizer network are denoted as S . The M most informative regions predicted by Navigator network denote as R = {R1 , R2 , . . . , R M }, their informativeness as I = {I1 , I2 , . . . , I M }, and their confidence predicted by Teacher network as C = {C1 , C2 , . . . , C M }. Then, the navigation loss is defined as  f (Is − Ii ) , (4.71) LI (I , C) = (i,s):Ci 0 ∀i, and E x∼ px [ f (x)] = 0.  i are n-dimensional covariance matrices and μi is the mean feature vector for each class. The zero-mean implies that μ = m i=1 αi μi = 0. For this distribution, the equivalent covariance matrix can be given by Var[ f (x)] =

m  i=1

αi  i +

m  i=1

m       αi μi − μ μi − μ = αi  i + μi μi   ∗ . i=1

(4.98) characterize the Now, the eigenvalues λ1 , . . . , λn of the overall covariance matrix variance of the distribution across n dimensions. Since  ∗ is positive-definite, all eigenvalues are positive (this can be shown using the fact that each covariance matrix is itself ∗

90

4 Fine-Grained Image Recognition

  2 positive-definite, and diag μi μi k = μik ≥ 0 ∀i, k. Thus, to describe the variance of the feature distribution it defines Diversity. Definition 4.1 Let the data distribution be p x over space X, and the feature extractor be given by f (·). Then, the Diversity ν of the features is defined as ν (, p x ) 

n 

 λi , where {λ1 , . . . , λn } satisfy det  ∗ − λi I n = 0 .

i=1

This definition of diversity is consistent with multivariate analysis and is a common measure of the total variance of a data distribution [65]. Now, let p xL (·) denote the data distribution under a large-scale image classification task such as ImageNet, and let pxF (·) denote the data distribution under a fine-grained image classification task. Then, the finegrained problems can be characterized as data distributions p xF (·) for any feature extractor f (·) that have the property as     (4.99) ν , p xF  ν , p xL . By the Tikhonov regularization of a linear classifier [39], MaxEnt [30] would select w & &2 such that j &w j &2 is small (2 regularization), to get higher generalization performance. " C 2 It uses the following result to lower-bound the norm of the weights w2 = i=1 w i 2 in terms of the expected entropy and the feature diversity: Theorem 4.1 Let the final layer weights be denoted by w = {w 1 , . . . , wC }, the data distribution be p× over X, and feature extractor be given by f (·). For the expected conditional entropy, the following holds true: w2 ≥

log(C) − E x∼ px [H[ p(· | x; θ )]] . √ 2 ν (, p x )

In the case when ν (, p× )is large (ImageNet classification), this lower bound is very weak and inconsequential. However, in the case of small ν (, p x ) (fine-grained classification), the denominator is small, and this lower bound can subsequently limit the space of model selection, by only allowing models with large values of weights, leading to potential overfitting. It can be seen that if the numerator is small, the diversity of the features has a smaller impact on limiting the model selection, and hence, it can be advantageous to maximize prediction entropy. Note that since this is a lower bound, the proof is primarily expository. More intuitively, however, it can be understood that problems that are fine-grained will often require more information to distinguish between classes, and regularizing the prediction entropy prevents creating models that memorize a lot of information about the training

4.3

Recognition by End-to-End Feature Encoding

91

data, and thus can potentially benefit generalization. Now, Theorem 4.1 involves the expected conditional entropy over the data distribution. However, during training, it only has sample access to the data distribution, which can be used as a surrogate. It is essential to then ensure that the empirical estimate of the conditional entropy (from N training samples) is an accurate estimate of the true expected conditional entropy. The next result ensures that for large N , in a fine-grained classification problem, the sample estimate of average conditional entropy is close to the expected conditional entropy. Theorem 4.2 Let the final layer weights be denoted by w = {w1 , . . . , wC }, the data distribution be p× over X, and feature extractor be given by f (·). With probability at least 1 − δ > 21 and w∞ = max (w1 2 , . . . , wC 2 ), the following is obtained:   $ ED [H[ p(· | x; θ)]] − E x∼ p× [H[ p(· | x; θ)]] '  !   2 4 −0.75 + N ≤ w∞ . ν (, p x ) log N δ It can be seen that as long as the diversity of features is small, and N is large, the estimate for entropy will be close to the expected value. Using this result, Theorem 4.1 can be expressed in terms of the empirical mean conditional entropy. Corollary 4.1 With probability at least 1 − δ > 21 , the empirical mean conditional entropy follows: w2 ≥

log(C) − $ E x∼D [H[ p(· | x; θ)]] ! . "   √ 2 − N2 log 2δ ν (, p x ) −  N −0.75

It can be seen that the result is recovered from Theorem 4.1 as N → ∞. Corollary 4.1 shows that as long as the diversity of features is small, and N is large, the same conclusions drawn from Theorem 4.1 apply in the case of the empirical mean entropy as well.

4.3.2.2 MAMC Sun et al. [128] proposed an attention-based convolutional neural network which regulates multiple object parts among different input images. The method first learns multiple attention region features of each input image through the One-Squeeze Multi-Excitation (OSME) module, and then applies the Multi-Attention Multi-Class constraint (MAMC) in a metric learning framework. For each anchor feature, the MAMC functions pull same-attention same-class features closer, while pushing different-attention or different-class features away. The method can be easily trained end-to-end and is efficient which requires only one training stage.

Class 1

m1

S1

FC

W 12

Class 2

Attention 1

W 11

FC

z

Sigmoid

Class 2

FC

Class 1

ReLU

4 Fine-Grained Image Recognition Global pooling

92

f1

MAMC loss

FC

z

S2

FC

W 22

ReLU

Global pooling

τ Input image pairs

FC

W 21

Attention 2

U

Sigmoid

Conv

x

Conv

W 31

f2

Combined softmax loss

W 32

m2

OSME module

Fig. 4.23 Overview framework of the MAMC method [128]. Here it visualizes the case of learning two attention branches given a training batch with four images of two classes. The MAMC and softmax losses would be replaced by a softmax layer during testing. Unlike hard-attention methods like [32], the method does not explicitly crop the parts out. Instead, the feature maps (S1 and S2 ) generated by the two branches provide soft responses for attention regions such as the birds’ head or torso, respectively

As shown in Fig. 4.23, the framework is a feedforward neural network where each image    is first processed by a base network, e.g., ResNet-50. Let X ∈ RW ×H ×C denote the input fed into the last residual block τ . In the literature, the goal of SENet [58] is to re-calibrate the output feature map U = τ (x) = [u1 , . . . , uc ] ∈ RW ×H ×C ,

(4.100)

through a pair of squeeze-and-excitation operations. In order to generate P attentionspecific feature maps, [128] extends the idea of SENet by performing one-squeeze but multi-excitation operations. In the first one-squeeze step, it aggregates the feature map U across spatial dimensions W × H to produce a channel-wise descriptor z = [z 1 , . . . , z C ] ∈ RC . The global average pooling is adopted as a simple but effective way to describe each channel statistic by zc =

W H 1  uc (w, h) . WH

(4.101)

w=1 h=1

In the second multi-excitation step, a gating mechanism is independently employed on z for each attention p = 1, . . . , P as p

p

p

p

m p = σ (W 2 δ(W 1 z)) = [m 1 , . . . , m C ] ∈ RC ,

(4.102)

where σ and δ refer to the Sigmod and ReLU functions, respectively. It adopts the same design of SENet by forming a pair of dimensionality reduction and increasing layers parameterized C C p p with W 1 ∈ R r ×C and W 2 ∈ RC× r . Because of the property of the Sigmod function, each m p encodes a non-mutually-exclusive relationship among channels. Therefore, it is utilized to re-weight the channels of the original feature map U by

4.3

Recognition by End-to-End Feature Encoding

93

S p = [m 1 u1 , . . . , m C uC ] ∈ RW ×H ×C p

p

(4.103)

To extract attention-specific features, it feeds each attention map S p to a fully connected p layer W 3 ∈ R D×W H C as  p (4.104) f p = W 3 vec S p ∈ R D , where the operator vec(·) flattens a matrix into a vector. Furthermore, the MAMC method explores much richer correlations of object parts by the proposed multi-attention multi-class constraint (MAMC). More specifically, we suppose that a set of training images {(x, y), . . .} of K fine-grained classes is given, where y = 1, . . . , K denotes the label associated with the image x. To model both the within-image and inter-class N , by attention relations, the method constructs each training batch, i.e., B = {(x i , x i+ , yi )}i=1 + sampling N pairs of images similar to [124]. For each pair (x i , x i ) of class yi , the OSME p p+ module extracts P attention features { f i , f i } Pp=1 from multiple branches according to Eq. (4.104). Given 2N samples in each batch (cf. Fig. 4.23a), the intuition comes from the natural clustering of the 2N P features (cf. Fig. 4.23b) extracted by the OSME modules. By picking p f i , which corresponds to the i-th class and p-th attention region as the anchor, the rest features can be divided into four groups (Fig. 4.24):  p ( p+ ) – Same-attention same-class features, Ssasc f i = f i ;  p ( p p+ ) – Same-attention different-class features, Ssadc f i = f j , f j ; j=i  p ( q q+ ) ; – Different-attention same-class features, Sdasc f i = f i , f i q= p  p ( q q+ ) – Different-attention different-class features Sdadc f i = f j , f j

j=i,q= p

fip

Anchor (A )

Attention 1 Attention P Attention 1 Attention P

f11

...

Class N xN

x+ N

1 fN

...

f1P

,

f11+

...

P fN

,

1+ fN

...

(a) Input image

fip+

Positive ( P )

f1P +

...

...

x+ 1 ...

Class 1 x1

(b) OSME

P+ fN

Negative ( N )

(c) MAMC

.

fjp

fiq

fjp+

fiq+

j = i

q = p

fjp

fiq

fjq

fjq

fjp+

fiq+

fjq+

fjq+

j = i q = p j = i q = p

j = i q = p

Fig. 4.24 Data hierarchy in training [128]. a Each batch is composed of 2N input images in N -pair style. b OSME extracts P features for each image according to Eq. (4.104). c The group of features p for three MAMC constraints by picking one feature f i as the anchor


Fig. 4.25 Feature embedding of a synthetic batch [128]. a Initial embedding before learning. b The resulting embedding after applying Eq. (4.105). c The final embedding after enforcing Eqs. (4.106) and (4.107). See text for more details

The goal is to excavate the rich correlations among the four groups in a metric learning framework. As summarized in Fig. 4.23c, the method composes three types of triplets according to the choice of the positive set for the anchor f_i^p. To keep the notation concise, it omits f_i^p in the following equations.

Same-attention same-class positives. The most similar feature to the anchor f_i^p is f_i^{p+}, while all the other features should have a larger distance to the anchor. The positive and negative sets are then defined as

P_sasc = S_sasc,  N_sasc = S_sadc ∪ S_dasc ∪ S_dadc.   (4.105)

Same-attention different-class positives. The features from different classes but extracted from the same attention region should be more similar to the anchor than the ones also from different attentions, i.e.,

P_sadc = S_sadc,  N_sadc = S_dadc.   (4.106)

Different-attention same-class positives. Similarly, for the features from the same class but extracted from different attention regions, the following is obtained:

P_dasc = S_dasc,  N_dasc = S_dadc.   (4.107)

For any combination of positive set P ∈ {P_sasc, P_sadc, P_dasc} and negative set N ∈ {N_sasc, N_sadc, N_dasc}, the anchor is expected to be closer to the positive than to any negative by a distance margin m > 0, i.e.,

‖f_i^p − f^+‖^2 + m ≤ ‖f_i^p − f^−‖^2,  ∀ f^+ ∈ P, f^− ∈ N.   (4.108)

To better understand the three constraints, let us consider the synthetic example of six feature points shown in Fig. 4.25. In the initial state (i.e., Fig. 4.25a), the S_sasc feature point (marked with a green hexagon) stays further away from the anchor f_i^p at the center than the others. After applying the first constraint (Eq. (4.105)), the underlying feature space is transformed to Fig. 4.25b, where the S_sasc positive point has been pulled towards the anchor. However, the four negative features (cyan rectangles and triangles) are still in disordered positions. In fact, S_sadc and S_dasc should be considered as the positives compared to S_dadc given the anchor. By further enforcing the second (Eq. (4.106)) and third (Eq. (4.107)) constraints, a better embedding can be achieved in Fig. 4.25c, where S_sadc and S_dasc are regularized to be closer to the anchor than the ones of S_dadc. To enforce the triplet constraint in Eq. (4.108), a common approach is to minimize the following hinge loss:

[‖f_i^p − f^+‖^2 − ‖f_i^p − f^−‖^2 + m]_+.   (4.109)

Despite being broadly used, optimizing Eq. (4.109) using standard triplet sampling leads to slow convergence and unstable performance in practice. Inspired by the recent advance in metric learning, the method enforces each of the three constraints by minimizing the N -pair loss [124],

L_np = (1/N) ∑_{f_i^p ∈ B} log( 1 + ∑_{f^+ ∈ P, f^− ∈ N} exp(f_i^{p⊤} f^− − f_i^{p⊤} f^+) ).   (4.110)

In general, for each training batch B, MAMC jointly minimizes the softmax loss and the N-pair loss with a weight parameter λ as

L_mamc = L_softmax + λ (L_sasc^np + L_sadc^np + L_dasc^np).   (4.111)

Given a batch of N images and P parts, MAMC is able to generate 2(PN − 1) + 4(N − 1)^2(P − 1) + 4(N − 1)(P − 1)^2 constraints of three types (cf. Eq. (4.105) to Eq. (4.107)), while the N-pair loss can only produce N − 1. To put it in perspective, this method is able to generate 130× more constraints than the N-pair loss with the same data under the normal setting where P = 2 and N = 32. This implies that MAMC leverages much richer correlations among the samples, and is able to obtain better convergence than either the triplet or N-pair loss.
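
To illustrate how the grouping and the N-pair formulation of Eq. (4.110) fit together, the sketch below accumulates an N-pair-style loss for the same-attention same-class constraint of Eq. (4.105) over a batch of OSME features; handling only this first constraint, omitting the softmax term of Eq. (4.111), and the default weight lam are simplifications of this sketch, not properties of the original MAMC method.

```python
import torch

def npair_style_loss(anchor, positives, negatives):
    """Soft N-pair term of Eq. (4.110) for one anchor.
    anchor: (D,), positives: (n_pos, D), negatives: (n_neg, D)."""
    pos = positives @ anchor                        # f^T f^+
    neg = negatives @ anchor                        # f^T f^-
    diff = neg.unsqueeze(0) - pos.unsqueeze(1)      # (n_pos, n_neg)
    return torch.log1p(diff.exp().sum())

def mamc_sasc_loss(features, labels, lam=0.5):
    """Same-attention same-class constraint (Eq. (4.105)) in N-pair form.
    features: (2N, P, D) OSME features of an N-pair batch; labels: (2N,)."""
    N2, P, D = features.shape
    loss = features.new_zeros(())
    for i in range(N2):
        for p in range(P):
            anchor = features[i, p]
            same_cls = labels == labels[i]
            pos = features[same_cls, p].reshape(-1, D)       # S_sasc (plus anchor)
            pos = pos[(pos - anchor).abs().sum(1) > 0]       # drop the anchor itself
            other_attn = torch.arange(P, device=features.device) != p
            neg = torch.cat([features[~same_cls].reshape(-1, D),          # S_sadc, S_dadc
                             features[same_cls][:, other_attn].reshape(-1, D)])  # S_dasc
            if len(pos) and len(neg):
                loss = loss + npair_style_loss(anchor, pos, neg)
    return lam * loss / (N2 * P)
```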

4.3.2.3 DAM

The Discrimination-Aware Mechanism (DAM) [161] was designed to extract more cues for better discriminative representations w.r.t. fine-grained image recognition. Inspired by random erasing [186], DAM designs a guided erasure mechanism to encourage the model to iteratively learn new discriminative cues. To be specific, it obtains instructive input signals by exploiting the difference between classes. Intuitively, the more feature elements in which two features differ, the larger the overall difference between them. Therefore, highly discriminative inter-class features can be achieved by constraining as many feature elements as possible to be different, thereby prompting the model to mine more plentiful cues for distinguishing examples.


In order to make the feature elements of different classes as different as possible, the elements of each feature are split into two sets at every iteration during the training process. One set, with low-discriminativeness elements, is iteratively optimized, and the other set, whose elements already have high discriminativeness, is erased. With the help of this erasing mechanism, new gated features are obtained to further optimize the model parameters. From a feature-space perspective, DAM projects the training samples from the original feature space into a space where each sample is more difficult to distinguish. The entire generation process of the new features is shown in Fig. 4.26.

To determine which elements should be erased, DAM [161] utilizes the differences between classes, which can be obtained by calculating the differences between the class centers. Specifically, when the softmax cross-entropy loss is used to supervise the classification task, the weight matrix w ∈ R^{C×D} of the last fully connected layer (where C is the class number and D is the element number of a feature) serves as a set of proxies that pull intra-class samples close and push inter-class samples apart. Mathematically, a feature f_i is projected onto all weight vectors [w_1, ..., w_C] to determine its class, where w_i is a D-dim vector as shown in Fig. 4.26a. During training, the softmax cross-entropy loss optimizes the classification objective by lengthening the projection of f_i on the true class weight w_{y_i}. In other words, the more similar each dimension of the feature is to the class weight, the longer the projection. Therefore, the class weights can represent the average feature of the intra-class samples (i.e., the class center), and the difference between class weights can represent the difference between classes.

Fig. 4.26 Schematic illustration of DAM [161] to generate new gated features. w_i and w_j are D-dim weight vectors in the last FC layer

The class difference W_{i,j} between class c_i and c_j can be defined as

W_{i,j} = |w_i − w_j|,   (4.112)

where |·| takes the absolute value of each vector dimension. W_{i,j} is a D-dim vector, which indicates the significance of each element in the features for distinguishing class c_i and c_j: the higher the value, the more discriminative that feature element is between the two classes. With a representation of the differences between the two classes, it is possible to decide which elements of the features for the two classes need to be further optimized; when the value of the difference is smaller, that element has lower discriminativeness between the two classes. To select effective feature elements, the method designs a gate mechanism to cut off unnecessary parts. Specifically, it uses the mean value of the difference to measure effectiveness and, to increase flexibility, sets an adjustable parameter λ. As shown in Fig. 4.26b, c, the new gated difference weight T_{i,j} for class c_i and c_j can be defined as

T_{i,j}^k = { 1, if W_{i,j}^k < λ W̄_{i,j};  0, if W_{i,j}^k ≥ λ W̄_{i,j} },   (4.113)

where W̄_{i,j} = (1/D) ∑_{l=1}^{D} W_{i,j}^l, and W_{i,j}^k expresses the significance of the k-th element for distinguishing class c_i and c_j. After getting T_{i,j}, a new gated feature embedding F_{i,j}, which has low discriminativeness w.r.t. samples of class c_j, can be obtained from the original feature f_i as shown in Fig. 4.26d, e. The new gated F_{i,j} can be defined as:

F_{i,j} = f_i × T_{i,j},   (4.114)

where "×" represents element-wise multiplication. Besides, for a feature f_i, the set of its elements that cannot be distinguished from all other classes needs further optimization as well. To represent the discriminativeness of the feature dimensions, the average difference between feature f_i and all other classes is measured. The average difference of f_i can be defined as

W_{i,all} = (1/(C − 1)) ∑_{j=1, j≠i}^{C} W_{i,j}.   (4.115)

By using the same gate mechanism as for a specific class (cf. Eq. (4.113)), the gated difference T_{i,all} of the feature f_i w.r.t. all other classes can be obtained. Then, the low-discriminativeness F_{i,all} for f_i can be achieved by

F_{i,all} = f_i × T_{i,all}.   (4.116)

Once the discrimination-aware mechanism generates new gated features, they can be fed into proxy-based and pair-based losses for further model optimization.
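
A minimal sketch of the gating procedure of Eqs. (4.112)–(4.116) is given below; it assumes the last-layer weights are available as a plain (C, D) tensor and uses an illustrative threshold scale lam.

```python
import torch

def dam_gated_features(f, w, y, lam=1.0):
    """Gate low-discriminativeness elements following Eqs. (4.112)-(4.116).

    f: (N, D) batch features; w: (C, D) last-FC weights used as class proxies;
    y: (N,) ground-truth class indices.  Returns F_{i,all} for every sample."""
    N, D = f.shape
    C = w.shape[0]
    gated = torch.empty_like(f)
    for i in range(N):
        ci = y[i]
        others = torch.arange(C, device=w.device) != ci
        # W_{i,all}: |w_i - w_j| averaged over all other classes, Eqs. (4.112), (4.115)
        W_all = (w[ci].unsqueeze(0) - w[others]).abs().mean(dim=0)      # (D,)
        # gate T_{i,all}: keep elements whose difference is below lam * mean, Eq. (4.113)
        T_all = (W_all < lam * W_all.mean()).float()
        gated[i] = f[i] * T_all                                         # Eq. (4.116)
    return gated
```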


Discrimination-Aware Mechanism for Proxy-Based Loss. For the proxy-based loss function, multiple proxies (the weights w_i of the last fully connected layer) are adopted to optimize the model parameters. The softmax cross-entropy-based discrimination-aware mechanism (SCEDAM) can be rewritten as

L_SCEDAM = −(1/(N × C)) ∑_{i=1}^{N} ∑_{k=1}^{C} y_i^l log( e^{F_{i,k} · w_k} / ∑_{j=1}^{C} e^{F_{i,k} · w_j} ),   (4.117)

where

F_{i,k} = { F_{i,all}, if k = arg max(y_i);  F_{i,k} of Eq. (4.114), if k ≠ arg max(y_i) },   (4.118)

where N is the batch size, and y_i is the label of image x_i, which is a C-dim one-hot vector; y_i^l = 1 indicates that the label of sample x_i is l.

Discrimination-Aware Mechanism for Pair-Based Loss. For the pair-based loss, a triplet loss (e.g., TriHard [51]) is chosen as the baseline. A triplet (x_a, x_p, x_n) has a shared anchor x_a, where x_a and x_p form a positive pair, while x_a and x_n form a negative pair. For the triplet loss, the feature is directly used to calculate the distance, and the optimization goal is achieved by pulling the distances within the same class close and pushing the distances between different classes far away. The TriHardDAM loss function can be rewritten as

L_TriHardDAM = (1/N) ∑_{a=1}^{N} [ d(¬F_{a,all}, ¬F_{p,all}) − d(F_{a,n}, F_{n,a}) + α ]_+,   (4.119)

where [·]_+ = max(·, 0), α is a pre-defined margin, and d(·, ·) is the Euclidean distance. ¬F_{a,all} and ¬F_{p,all} express the high-discriminativeness elements of features f_a and f_p; to be specific, the gate mechanism sets the feature elements with large intra-class differences to 1 and the other elements to 0.

Discrimination-Aware Mechanism-Based Loss. By using the informative features of the proxy-based and pair-based losses, the overall loss using DAM can be defined as

L_DAM = μ L_SCEDAM + ν L_TriHardDAM,   (4.120)

where μ and ν are two adjustable parameters, representing the weights of L_SCEDAM and L_TriHardDAM, respectively. The architecture supervised by the DAM-based loss is illustrated in Fig. 4.27. Following common practice, a CNN backbone is used to extract features, and the features are input to the proxy-based and pair-based losses to optimize the model parameters. When the losses converge, a feature extractor for inference is obtained. The inputs x_a, x_p, x_n from one training batch include samples from the same class and from different classes, where x_a and x_p are from the same class, and x_a and x_n are from different classes. During testing, only the feature embeddings are used for similarity comparison between samples.


Fig. 4.27 The framework of representation learning with the discrimination-aware mechanism [161]. It uses the weights of the last FC layer to represent the class centers, and the differences between class centers to determine the discriminativeness of feature elements. Then, the new gated features are input to the proxy-based and pair-based losses for further parameter optimization

4.3.3 Other Methods

Beyond modeling the interactions between higher-order features and designing novel loss functions, another set of approaches involves constructing fine-grained tailored auxiliary tasks for obtaining unified and discriminative image representations. BGL [188] was proposed to incorporate rich bipartite-graph labels into CNN training to model the important relationships among fine-grained classes. DCL [13] performed a “destruction and construction” process to enhance the difficulty of recognition to guide the network to focus on discriminative parts for fine-grained recognition (i.e., by destruction learning) and then model the semantic correlation among parts of the object (i.e., by construction learning). Similar to DCL, Du et al. [28] tackled fine-grained representation learning using a jigsaw puzzle generator proxy task to encourage the network to learn at different levels of granularity and simultaneously fuse features at these levels together. Recently, a more direct fine-grained feature learning method [160] was formulated with the goal of generating identity-preserved fine-grained images in an adversarial learning manner to directly obtain a unified fine-grained image representation. The authors showed that this direct feature learning approach not only preserved the identity of the generated images, but also significantly boosted the visual recognition performance in other challenging tasks like fine-grained few-shot learning [153]. In the following, we elaborate several representative works.

4.3.3.1 BGL

The BGL approach [188] was proposed to exploit rich relationships through bipartite-graph labels for fine-grained recognition. Given a food image, BGL is able to tell which dish from which restaurant the food belongs to, which is an ultra-fine-grained image recognition task.


Suppose there is a set of n images X = {(x, y), ...} for training, where each image x is annotated with one of k fine-grained labels, y ∈ {1, ..., k}. Let x ∈ R^d denote the input feature of the last fully-connected layer, which generates k scores f ∈ R^k through a linear function f = W^⊤ x defined by the parameters W ∈ R^{d×k}. In a nutshell, the last layer of the CNN is to minimize the negative log-likelihood over the training data, i.e.,

min_W ∑_{(x,y)∈X} − log P(y | x, W),   (4.121)

where the softmax score

P(i | x, W) = e^{f_i} / ∑_{j=1}^{k} e^{f_j} ≐ p_i   (4.122)

encodes the posterior probability of image x being classified as the i-th fine-grained class. Given m types of coarse classes, where each type j contains k_j coarse classes, BGL models their relations with the k fine-grained classes as m bipartite graphs grouped in a star-like structure. Take Fig. 4.28 for instance, where the three types of coarse classes form three separate bipartite graphs with the fine-grained Tofu dishes, and there is no direct connection among the three types of coarse classes. For each graph of coarse type j, BGL encodes its bipartite structure in a binary association matrix G_j ∈ {0, 1}^{k×k_j}, whose element g_{ic_j}^j = 1 if the i-th fine-grained label is connected with coarse label c_j. As will become clear later, this star-like composition of bipartite graphs enables BGL to perform exact inference as opposed to the use of other general label graphs (e.g., [24]). To generate the scores f_j = W_j^⊤ x ∈ R^{k_j} for coarse classes of type j, we augment the last fully-connected layer with m additional variables {W_j}_j, where W_j ∈ R^{d×k_j}. Given an input image x of the i-th fine-grained class, BGL models its joint probability with any m coarse labels {c_j}_j as

Fig. 4.28 Illustration of three ultra-fine-grained classes (middle) [188]: Mapo Tofu of Restaurant A, Salt Pepper Tofu of Restaurant B, and Mapo Tofu of Restaurant B. Their relationships can be modeled through three bipartite graphs, fine-grained classes versus general food dishes (left) and fine-grained classes versus two ingredients (right)



P(i, {c_j}_j | x, W, {W_j}_j) = (1/z) e^{f_i} ∏_{j=1}^{m} g_{ic_j}^j e^{f_{c_j}^j},   (4.123)

where z is the partition function computed as

z = ∑_{i=1}^{k} e^{f_i} ∏_{j=1}^{m} ∑_{c_j=1}^{k_j} g_{ic_j}^j e^{f_{c_j}^j}.   (4.124)

At first glance, computing z seems infeasible in practice. Because of the bipartite structure of the label graph, however, we can denote the non-zero element in the i-th row of G_j as φ_i^j = c_j where g_{ic_j}^j = 1. With this auxiliary function, the computation of z can be simplified as

z = ∑_{i=1}^{k} e^{f_i} ∏_{j=1}^{m} e^{f_{φ_i^j}^j}.   (4.125)

Compared to general CRF-based methods (e.g., [24]) with an exponential number of possible states, the complexity O(km) of computing z in BGL through Eq. (4.125) scales linearly with respect to the number of fine-grained classes (k) as well as the number of types of coarse labels (m). Given z, the marginal posterior probabilities over fine-grained and coarse labels can be computed as

P(i | x, W, {W_j}_j) = (1/z) e^{f_i} ∏_{j=1}^{m} e^{f_{φ_i^j}^j} ≐ p_i,   (4.126)

P(c_j | x, W, {W_l}_l) = (1/z) ∑_{i=1}^{k} g_{ic_j}^j e^{f_i} ∏_{l=1}^{m} e^{f_{φ_i^l}^l} ≐ p_{c_j}^j.   (4.127)
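
Under the star-like bipartite structure, the partition function of Eq. (4.125) and the marginals of Eqs. (4.126)–(4.127) can be computed in linear time, as in the following sketch; the tensor layout and the auxiliary index tensor phi encoding φ_i^j are assumptions of this sketch.

```python
import torch

def bgl_probabilities(f, f_coarse, phi):
    """BGL inference following Eqs. (4.125)-(4.127).

    f: (k,) fine-grained scores; f_coarse: list of m tensors, the j-th of shape
    (k_j,) holding the coarse scores of type j; phi: (k, m) long tensor with
    phi[i, j] = the single coarse label of type j connected to fine class i."""
    k, m = phi.shape
    # log of each term e^{f_i} * prod_j e^{f^j_{phi_i^j}} of Eq. (4.125)
    log_terms = f.clone()
    for j in range(m):
        log_terms = log_terms + f_coarse[j][phi[:, j]]
    log_z = torch.logsumexp(log_terms, dim=0)          # partition function, Eq. (4.125)
    p_fine = (log_terms - log_z).exp()                 # fine-grained marginals, Eq. (4.126)
    p_coarse = []
    for j in range(m):
        k_j = f_coarse[j].numel()
        # sum the probabilities of all fine classes connected to each coarse label, Eq. (4.127)
        p_coarse.append(torch.zeros(k_j).index_add_(0, phi[:, j], p_fine))
    return p_fine, p_coarse
```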

As discussed before, one of the difficulties in training a CNN is the possibility of overfitting. One common solution is to add an ℓ2 weight decay term, which is equivalent to sampling the columns of W from a Gaussian prior. Given the connection among fine-grained and coarse classes, BGL provides another natural hierarchical prior for sampling the weights by

P(W, {W_j}_j) = ∏_{i=1}^{k} ∏_{j=1}^{m} ∏_{c_j=1}^{k_j} e^{−(λ/2) g_{ic_j}^j ‖w_i − w_{c_j}^j‖^2} ≐ p_w.   (4.128)

This prior expects w_i and w_{c_j}^j to have a small distance if the i-th fine-grained label is connected to coarse class c_j of type j. Notice that this idea is a generalized version of the one proposed in [126]. However, [126] only discussed a special type of coarse label (big class), while BGL can handle much more general coarse labels such as multiple attributes. In summary, given the training data X and the graph label defined by {G_j}_j, the last layer of a CNN with BGL aims to minimize the joint negative log-likelihood with proper regularization over the weights by


min_{W, {W_j}_j} ∑_{(x,y)∈X} ( − log p_y − ∑_{j=1}^{m} log p_{φ_y^j}^j ) − log p_w.   (4.129)

4.3.3.2 DCL

A delicate feature representation of object parts plays a critical role in fine-grained recognition. For example, experts can distinguish fine-grained objects relying only on object parts according to professional knowledge. The Destruction and Construction Learning (DCL) [13] method was proposed to enhance the difficulty of fine-grained recognition and exercise the classification model to acquire expert knowledge. Besides the standard classification backbone network, another destruction-and-construction stream is introduced to carefully destruct and then reconstruct the input image, for learning discriminative regions and features. As shown in Fig. 4.29, DCL consists of four parts. (1) Region Confusion Mechanism: a module to shuffle the local regions of the input image. (2) Classification Network: the backbone classification network that classifies images into fine-grained categories. (3) Adversarial Learning Network: an adversarial loss is applied to distinguish original images from destructed ones. (4) Region Alignment Network: appended after the classification network to recover the spatial layout of local regions.

As shown in Fig. 4.30, the Region Confusion Mechanism (RCM) is designed to disrupt the spatial layout of local image regions. Given an input image I, RCM first uniformly partitions the image into N × N sub-regions denoted by R_{i,j}, where i and j are the horizontal and vertical indices respectively and 1 ≤ i, j ≤ N. Inspired by [75], RCM shuffles these partitioned local regions within their 2D neighborhood. For the j-th row of R, a random vector q_j of size N is generated, whose i-th element is q_{j,i} = i + r, where r ∼ U(−k, k) is a random variable following a uniform distribution in the range [−k, k]. Here, k is a tunable parameter (1 ≤ k < N) defining the neighborhood range. Then, we can get a new permutation σ_j^row of the regions in the j-th row by sorting the array q_j, verifying the condition

Fig. 4.29 Overall framework of the proposed DCL method [13]

Fig. 4.30 Example images for fine-grained recognition (top) and the corresponding destructed images by RCM (bottom) in DCL [13]

∀i ∈ {1, ..., N}, |σ_j^row(i) − i| < 2k.   (4.130)

Similarly, we apply the permutation σ_i^col to the regions column-wise, verifying the condition

∀j ∈ {1, ..., N}, |σ_i^col(j) − j| < 2k.   (4.131)

Therefore, the region at (i, j) in the original image is placed at a new coordinate

σ(i, j) = (σ_j^row(i), σ_i^col(j)).   (4.132)
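
The following sketch implements a region-confusion shuffle in the spirit of Eqs. (4.130)–(4.132); it applies the row permutations first and then the column permutations so that the resulting mapping is a valid rearrangement, and the values N = 7 and k = 2 are illustrative rather than the settings of [13].

```python
import torch

def region_confusion(image, N=7, k=2):
    """Shuffle the N x N sub-regions of an image within a 2k neighbourhood.

    image: (C, H, W) with H and W divisible by N.  Returns the destructed
    image and, for every original region (i, j), its new coordinate, which
    serves as the ground truth of the region alignment network."""
    def neighbour_perm(n, k):
        # sort q_i = i + r with r ~ U(-k, k); the resulting permutation moves
        # every index by less than 2k positions (Eqs. (4.130)-(4.131))
        q = torch.arange(n, dtype=torch.float) + (2 * k) * torch.rand(n) - k
        perm = torch.empty(n, dtype=torch.long)
        perm[torch.argsort(q)] = torch.arange(n)   # perm[i] = new position of i
        return perm

    C, H, W = image.shape
    h, w = H // N, W // N
    # (C, N, h, N, w): dim 1 is the vertical index j, dim 3 the horizontal index i
    patches = image.reshape(C, N, h, N, w)
    destructed = torch.empty_like(patches)
    sigma = torch.empty(N, N, 2, dtype=torch.long)
    row_perms = [neighbour_perm(N, k) for _ in range(N)]   # sigma_j^row
    col_perms = [neighbour_perm(N, k) for _ in range(N)]   # sigma_i^col
    for j in range(N):
        for i in range(N):
            ni = row_perms[j][i].item()                    # horizontal move within row j
            nj = col_perms[ni][j].item()                   # vertical move within column ni
            destructed[:, nj, :, ni, :] = patches[:, j, :, i, :]
            sigma[i, j] = torch.tensor([ni, nj])           # new coordinate of region (i, j)
    return destructed.reshape(C, H, W), sigma
```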

This shuffling method destructs the global structure and ensures that each local region jitters inside its neighborhood with a tunable size. The original image I, its destructed version φ(I), and its ground-truth one-vs-all label y indicating the fine-grained categories are coupled as ⟨I, φ(I), y⟩ for training. The classification network maps the input image into a probability distribution vector C(I, θ_cls), where θ_cls are all learnable parameters in the classification network. The loss function of the classification network L_cls can be written as

L_cls = − ∑_{I∈D} y · log[C(I) C(φ(I))],   (4.133)

where D is the image set for training. Since the global structure has been destructed, to recognize these randomly shuffled images the classification network has to find the discriminative regions and learn the delicate differences among categories.

Destructing images with RCM does not always bring beneficial information for fine-grained classification. For example, in Fig. 4.30, RCM also introduces noisy visual patterns as the local regions are shuffled. Features learned from these noisy visual patterns are harmful to the classification task. To this end, we propose another adversarial loss L_adv to prevent the RCM-induced noise patterns from creeping into the feature space. Considering the original images and the destructed ones as two domains, the adversarial loss and the classification loss work in an adversarial manner to (1) keep domain-invariant patterns, and (2) reject domain-specific patterns between I and φ(I).


RCM labels each image with a one-hot vector d ∈ {0, 1}^2 indicating whether the image is destructed or not. A discriminator can be added as a new branch in the framework to judge whether an image I is destructed or not by

D(I, θ_adv) = softmax(θ_adv C(I, θ_cls^{[1,m]})),   (4.134)

where C(I, θ_cls^{[1,m]}) is the feature vector extracted from the outputs of the m-th layer in the backbone classification network, θ_cls^{[1,m]} are the learnable parameters from the first layer to the m-th layer in the classification network, and θ_adv ∈ R^{d×2} is a linear mapping. The loss of the discriminator network L_adv can be computed as

L_adv = − ∑_{I∈D} ( d · log[D(I)] + (1 − d) · log[D(φ(I))] ).   (4.135)

Given an image I and its corresponding destructed version φ(I), the region R_{i,j} located at (i, j) in I is consistent with the region R_{σ(i,j)} in φ(I). The region alignment network works on the output features of one convolution layer of the classification network, C(·, θ_cls^{[1,n]}), where the n-th layer is a convolution layer. The features are processed by a 1 × 1 convolution to obtain outputs with two channels. Then, the outputs are handled by a ReLU and an average pooling to get a map of size 2 × N × N. The outputs of the region alignment network can be written as

M(I) = h(C(I, θ_cls^{[1,n]}), θ_loc),   (4.136)

where the two channels in M(I) correspond to the location coordinates of rows and columns, respectively, h is the proposed region alignment network, and θ_loc are the parameters of the region alignment network. We denote the predicted location of R_{σ(i,j)} in φ(I) as M_{σ(i,j)}(φ(I)), and the predicted location of R_{i,j} in I as M_{i,j}(I). The ground truth of both M_{σ(i,j)}(φ(I)) and M_{i,j}(I) should be (i, j). The region alignment loss L_loc is defined as the ℓ1 distance between the predicted coordinates and the original coordinates, which can be expressed as

L_loc = ∑_{I∈D} ∑_{i=1}^{N} ∑_{j=1}^{N} ( ‖ M_{σ(i,j)}(φ(I)) − [i, j]^⊤ ‖_1 + ‖ M_{i,j}(I) − [i, j]^⊤ ‖_1 ).   (4.137)

Finally, the classification, adversarial, and region alignment losses are trained in an end-to-end manner, in which the network can leverage both enhanced local details and well-modeled object-part correlations for fine-grained recognition. Specifically, DCL aims to minimize the following objective:

L = α L_cls + β L_adv + γ L_loc.   (4.138)

4.3.3.3 PMG

Du et al. [28] proposed a Progressive Multi-Granularity (PMG) training strategy of jigsaw patches for fine-grained recognition. It consists of: (i) a progressive training strategy that adds new layers in each training step to exploit information based on the smaller-granularity information found at the last step and the previous stage, and (ii) a simple jigsaw puzzle generator to form images containing information of different granularity levels.

The design of PMG is generic and can be implemented on top of any state-of-the-art backbone feature extractor, e.g., ResNet [45]. Let F be the backbone feature extractor, which has L stages. The output feature map of any intermediate stage is represented as F^l ∈ R^{H_l × W_l × C_l}, where H_l, W_l, C_l are the height, width, and number of channels of the feature map at the l-th stage, and l = 1, 2, ..., L. Here, the objective is to impose a classification loss on the feature maps extracted at different intermediate stages. Hence, in addition to F, we introduce a convolution block H_conv^l(·) that takes the l-th intermediate stage output F^l as input and reduces it to a vector representation V^l = H_conv^l(F^l). Thereafter, a classification module H_class^l(·), consisting of two fully-connected layers with BatchNorm [61] and ELU [17] non-linearity, corresponding to the l-th stage, predicts the probability distribution over the classes as y^l = H_class^l(V^l). Here, we consider the last S stages: l = L, L − 1, ..., L − S + 1. Finally, we concatenate the outputs from the last three stages as

V^concat = concat[V^{L−S+1}, ..., V^{L−1}, V^L].   (4.139)

This is followed by an additional classification module y^concat = H_class^concat(V^concat).

PMG adopts progressive training, where the low stages are trained first and new stages are then progressively added for training. Since the receptive field and representation ability of the low stages are limited, the network is forced to first exploit discriminative information from local details (i.e., object textures). Compared to training the whole network directly, this incremental nature allows the model to locate discriminative information from local details to global structures as the features are gradually sent into higher stages, instead of learning all the granularities simultaneously. For the training of the outputs from each stage and the output from the concatenated features, we adopt the cross entropy (CE) loss L_CE between the ground-truth label y and the predicted probability distribution as

L_CE(y^l, y) = − ∑_{i=1}^{m} y_i · log(y_i^l),   (4.140)

and

L_CE(y^concat, y) = − ∑_{i=1}^{m} y_i · log(y_i^concat).   (4.141)

At each iteration, a batch of data d will be used for S + 1 steps, and we only train one stage’s output at each step in the series. It needs to be clear that all parameters used in the current prediction will be optimized, even if they might have been updated in the previous steps, and this can help each stage in the model work together.


Fig. 4.31 The training procedure of the progressive training [28] which consists of S + 1 steps at each iteration (Here S = 3 for explanation)

During training, a batch d of training data will first be augmented into several jigsaw puzzle generator-processed batches, obtaining P(d, n). All the jigsaw puzzle generator-processed batches share the same label y. Then, for the l-th stage's output y^l, we input the batch P(d, n) with n = 2^{L−l+1} and optimize all the parameters used in this propagation. Figure 4.31 illustrates the training procedure step by step.

It should be clarified that the jigsaw puzzle generator cannot always guarantee the completeness of parts that are smaller than the patch size, since such parts still have a chance of being split. However, this is not detrimental to model training: random cropping, a standard data augmentation strategy, is applied before the jigsaw puzzle generator, so the patches differ from those of previous iterations, and small discriminative parts that are split at one iteration will not always be split at other iterations. Hence, it brings the additional advantage of forcing the model to find more discriminative parts at the specific granularity level.

At the inference step, PMG merely inputs the original images into the trained model, and the jigsaw puzzle generator is unnecessary. If only y^concat is used for prediction, the FC layers of the other three stages can be removed, which reduces the computational budget. In this case, the final result C_1 can be expressed as

C_1 = arg max(y^concat).   (4.142)

However, the prediction from a single stage, based on information of a specific granularity, is unique and complementary, which leads to better performance when all outputs are combined together with equal weights. The multi-output combined prediction C_2 can be written as

C_2 = arg max( ∑_{l=L−S+1}^{L} y^l + y^concat ).   (4.143)


Hence, both the prediction of y^concat and the multi-output combined prediction can be obtained in the model. In addition, although all predictions are complementary for the final result, y^concat is enough for those objects whose shapes are relatively smooth, for example, cars.
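
As a recap of the jigsaw puzzle generator and the S + 1-step schedule described above, the sketch below shows one PMG-style training iteration; the wrapper interface model(x, stage=l), which is assumed to return y^l (and y^concat for stage='concat'), is hypothetical, and criterion stands for a standard cross-entropy loss such as torch.nn.CrossEntropyLoss().

```python
import torch

def jigsaw_generator(images, n):
    """Split each image into n x n patches and shuffle them randomly
    (image sides are assumed to be divisible by n)."""
    B, C, H, W = images.shape
    h, w = H // n, W // n
    patches = images.reshape(B, C, n, h, n, w).permute(0, 1, 2, 4, 3, 5)
    patches = patches.reshape(B, C, n * n, h, w)
    patches = patches[:, :, torch.randperm(n * n)]
    patches = patches.reshape(B, C, n, n, h, w).permute(0, 1, 2, 4, 3, 5)
    return patches.reshape(B, C, H, W)

def pmg_iteration(model, optimizer, criterion, images, labels, L=5, S=3):
    """One PMG training iteration with S + 1 steps (cf. Fig. 4.31)."""
    for l in range(L - S + 1, L + 1):
        n = 2 ** (L - l + 1)                             # granularity of stage l
        optimizer.zero_grad()
        loss = criterion(model(jigsaw_generator(images, n), stage=l), labels)
        loss.backward()
        optimizer.step()                                  # Eq. (4.140) for stage l
    optimizer.zero_grad()
    loss = criterion(model(images, stage='concat'), labels)   # Eq. (4.141)
    loss.backward()
    optimizer.step()
```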

4.4 Recognition with External Information

Beyond the conventional recognition paradigms, which are restricted to using supervision associated with the images themselves, another paradigm is to leverage external information, e.g., web data, multi-modal data, or human-computer interactions, to further assist fine-grained recognition.

4.4.1 Noisy Web Data

Large and well-labeled training datasets are necessary in order to identify the subtle differences between various fine-grained categories. However, acquiring accurate human labels for fine-grained categories is difficult due to the need for domain expertise and the myriad of fine-grained categories (e.g., potentially more than tens of thousands of subordinate categories in a meta-category). As a result, some fine-grained recognition methods seek to utilize freely available, but noisy, web data to boost recognition performance. The majority of existing works in this line can be roughly grouped into two directions. The first direction involves scraping noisy labeled web data for the categories of interest as training data, which is regarded as webly-supervised learning [129, 165, 190]. These approaches typically concentrate on: (1) overcoming the domain gap between easily acquired web images and the well-labeled data from standard datasets; and (2) reducing the negative effects caused by the noisy data. For instance, HAR-CNN [159] utilized easily annotated meta-classes inherent in the fine-grained data and also acquired a large number of meta-class-labeled images from the web to regularize the models for improving recognition accuracy in a multi-task manner (i.e., for both the fine-grained and the meta-class data recognition tasks). Xu et al. [164] investigated whether fine-grained web images could provide weakly-labeled information to augment deep features and thus contribute to robust object classifiers by building a multi-instance (MI) learner, i.e., treating the image as the MI bag and the proposal part bounding boxes as the instances of MI. Krause et al. [72] introduced an alternative approach to combine a generic classification model with web data by excluding images that appear in search results for more than one category to combat cross-domain noise. Inspired by adversarial learning [40], [129] proposed an adversarial discriminative loss to encourage representation coherence between standard and web data.

The second direction is to transfer the knowledge from auxiliary categories with well-labeled training data to the test categories, which usually employs zero-shot learning [106]


or meta learning [56]. Niu et al. [106] exploited zero-shot learning to transfer knowledge from annotated fine-grained categories to other fine-grained categories. Subsequently, Zhang et al. [180], Yang et al. [167], and Zhang et al. [174] investigated different approaches for selecting high-quality web training images to expand the training set. Zhang et al. [180] proposed a novel regularized meta-learning objective to guide the learning of network parameters so they are optimal for adapting to the target fine-grained categories. Yang et al. [167] designed an iterative method that progressively selects useful images by modifying the label assignment using multiple labels to lessen the impact of the labels from the noisy web data. Zhang et al. [174] leveraged the prediction scores in different training epochs to supervise the separation of useful and irrelevant noisy web images. In the following, we introduce several representative works.

4.4.1.1 MetaFGNet

To employ large models for fine-grained recognition without suffering from overfitting, existing methods usually adopt a strategy of pre-training the models using a rich set of auxiliary data, followed by fine-tuning on the target fine-grained recognition task. However, models obtained in this way are suboptimal for fine-tuning. To address this issue, MetaFGNet [180] was proposed. Training of MetaFGNet is based on a novel regularized meta-learning objective, which aims to guide the learning of network parameters so that they are optimal for adapting to the target fine-grained recognition task.

For a target fine-grained recognition task of interest, suppose MetaFGNet has training data T = {(I_i^t, y_i^t)}_{i=1}^{|T|}, where each pair of I_i^t and y_i^t represents an input image and the one-hot vector representation of its class label. Denote the auxiliary data as S = {(I_i^s, y_i^s)}_{i=1}^{|S|}. As illustrated in Fig. 4.32, MetaFGNet consists of two parallel classifiers of fully connected layers that share a common base network. The two classifiers are respectively used for T and S.

Fig.4.32 Schematic illustrations of MetaFGNet [180] with regularized meta-learning objective (solid line) and the process of sample selection from auxiliary data (dashed line)


More specifically, given sample mini-batches T_i and S_i from T and S, one-step gradient descent can be written as

[∇(θ_b; S_i), ∇(θ_c^s; S_i)] = (1/|S_i|) ∇_{θ^s} R(S_i; θ^s),   (4.144)

[∇(θ_b; T_i), ∇(θ_c^t; T_i)] = (1/|T_i|) ∇_{θ^t} L(T_i; θ^t),   (4.145)

where θ_c^s and θ_c^t denote the parameters of the two classifiers, and θ_b collectively denotes the base network, which contains the parameters of layer weights and biases. MetaFGNet denotes the parameters of the target and source models as θ^t = (θ_b, θ_c^t) and θ^s = (θ_b, θ_c^s), respectively. MetaFGNet computes adapted parameters with stochastic gradient descent (SGD) by

θ^{t′} = θ^t − η (1/|T_i|) ∇_{θ^t} L(T_i; θ^t),   (4.146)

where η denotes the step size. After that, given another sampled mini-batch T_j from T, solving the objective via SGD involves computing the gradient on T_j, which can be derived as

[∇(θ_b; T_j), ∇(θ_c^t; T_j)] = (1/|T_j|) ∇_{θ^{t′}} L(T_j; θ^{t′}) [I − η (1/|T_i|) ∂²L(T_i; θ^t)/∂(θ^t)²],   (4.147)

where I is the identity matrix and ∂²L(T_i; θ^t)/∂(θ^t)² is the Hessian. Thus, the parameter update process can be written as

θ_b ← θ_b − α [∇(θ_b; S_i) + ∇(θ_b; T_j)],   (4.148)

θ_c^t ← θ_c^t − α ∇(θ_c^t; T_j),   (4.149)

θ_c^s ← θ_c^s − α ∇(θ_c^s; S_i).   (4.150)
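
The regularized meta-update of Eqs. (4.144)–(4.150) can be sketched with functional parameter updates as below; representing the base network and the two classifiers as lists of parameter tensors, the loss callables loss_t and loss_s, and the step sizes eta and alpha are assumptions of this sketch.

```python
import torch

def metafgnet_step(base, head_t, head_s, loss_t, loss_s, T_i, T_j, S_i,
                   eta=0.01, alpha=0.01):
    """One MetaFGNet update following Eqs. (4.144)-(4.150).

    base/head_t/head_s: lists of leaf parameter tensors; loss_t(params, batch)
    and loss_s(params, batch) compute the target/source losses (hypothetical)."""
    theta_t = base + head_t                      # theta^t = (theta_b, theta_c^t)
    theta_s = base + head_s                      # theta^s = (theta_b, theta_c^s)

    # source gradients on S_i, Eq. (4.144)
    grad_s = torch.autograd.grad(loss_s(theta_s, S_i), theta_s)

    # inner adaptation on T_i, Eq. (4.146); keep the graph for the meta-gradient
    grad_inner = torch.autograd.grad(loss_t(theta_t, T_i), theta_t, create_graph=True)
    theta_t_adapted = [p - eta * g for p, g in zip(theta_t, grad_inner)]

    # meta-gradient on T_j through the adapted parameters, Eq. (4.147)
    grad_meta = torch.autograd.grad(loss_t(theta_t_adapted, T_j), theta_t)

    with torch.no_grad():                        # updates of Eqs. (4.148)-(4.150)
        for p, gs, gm in zip(base, grad_s[:len(base)], grad_meta[:len(base)]):
            p -= alpha * (gs + gm)
        for p, gm in zip(head_t, grad_meta[len(base):]):
            p -= alpha * gm
        for p, gs in zip(head_s, grad_s[len(base):]):
            p -= alpha * gs
```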

In the following, we introduce the sample selection of auxiliary data using the proposed MetaFGNet. Given a trained MetaFGNet, for each auxiliary sample I^s from S, MetaFGNet computes a forward pass through the network to get two prediction vectors z_s^s and z_t^s, which are respectively the output vectors of the two classifiers (before the softmax operation) for the source and target tasks. The length of z_s^s (or z_t^s) is equal to the number of categories of the source task (or that of the target task). To achieve sample selection from the auxiliary set S, MetaFGNet assigns a score to each I^s and then ranks the scores of all auxiliary samples. The score of I^s is computed as follows: MetaFGNet sets the negative values in z_s^s and z_t^s to zero; it then concatenates the resulting vectors and applies ℓ2 normalization, producing z̃^s = [z̃_s^{s⊤}, z̃_t^{s⊤}]^⊤; finally, it computes the score for I^s as

O^s = z̃_t^{s⊤} · 1,   (4.151)

where 1 represents a vector with all entry values of 1. A specified ratio of top samples can be selected from S and form a new set of auxiliary data.
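
The scoring rule of Eq. (4.151) amounts to a few tensor operations, sketched here for a single auxiliary sample; the interpretation that the score sums the target-task part of the normalized vector follows the description above.

```python
import torch

def auxiliary_sample_score(z_s, z_t):
    """Sample-selection score of Eq. (4.151) for one auxiliary image.

    z_s, z_t: pre-softmax prediction vectors of the source and target classifiers."""
    z = torch.cat([z_s.clamp(min=0), z_t.clamp(min=0)])   # zero out negative entries
    z = z / (z.norm() + 1e-12)                            # l2 normalization
    return z[z_s.numel():].sum()                          # dot product of the target part with 1
```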


4.4.1.2 Peer-learning

Peer-learning [130] was proposed for dealing with webly-supervised fine-grained recognition. This model jointly leverages both "hard" and "easy" examples for training, which can keep the peer networks divergent and maintain their distinct learning abilities in removing noisy images; thus, the authors denote this algorithm as "Peer-learning". This strategy can alleviate both the accumulated-error problem in MentorNet [96] and the consensus issue in Co-teaching [43], which boosts the performance of webly-supervised learning.

As illustrated in Fig. 4.33, the framework of Peer-learning includes two networks h_1 and h_2. Concretely, given a mini-batch of data G = {(I_i, y_i)}, where y_i is the (possibly noisy) label of the image I_i, h_1 and h_2 first separately predict the labels ŷ_{i,h_1} and ŷ_{i,h_2} of I_i, based on which G is divided into G_s = {(I_k, y_k) ∈ G | ŷ_{k,h_1} = ŷ_{k,h_2}} (instances with identical predictions) and G_d = {(I_k, y_k) ∈ G | ŷ_{k,h_1} ≠ ŷ_{k,h_2}} (instances with different predictions). Peer-learning treats G_d as "hard examples", which can benefit the training of h_1 and h_2. Peer-learning explores the "useful knowledge" in G_s by selecting small proportions of instances G_s^1 and G_s^2 according to the losses computed by h_1 and h_2, respectively. G_s^i (i ∈ {1, 2}) consists of the instances that have the top (1 − d(T)) smallest training losses under h_i, and is concretely defined as

G_s^i = arg min_{G̃_s ⊂ G_s : |G̃_s| ≥ (1−d(T))|G_s|} ∑_{(I_j, y_j) ∈ G̃_s} L_{h_i}(I_j, y_j),   (4.152)

where L_{h_i}(I_j, y_j) is the training loss of instance I_j computed by h_i (i ∈ {1, 2}), and |G_s| indicates the number of elements in G_s. Particularly,

Fig. 4.33 Overall framework of Peer-learning model [130]. The input is a mini-batch of web images. Each network in h 1 and h 2 individually feeds forward data to separately predict the labels, based on which the input data is split into two sets G s (instances with identical predictions) and G d (instances with different predictions). Then, h 1 and h 2 individually sort and fetch small-loss instances in G s as the useful knowledge G s1 and G s2 . Subsequently, h 1 updates its parameters using G d and G s2 , while h 2 updates its parameters using G d and G s1

d(T) = ξ · min{T / T_k, 1},   (4.153)

is the drop rate that dynamically controls |G_s^i|, where ξ is the maximum drop rate, and T_k is the number of epochs after which d(T) is no longer updated. After obtaining G_d and G_s^i (i ∈ {1, 2}), Peer-learning treats G_d ∪ G_s^1 as the "useful knowledge" to train h_2, provided by its peer network h_1. Similarly, G_d ∪ G_s^2 is adopted to train h_1. The parameters θ_{h_i} of the network h_i (i ∈ {1, 2}) are updated by using the gradient ∇L_{h_i} with a learning rate λ by

θ_{h_1} ← θ_{h_1} − λ · ∑_{(I_i, y_i) ∈ G_d ∪ G_s^2} ∇L_{h_1}(I_i, y_i),   (4.154)

θ_{h_2} ← θ_{h_2} − λ · ∑_{(I_i, y_i) ∈ G_d ∪ G_s^1} ∇L_{h_2}(I_i, y_i).   (4.155)

By mutually communicating this "useful knowledge", the performance of both h_1 and h_2 can be improved.
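
The sample exchange of Eqs. (4.152)–(4.155) for one mini-batch can be sketched as follows; using the per-sample cross-entropy as L_{h_i} and recomputing the forward passes for selection are simplifications of this sketch rather than details of [130].

```python
import torch

def peer_learning_step(h1, h2, opt1, opt2, images, labels, drop_rate):
    """One Peer-learning mini-batch update."""
    ce = torch.nn.CrossEntropyLoss(reduction='none')
    with torch.no_grad():
        pred1, pred2 = h1(images).argmax(1), h2(images).argmax(1)
    agree = pred1 == pred2                       # G_s: identical predictions
    hard_idx = (~agree).nonzero(as_tuple=True)[0]   # G_d: different predictions

    def small_loss_subset(net, mask):
        # keep the (1 - d(T)) proportion of G_s with the smallest losses, Eq. (4.152)
        idx = mask.nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            return idx
        with torch.no_grad():
            losses = ce(net(images[idx]), labels[idx])
        keep = max(1, int((1 - drop_rate) * idx.numel()))
        return idx[losses.argsort()[:keep]]

    g_s1 = small_loss_subset(h1, agree)          # selected by h1, used to train h2
    g_s2 = small_loss_subset(h2, agree)          # selected by h2, used to train h1

    for net, opt, idx in ((h1, opt1, torch.cat([hard_idx, g_s2])),
                          (h2, opt2, torch.cat([hard_idx, g_s1]))):
        if idx.numel() == 0:
            continue
        opt.zero_grad()
        ce(net(images[idx]), labels[idx]).sum().backward()   # Eqs. (4.154)-(4.155)
        opt.step()
```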

4.4.1.3 Webly-Supervised Fine-Grained Recognition with Partial Label Learning

The task of webly-supervised fine-grained recognition is to boost the recognition accuracy of classifying subordinate categories by utilizing freely available but noisy web data. As label noise significantly hurts network training [106], it is desirable to distinguish and eliminate noisy images. The pipeline of webly-supervised fine-grained recognition with partial label learning is shown in Fig. 4.34. This method [163] proposed an open-set label noise removal strategy and a closed-set label noise correction strategy to deal with the practical but challenging webly-supervised fine-grained recognition task; in particular, it corrects the closed-set label noise with label sets via partial label learning [98].

Concretely, assuming the label space to be Y and the image space to be R, for each set R = {I_1, I_2, I_3, ..., I_n} ⊂ R which contains n images, the open-set label noise removal strategy feeds them to a pre-trained CNN model f_pre(·) to obtain the corresponding feature maps t_i = f_pre(I_i) ∈ R^{H×W×d}, where H, W, d denote the height, width, and depth of t_i. After that, all the feature maps are put together to derive a feature set T ∈ R^{n×H×W×d}. This method obtains the common pattern in T by applying Principal Component Analysis (PCA) [157] along the depth dimension, i.e., the eigenvector p ∈ R^d corresponding to the largest eigenvalue serves as a common pattern detector after the PCA process. Then, each spatial location of the given feature maps is channel-wise weighted and summarized to get the indicator matrices H. More precisely, the indicator matrix H_i in H corresponding to the i-th feature map t_i is formulated as H_i = t_i · p, and C_i ∈ R^{H′×W′} is obtained by upsampling the indicator matrix H_i ∈ R^{H×W} according to the input size. Then, this method sets a threshold δ on the correlation value to detect each image with


Fig. 4.34 Overall framework of webly-supervised fine-grained recognition with partial label learning [163], which consists of two strategies. The first strategy erases open-set noisy images and obtains an image space X. The second strategy is composed of two components, i.e., (1) performing the top k recall optimization loss on remaining images to gain the label sets containing the ground truth; (2) utilizing the distance between closed-set noisy images and each row of the encoding matrix M to obtain the prediction category for each closed-set noisy image. Finally, the samples in image space X will be put into the network for training

( ∑_{j=1}^{H′} ∑_{t=1}^{W′} 1(C_i(j, t) ≥ 0) ) / |C_i| ≥ δ,   (4.156)

where 1(·) represents the indicator function. If a sample does not satisfy Eq. (4.156), it is regarded as open-set label noise and removed from the noisy image space. Finally, a sample space X is obtained via the open-set label noise removal strategy.

In the following, we introduce the closed-set label noise correction strategy, which is composed of two components. The first component is a top-k recall optimization loss. Formally, in the label space Y = {y_1, y_2, ..., y_N}, define the sample space X as X = {X_{y_1}, X_{y_2}, ..., X_{y_N}}, where X_{y_i} represents the collection of instances belonging to the i-th category y_i. During the training data selection stage, this method randomly selects C categories to generate a mini-batch A. For each selected category y_i, it takes n* samples from X_{y_i}. The embedding features F = {f_1, f_2, ..., f_a} are obtained based on A = {A_1, A_2, ..., A_a} ⊂ X, where a = n* × C. More specifically, this method gains the embedding feature f_i of its input image A_i in A via a backbone CNN model f_CNN(·) by f_i = f_CNN(A_i) ∈ R^c, where c is the length of the embedding feature f_i. The similarity matrix s ∈ R^{a×a} is generated by cosine similarity based on the embedding features to measure the distances among A. In detail, the similarity of the i-th query image and the j-th support image is calculated as s_{i,j} = f_i^⊤ f_j / (‖f_i‖ ‖f_j‖). This method defines the set K as the group of top-k images sorted by the similarity of each query image and the other entries of s_{q,:}, where s_{q,:} ∈ s and s_{q,q} ∉ s_{q,:}, i.e., K = {A_j ∈ A : s_{q,j} ≥ s_{q,[k]}, q ≠ j}, where [k] denotes the k-th largest element.


Images belonging to the same label as the query image but not in the set K are called positive images, while negative images are samples in K that have a different label from the query image. The positive images excluded from K constitute P = {A_j ∈ A\K : y_j = y_q}, while the negative images in K constitute N, where A\K denotes the relative complement of K in A and y_q is the label of the query image. Therefore, the loss function is defined as

L_topk = ∑_{q=1}^{a} ( ∑_{A_i ∈ N} s_{q,i} − ∑_{A_j ∈ P} s_{q,j} ).   (4.157)
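
The top-k recall optimization loss of Eq. (4.157) can be sketched as below; the loop over queries and the cosine-similarity computation follow the description above, while the batch layout is an assumption of this sketch.

```python
import torch

def topk_recall_loss(features, labels, k):
    """Top-k recall optimization loss of Eq. (4.157).

    features: (a, c) embedding features of a mini-batch; labels: (a,) class indices."""
    f = torch.nn.functional.normalize(features, dim=1)
    s = f @ f.t()                                        # cosine similarities s_{i,j}
    a = s.shape[0]
    loss = s.new_zeros(())
    for q in range(a):
        sims = s[q].clone()
        sims[q] = float('-inf')                          # exclude the query itself
        in_k = torch.zeros(a, dtype=torch.bool)
        in_k[sims.topk(k).indices] = True                # the set K for query q
        same = labels == labels[q]
        same[q] = False
        pos = same & ~in_k                               # P: same class, missed by the top-k
        neg = (~same) & in_k                             # N: different class, inside the top-k
        loss = loss + s[q][neg].sum() - s[q][pos].sum()  # Eq. (4.157)
    return loss
```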

Optimizing the model with this loss function encourages the correct label to be included in the top-k classes with high confidence. Then, the top-k predicted labels with high confidence are used as the label set S for the corresponding sample.

The second component is an improved Error-Correcting Output Codes (ECOC) [26] scheme to find the correct labels. In the encoding stage, an encoding matrix M ∈ {+1, −1}^{N×L} is produced to support the learning process, where N represents the number of categories and L represents the number of binary classifiers. More specifically, let v = [v_1, v_2, ..., v_N] ∈ {+1, −1}^N denote the N-bit column coding which divides the label space into a positive half Y_v^+ = {y_j | v_j = +1, 1 ≤ j ≤ N} and a negative half Y_v^− = {y_j | v_j = −1, 1 ≤ j ≤ N}. Given a training sample (X_m, S_m), X_m ∈ X, this method regards S_m as an entirety to help build a binary classifier. Furthermore, the image X_m is used as a positive or negative sample only when the whole label set S_m falls into Y_v^+ or Y_v^−. Thus, binary training sets B can be generated for binary classifier training, where B = {B_{M(:,1)}, B_{M(:,2)}, ..., B_{M(:,L)}}, B_{M(:,l)} = {(X_m, +1/−1) | 1 ≤ m ≤ |X|}. In the decoding stage, a connected set E_y is constructed for each category. The j-th connected set E_{y_j} can be written as

E_{y_j} = E_{y_j} ∪ {S_m : y_j ∈ S_m, 1 ≤ m ≤ |X|}.   (4.158)

With the assistance of E_y, a performance matrix G ∈ R^{N×L} is obtained to represent the capability of the classifiers. The performance of the t-th classifier g_t on the j-th category is calculated as

G(j, t) = min_{z ∈ E_{y_j}} ( (1/|Q_z|) ∑_{(X_m, S_m) ∈ Q_z} 1(g_t(X_m) = M(z, t)) ),   (4.159)

where Q_z = {(X_m, S_m) | y_z ∈ S_m, 1 ≤ m ≤ |X|}, and 1(·) represents the indicator function. Then, the performance matrix G is normalized by row as G*(j, t) = G(j, t) / ∑_{r=1}^{L} G(j, r), where 1 ≤ j ≤ N and 1 ≤ t ≤ L. Given a closed-set noisy image X_cs, the label prediction can be obtained via

arg min_{y_j (1 ≤ j ≤ N)} ∑_{t=1}^{L} G*(j, t) exp(−g_t(X_cs) M(j, t)).   (4.160)


Finally, this method obtains the correct labels yˆ of the closed-set noisy images by Eq. (4.160) and sends the rest of the samples to the backbone network for re-training.
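
The weighted decoding of Eq. (4.160) is summarized in the short sketch below; the binary classifiers are assumed to be callables returning ±1, which is an interface assumption of this sketch.

```python
import torch

def ecoc_decode(x_cs, classifiers, M, G_star):
    """Closed-set label correction by the weighted ECOC decoding of Eq. (4.160).

    classifiers: list of L binary classifiers g_t returning +1 or -1;
    M: (N, L) coding matrix in {+1, -1}; G_star: (N, L) row-normalized
    performance matrix.  Returns the corrected label index."""
    outputs = torch.tensor([float(g(x_cs)) for g in classifiers])        # (L,)
    # decoding distance of every class codeword to the classifier outputs
    dist = (G_star * torch.exp(-outputs.unsqueeze(0) * M)).sum(dim=1)    # (N,)
    return int(dist.argmin())
```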

4.4.2 Multi-Modal Data

Multi-modal analysis has attracted a lot of attention with the rapid growth of multi-media data, e.g., images, text, knowledge bases, etc. In fine-grained recognition, multi-modal data can be used to establish joint representations/embeddings in order to boost fine-grained recognition accuracy. Compared with strong semantic supervision for fine-grained images (e.g., part annotations), text descriptions are a weak form of supervision (i.e., they only provide image-level supervision). One advantage, however, is that text descriptions can be relatively accurately generated by non-experts. Thus, they are both easy and cheap to collect. In addition, high-level knowledge graphs, when available, can contain rich knowledge (e.g., DBpedia [79]). In practice, both text descriptions and knowledge bases are useful extra guidance for advancing fine-grained image representation learning.

Reed et al. [114] collected text descriptions and introduced a structured joint embedding for zero-shot fine-grained image recognition by combining text and images. Later, He and Peng [46] combined vision and language bi-streams in a joint end-to-end fashion to preserve the intra-modality and inter-modality information for generating complementary fine-grained representations. Subsequently, PMA [125] proposed a mask-based self-attention mechanism to capture the most discriminative parts in the visual modality. In addition, they explored using out-of-visual-domain knowledge from language with query-relational attention. Multiple PMA blocks for the vision and language modalities were aggregated and stacked using the proposed progressive mask strategy. For fine-grained recognition with knowledge bases, some existing works [12, 162] have introduced knowledge base information (using attribute label associations, cf. Fig. 4.35) to implicitly enrich the embedding space, while also reasoning about the discriminative attributes of fine-grained objects. Concretely, T-CNN [162] explored using semantic embeddings from knowledge bases and text, and then trained a CNN to linearly map image features to the semantic embedding space to aggregate multi-modal information. To incorporate the knowledge representation into image features, KERL [12] employed a gated graph network to propagate node messages through the graph to generate the knowledge representation. Finally, [175] incorporated audio information related to the fine-grained visual categories of interest to boost recognition accuracy. In the following, we introduce two representative works of this research line, as well as a recent multi-modality fine-grained dataset with insightful empirical observations.


Fig. 4.35 An example knowledge graph for modeling category-attribute correlations in CUB200-2011 [140]

4.4.2.1 PMA

Progressive Mask Attention (PMA) [125] is an end-to-end trainable model for fine-grained recognition that leverages both the visual and language modalities. It applies a progressive mask strategy to the attention module to attend to a set of distinct and non-overlapping parts stage-by-stage, while most existing attention-based methods for multiple discriminative part localization only focus on a few important parts repeatedly. Unlike other attention methods requiring bounding box/part annotations in images or keyword annotations in texts, PMA only needs raw images and text descriptions to capture the key patches from images or text descriptions and bridge the connection between them.

Concretely, PMA consists of three key components. The first is the Self-Attention Module (SAM), a component used to gather semantics from a single modality. Assuming x ∈ R^d as the input, the formulation of SAM is

SAM(x) = W_2 · δ(W_1 · x),   (4.161)

where W_1 ∈ R^{d×(d/r)} and W_2 ∈ R^{(d/r)×1} are learnable matrices, and r = 16 is a reduction ratio. δ(·) refers to the ReLU activation function (in the visual modality) or the tanh activation function (in the language modality), respectively. The second is the Query-Relational Module (QRM), which is used to establish the connection between the visual and language modalities. It is able to guide visual features to query the relevant keys in the language modality. Denoting x ∈ R^{d_1} as the key vector and y ∈ R^{d_2} as the query vector, the formulation of QRM is

QRM(x, y) = y ⊙ (W_q · x),   (4.162)

where ⊙ represents the dot product and W_q ∈ R^{d_2×d_1} is a learnable matrix. The third is the Mask Template, denoted as M = {m_1, ..., m_n}, which is adopted by the following progressive mask strategy. Here n is consistent with the number of input vectors and m_i ∈ {0, −∞}, where each m_i is initialized to 0 during each training step. The mask templates for the visual and language modalities are denoted by M_V and M_T, respectively.
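
For reference, minimal PyTorch versions of the two attention components of Eqs. (4.161)–(4.162) are sketched below; nn.Linear absorbs the transposition conventions of the weight matrices, and the per-modality choice of activation is passed in explicitly.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Self-Attention Module of Eq. (4.161); r = 16 follows the text."""
    def __init__(self, d, r=16, activation=nn.ReLU()):
        super().__init__()
        self.w1 = nn.Linear(d, d // r, bias=False)   # plays the role of W_1
        self.w2 = nn.Linear(d // r, 1, bias=False)   # plays the role of W_2
        self.act = activation                        # ReLU (vision) or Tanh (language)

    def forward(self, x):                 # x: (..., d) -> one scalar score per vector
        return self.w2(self.act(self.w1(x))).squeeze(-1)

class QRM(nn.Module):
    """Query-Relational Module of Eq. (4.162): the query y scores a key x."""
    def __init__(self, d_key, d_query):
        super().__init__()
        self.wq = nn.Linear(d_key, d_query, bias=False)   # plays the role of W_q

    def forward(self, x, y):              # x: (..., d_key), y: (..., d_query)
        return (y * self.wq(x)).sum(-1)   # dot product between y and W_q x
```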


Fig. 4.36 Detailed architecture of the PMA module [125]. “ ” is the weighted sum operation, “⊕” is the add operation, and δ(·) is the non-linear activation function

training step. The mask template M for visual modality and language modality is defined as     M V = m v1 , . . . , m vn and M T = m t1 , . . . , m tn , respectively. Figure 4.36 also illustrates these components. For the visual modality, PMA uses a mask to discard the located part in the previous stage to gain a set of distinct and non-overlap parts. It also aggregates the global image semantics calculated by attention weights and discriminative part features as the final state of a single stage. Given an image, PMA uses a conventional CNN to encode it and obtains the outputs from the last convolution layer X ∈ Rh×w×d (e.g., conv5_3 in VGG-16). Then, it employs an additional 2 × 2 max-pooling operator to X to gather more compact information, obtaining X˜ = { x˜ 1 , x˜ 2 , . . . , x˜ n }, where x˜ i ∈ R1×1×d is the aggregated local feature vector. After that a self-attention and the visual mask template M V are followed to evaluate the attention weight aiv corresponding to each local feature vector x˜ i as  exp SAM ( x˜ i ) + m iv v ,  ai = (4.163)  n v ˜ + m exp SAM x j j=1 j SAM (·) is equal to Eq. (4.161) and m iv is the i-th element of M V . PMA calculates the n aiv x˜ i . weighted sum of each local feature vector as the content vector f vglobal = i=1 To find and locate the key object parts which have the discriminative information, PMA takes the block x i which has the largest aiv as x max and employs 1 × 1-conv and global average-pooling operation over x max to get a compact part-level feature f vlocal . Finally, PMA concatenates the aforementioned global feature f vglobal and local feature f vlocal as f fusion to form the final representation in the visual modality as + * (4.164) f vision = f vglobal ; f vlocal , where [·; ·] denotes the concatenate operation. For the sake of multiple distinct parts localization, PMA designs a mask strategy for visual mask template M V to force the stacked modules to capture different discriminative visual parts in a stage-by-stage fashion. When v , PMA will update the the largest attention weight is returned in the current stage as amax

4.4

Recognition with External Information

117

v . So in the following stage, the PMA elements of the mask template m iv as −∞, if aiv = amax attention module will locate another important part on the basis of the updated mask M V . For the language modality, given the raw texts T describing the characteristics of the fine-grained objects in an image, PMA uses word embeddings and long-short term memory (LSTM) [54] to extract phrase-level embeddings for each noun phrase. These phrase-level   embeddings can be denoted as Z = z 1 , z 2 , . . . , z p , where p is the number of noun phrases and zi ∈ Rd is the vector of phrase embeddings. PMA designs a query-based attention mechanism with text mask template M T for Z and located part-level feature f vlocal to generate its corresponding text representation. The attention weight ait for each noun phrase is calculated as   exp QRM f vlocal , zi + m it t ,   v (4.165) ai = m t j=1 exp QRM f local , z i + m i

where $\mathrm{QRM}(\cdot,\cdot)$ is defined in Eq. (4.162) and $m_i^t$ is the $i$-th element of $M^T$. The content vector can be calculated as $f^t_{\mathrm{local}} = \sum_{i=1}^{p} a_i^t z_i$. Besides the text features guided by the located part feature, PMA also mines textual knowledge beyond the visual domain. It discards phrases that are highly relevant to the located part and employs a self-attention mechanism over the remaining phrases to generate a textual representation. The attention weight $\tilde{a}_i^t$ for each phrase is
$$\tilde{a}_i^t = \frac{\exp\left(\mathrm{SAM}(z_i) + m_i^t\right)}{\sum_{j=1}^{p} \exp\left(\mathrm{SAM}(z_j) + m_j^t\right)}, \qquad (4.166)$$
where $\mathrm{SAM}(\cdot)$ is defined in Eq. (4.161). PMA calculates the weighted sum of the noun phrases as the content vector $f^t_{\mathrm{global}} = \sum_{i=1}^{p} \tilde{a}_i^t z_i$. Finally, PMA concatenates the global feature $f^t_{\mathrm{global}}$ and the local feature $f^t_{\mathrm{local}}$ to form the final representation in the language modality as
$$f_{\mathrm{text}} = \left[f^t_{\mathrm{global}}; f^t_{\mathrm{local}}\right]. \qquad (4.167)$$
The progressive mask strategy for the language modality updates the element $m_i^t$ to $-\infty$ whenever its weight $a_i^t$ is ranked in the top 3 and is higher than $1/p$. After obtaining the visual and textual representations in each stage, PMA concatenates them as the stage output and appends a shared fully connected layer after each stage for dimensionality reduction:
$$f_i^{\mathrm{final}} = \mathrm{FC}\left(\left[f_i^{\mathrm{visual}}; f_i^{\mathrm{text}}\right]\right). \qquad (4.168)$$
PMA aggregates the three stage outputs and the object-level representation as the final representation $F$ for prediction, where the object-level representation $f_{\mathrm{object}}$ is extracted by global average-pooling over the feature map $X$:
$$F = \left[f_{\mathrm{object}}; f_1^{\mathrm{final}}; f_2^{\mathrm{final}}; f_3^{\mathrm{final}}\right]. \qquad (4.169)$$
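To make the masked attention mechanics concrete, the following is a minimal sketch, and not the authors' released implementation, of how the masked softmax of Eqs. (4.163) and (4.165) and the progressive mask update could be written in PyTorch; the scoring function `sam_score` is a stand-in for SAM (Eq. (4.161)) and is an assumption made purely for illustration.

```python
import torch
import torch.nn.functional as F

def masked_attention(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Softmax over scores with an additive mask (masked entries hold -inf)."""
    return F.softmax(scores + mask, dim=-1)

def pma_vision_stage(x_tilde: torch.Tensor, mask_v: torch.Tensor, sam_score):
    """One PMA visual stage (sketch).

    x_tilde:   (n, d) pooled local feature vectors.
    mask_v:    (n,) additive visual mask template (0 or -inf), cf. M^V.
    sam_score: callable mapping (n, d) features to (n,) attention scores.
    Returns the global content vector (weighted sum of Eq. (4.163) weights),
    the index of the located part, and the updated mask template.
    """
    a_v = masked_attention(sam_score(x_tilde), mask_v)      # Eq. (4.163)
    f_global = (a_v.unsqueeze(-1) * x_tilde).sum(dim=0)     # weighted sum of locals
    idx_max = int(torch.argmax(a_v))                        # located discriminative part
    new_mask = mask_v.clone()
    new_mask[idx_max] = float("-inf")                       # progressive mask update
    return f_global, idx_max, new_mask

# Toy usage: a 7x7 grid of 512-d local features, a linear head as a SAM stand-in.
n, d = 49, 512
x_tilde = torch.randn(n, d)
mask_v = torch.zeros(n)
sam = torch.nn.Linear(d, 1)
sam_score = lambda feats: sam(feats).squeeze(-1)
for stage in range(3):  # stacked stages are forced onto distinct parts
    f_global, idx, mask_v = pma_vision_stage(x_tilde, mask_v, sam_score)
```

Because masked positions contribute $\exp(-\infty) = 0$ to the softmax, each subsequent stage is prevented from re-selecting a part that was already located in an earlier stage.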

Fig. 4.37 Overall framework of the Progressive Mask Attention model for fine-grained recognition [125]. The processed bi-modal representations are fed into the PMA module at the first stage. Later, by employing the progressive mask strategy, multiple PMA modules can be stacked, corresponding to multiple processing stages

At last, PMA appends a fully connected layer with the softmax function on top of $F$ to conduct the final classification. Figure 4.37 shows the whole multi-stage framework of the Progressive Mask Attention model. To further support prediction in a single-modality environment (e.g., using only image data), PMA performs a knowledge distillation approach [53] to compress the knowledge of both the visual and language modalities into a student model. Here, the PMA model is the teacher model, and a standard network (i.e., one that only takes the original image as input) is the student model. For the teacher model, the training corpus is defined as $(s_i, y_i) \in \{\mathcal{S}, \mathcal{Y}\}$, where $s_i$ denotes a pair of image and text data and $y_i$ is the ground truth. The standard cross-entropy loss function is used for the PMA model as
$$\mathcal{L}_{\mathrm{teacher}}(\mathcal{Y} \mid \mathcal{S}; \theta_T) = -\sum_{i=1}^{N}\sum_{j=1}^{C} \mathbb{1}\{y_i = j\}\, \log P\left(y_i \mid s_i; \theta_T\right), \qquad (4.170)$$
where $N$ and $C$ are the numbers of training samples and classes, and $\theta_T$ denotes the parameters of the teacher model (i.e., the PMA model). For the student model, the training corpus is defined as $(t_i, y_i) \in \{\mathcal{T}, \mathcal{Y}\}$, where $t_i$ is the image data. Instead of using the ground truth of the images for prediction, the distiller enforces the student model to learn the output probability $P(y_i \mid s_i; \theta_T)$ of the teacher model. Therefore, the loss function for knowledge distillation can be formulated as
$$\mathcal{L}_{\mathrm{student}}(\mathcal{Y} \mid \mathcal{T}; \theta_S) = -\sum_{i=1}^{N}\sum_{j=1}^{C} P\left(j \mid s_i; \theta_T\right) \cdot \log P\left(j \mid t_i; \theta_S\right), \qquad (4.171)$$


where $\theta_S$ denotes the parameters of the student model. Based on Eq. (4.171), the knowledge of the two modalities is distilled into the visual modality, which allows the model to return predictions even without text data during inference.
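As a minimal illustration rather than the authors' implementation, the distillation objective of Eq. (4.171) can be expressed as a cross-entropy between the teacher's soft predictions and the student's predictions; the helper below assumes raw logits from both models and averages the per-sample sums over the batch.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """Eq. (4.171): cross-entropy of the student's predictions against the
    teacher's output distribution (summed over classes, averaged over batch)."""
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)   # P(j | s_i; theta_T)
    student_log_probs = F.log_softmax(student_logits, dim=-1)    # log P(j | t_i; theta_S)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Toy usage: the image-only student learns to match the bi-modal teacher.
batch, num_classes = 8, 200
teacher_logits = torch.randn(batch, num_classes)   # from the (frozen) teacher
student_logits = torch.randn(batch, num_classes, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```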

4.4.2.2 GLAVNet
GLAVNet [120] was proposed to recognize materials through the combined use of auditory and visual perception. Unlike most existing methods, which infer material information solely from global geometry, GLAVNet takes auditory information into account and demonstrates that local geometry has a greater impact on the generated sound than global geometry, and therefore offers more cues for material recognition. The fusion of global and local information also enables GLAVNet to handle fine-grained material recognition. Specifically, GLAVNet accepts three different inputs and consists of a multi-branch CNN, as depicted in Fig. 4.38. It comprises four parts: a global geometry subnetwork, a local geometry subnetwork, an audio subnetwork, and a fusion subnetwork. The global and local geometry subnetworks are built upon the basic structure of VoxNet [101]. The input layer accepts a grid of fixed size 32 × 32 × 32. After passing through two 3D convolution layers (kernel sizes 5 × 5 × 5 and 3 × 3 × 3, respectively), a 3D pooling layer (kernel size 2 × 2 × 2), and a fully-connected layer, each subnetwork outputs a 384-dimensional latent vector which encodes the main characteristics of the input geometry. The Leaky ReLU activation is used in each convolution layer. Since both the global geometry and the local geometry are voxelized at a fixed spatial resolution of 32 × 32 × 32, the grid of the local geometry possesses more details and hence provides more visual features that are closely related to the generated sound.

Fig. 4.38 Overall framework of GLAVNet [120] which consists of a global geometry subnetwork, a local geometry subnetwork, an audio subnetwork and a fusion subnetwork. GLAVNet can predict the fine-grained material categories based on the global geometry, the local geometry and the sound


The audio subnetwork first applies a Mel-scaled Short-Time Fourier Transform (STFT) to the audio clip and then computes the squared magnitude of the STFT coefficients to generate the spectrogram of the clip. Once the spectrogram is generated, the audio subnetwork uses a shallow network with a 2D convolution layer (kernel size 5 × 5), a 2D pooling layer (kernel size 2 × 2), and a fully-connected layer to convert each spectrogram into a 384-dimensional latent vector. After obtaining the three 384-dimensional vectors from the global geometry, the local geometry, and the spectrogram, respectively, GLAVNet employs a multi-modal fusion subnetwork which contains two Multi-modal Factorized Bilinear pooling (MFB) modules [170] and a concatenation module. Each MFB module accepts visual and auditory features, and expands and fuses them in a high-dimensional space. Then, the fused feature is squeezed to produce a compact output. The two fused features are concatenated and fed to an $N$-way classifier consisting of two fully-connected layers trained with the cross-entropy loss.
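The layer configuration just described maps onto a small 3D/2D CNN quite directly. Below is a hedged sketch of one geometry branch and the audio branch in PyTorch; the channel widths, the spectrogram size, and the activation in the audio branch are assumptions, since the text only fixes the kernel sizes, the 32 × 32 × 32 voxel input, and the 384-dimensional outputs.

```python
import torch
import torch.nn as nn

class GeometryBranch(nn.Module):
    """VoxNet-style encoder: 32x32x32 voxel grid -> 384-d latent vector."""
    def __init__(self, width1: int = 32, width2: int = 64):  # widths are assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, width1, kernel_size=5), nn.LeakyReLU(inplace=True),
            nn.Conv3d(width1, width2, kernel_size=3), nn.LeakyReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )
        self.fc = nn.LazyLinear(384)  # flattens whatever spatial size remains

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        x = self.features(voxels)
        return self.fc(x.flatten(start_dim=1))

class AudioBranch(nn.Module):
    """Shallow encoder: (1, H, W) spectrogram -> 384-d latent vector."""
    def __init__(self, width: int = 32):  # width is assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, width, kernel_size=5), nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.fc = nn.LazyLinear(384)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        x = self.features(spectrogram)
        return self.fc(x.flatten(start_dim=1))

# Toy usage: one global-geometry grid, one local-geometry grid, one spectrogram.
global_vec = GeometryBranch()(torch.randn(1, 1, 32, 32, 32))
local_vec  = GeometryBranch()(torch.randn(1, 1, 32, 32, 32))
audio_vec  = AudioBranch()(torch.randn(1, 1, 128, 64))
print(global_vec.shape, local_vec.shape, audio_vec.shape)  # each torch.Size([1, 384])
```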

4.4.2.3 SSW60
Sapsucker Woods 60 (SSW60) [137] is a recent benchmark dataset for advancing research on audiovisual fine-grained categorization. The dataset covers 60 species of birds and comprises images from existing datasets together with brand-new, expert-curated audio and video collections. Unlike existing bird video datasets, which focus primarily on benchmarking video frame classification, SSW60 contains a new collection of expertly curated ten-second video clips for each species, totaling 5,400 video clips. SSW60 also contains an “unpaired”, expert-curated set of ten-second audio recordings for the same set of species, totaling 3,861 audio recordings. The videos in SSW60 come from recordings archived at the Macaulay Library at the Cornell Lab of Ornithology. Each video is a ten-second clip from the original recording, and all videos are converted to a frame rate of 25 FPS. All 60 bird species in SSW60 have unpaired audio recordings from the Macaulay Library. These audio recordings are trimmed to ten-second clips stored in the WAV format at a sampling rate of 22.05 kHz. Images in SSW60 come from two existing datasets: NABirds [134] and iNaturalist 2021 [135]. Cross-modal experiments on the video and audio modalities and multi-modal fusion experiments for audiovisual categorization are performed on SSW60. For the cross-modal experiments, different fixed backbone architectures are utilized for processing the different modalities. The audio waveforms are converted to spectrogram images, and videos are processed by TSN [142], which uses 2D image backbones to encode features and performs fusion on top of them. The experimental procedure is straightforward: the backbone model is trained on a particular training modality and then evaluated directly on an evaluation modality. The backbone model is also trained by first fine-tuning its weights on the training split of the evaluation modality and then evaluating on that modality. The results on videos show that fine-tuning model weights on the target dataset is always beneficial for improving classification accuracy, and that the performance on SSW60 videos


benefits more from “lower quality” training images. For audio, classification accuracy rises as the representation ability of the model improves, although pretraining on higher-resolution images can sometimes hurt the model. A recurring theme is that video is biased toward visual features: although it contains an audio channel, exploiting that channel for classification appears to be difficult. For the multi-modal fusion experiments, the transformer architecture [86] is applied for fusion and different fusion strategies are compared. Three conclusions emerge from the audiovisual fusion investigations [137]. First, the best result from audiovisual fusion is always better than training on each modality separately. Second, there is no single “best” fusion method; sometimes late fusion works best, sometimes score fusion works better. Third, pretraining on external datasets can be very beneficial for the model.
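As a simple, hedged illustration of the fusion strategies mentioned above, and not the benchmark's reference code, score fusion can be realized by averaging per-modality class probabilities, while late fusion concatenates per-modality embeddings before a joint classifier; both helpers below, including the feature dimensions, are illustrative stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def score_fusion(video_logits: torch.Tensor, audio_logits: torch.Tensor) -> torch.Tensor:
    """Average the per-modality class probabilities (score-level fusion)."""
    return 0.5 * (F.softmax(video_logits, dim=-1) + F.softmax(audio_logits, dim=-1))

class LateFusionHead(nn.Module):
    """Concatenate per-modality embeddings and classify jointly (late fusion)."""
    def __init__(self, video_dim: int, audio_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(video_dim + audio_dim, num_classes)

    def forward(self, video_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([video_feat, audio_feat], dim=-1))

# Toy usage on the 60 SSW60 classes (feature sizes are placeholders).
probs = score_fusion(torch.randn(4, 60), torch.randn(4, 60))
logits = LateFusionHead(2048, 512, 60)(torch.randn(4, 2048), torch.randn(4, 512))
```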

4.4.3 Humans-in-the-Loop

Human-in-the-loop methods [171] combine the complementary strengths of human knowledge and computer vision algorithms. Fine-grained recognition with humans in the loop is typically posed in an iterative fashion and requires the vision system to be intelligent about when it queries the human for assistance. Generally, in each round such a system seeks to understand how humans perform recognition, e.g., by asking expert humans to label the image class [21], or by asking them to identify key part locations and select discriminative features [25] for fine-grained recognition. In the following, we present two representative works.

4.4.3.1 Deep Metric Learning with Humans in the Loop
Considering the challenges of limited training data, a large number of fine-grained categories, and high intra-class versus low inter-class variance faced by existing FGVC methods, the approach of [21] was proposed to alleviate them at both the data and the model level. Specifically, this method learns a low-dimensional feature embedding with anchor points on manifolds for each category by deep metric learning. These anchor points capture intra-class variance while remaining discriminative between classes. In each round, images with high confidence scores from the model are sent to humans for labeling. By comparing them with exemplar images, labelers mark each candidate image as either a “true positive” or a “false positive”. True positives are added to the current dataset and false positives are regarded as “hard negatives” for the metric learning model. Then, the model is retrained with the expanded dataset and the hard negatives for the next round. The pipeline of the method is illustrated in Fig. 4.39. Concretely, for dataset bootstrapping, given an initial fine-grained dataset $S_0$ of $N$ categories and a candidate set $C$, the goal is to select a subset $S$ of the images from $C$ that match the original $N$ categories. The candidate set is divided into a list of $k$ subsets, $C = C_1 \cup C_2 \cup \cdots \cup C_k$, and an iterative approach is used


Fig. 4.39 Overall framework of the method of [21]. Using deep metric learning with humans in the loop, a low-dimensional feature embedding is learned for each category, which can be used for fine-grained visual categorization and iterative dataset bootstrapping

for dataset bootstrapping, with $k$ iterations in total. Each iteration consists of three steps; consider the $i$-th iteration. First, a CNN-based classifier is trained using the seed dataset $S_{i-1} \cup H_{i-1}$, where $H_{i-1}$ contains the hard negatives from the previous step. Second, using this classifier, each candidate image $I \in C_i$ is assigned to one of the $N$ categories. Images with a confidence score larger than 0.5 form a high-quality candidate set $D_i \subset C_i$ for the original $N$ categories. Third, human labelers with domain expertise are asked to identify true positives $T_i$ and false positives $F_i$, where $T_i \cup F_i = D_i$. Exemplar images and category definitions are shown to the labelers. Note that these false positives $F_i$ are very similar to the ground truths and are regarded as hard negatives: $H_i \leftarrow H_{i-1} \cup F_i$. The true positives are also included to expand the dataset for the next iteration: $S_i \leftarrow S_{i-1} \cup T_i$. As for deep metric learning, the goal is to learn a non-linear low-dimensional feature embedding $f(\cdot)$ via a CNN, such that given two images $I_j$ and $I_k$, the Euclidean distance between $f(I_j)$ and $f(I_k)$ reflects their semantic dissimilarity (whether or not they come from the same category). Triplet-based metric learning is well suited for this [55]. In each iteration, an input triplet $(I, I_p, I_n)$ is sampled from the training set, where image $I$ is more similar to $I_p$ than to $I_n$. Then, the three images of the triplet are fed into an identical

CNN simultaneously to obtain their non-linear feature embeddings $f(I)$, $f(I_p)$, and $f(I_n)$. $\ell_2$-normalization is applied to eliminate scale differences: $f(I) \leftarrow \frac{f(I)}{\|f(I)\|_2}$. The triplet loss, which is the same as in [141], is used as
$$\mathcal{L}\left(I, I_p, I_n\right) = \max\left(0, \left\|f(I) - f(I_p)\right\|_2^2 - \left\|f(I) - f(I_n)\right\|_2^2 + m\right), \qquad (4.172)$$
where $m$ is a hyper-parameter that controls the distance margin after the embedding. This hinge loss produces a non-zero penalty of $\|f(I) - f(I_p)\|_2^2 - \|f(I) - f(I_n)\|_2^2 + m$ if the $\ell_2$ distance between $I$ and $I_n$ is smaller than the $\ell_2$ distance between $I$ and $I_p$ plus a margin $m$ in feature space. When sampling triplets, an online hard-negative mining scheme is utilized: only triplets that violate the triplet constraint and give a non-zero loss are included in the training [119]. When learning manifolds, suppose there is a training set with $n$ images $I_j$ with labels $C(I_j)$ from $K$ categories, where $j = 1, 2, \ldots, n$ and $C(I_j) \in \{1, 2, \ldots, K\}$. In this setting, considering a reference image $I$ within a fine-grained category, suppose the maximum between-class distance for $I$ in feature space is bounded by $D$, i.e., $\|f(I) - f(I_n)\|_2 \leq D,\ \forall C(I_n) \neq C(I)$. In order to have zero triplet loss for the reference image $I$, the following condition must be satisfied: $\|f(I) - f(I_p)\|_2^2 \leq \|f(I) - f(I_n)\|_2^2 - m,\ \forall C(I_p) = C(I),\ C(I_n) \neq C(I)$. Therefore, $\forall I_j, I_k$ where $C(I_j) = C(I_k) = C(I)$:
$$\left\|f(I_j) - f(I_k)\right\|_2^2 \leq \left\|f(I) - f(I_j)\right\|_2^2 + \left\|f(I) - f(I_k)\right\|_2^2 \leq 2\left(D^2 - m\right), \qquad (4.173)$$
where the squared within-class pairwise distance is bounded by $2(D^2 - m)$. Thus, by using the triplet loss with positives sampled from all images in the same class, all images within that class are mapped into a hypersphere with radius $r = \frac{\sqrt{2(D^2 - m)}}{2}$. As $D^2 - m$ could be very close to or even less than 0 due to small between-class distances, positives are forced to be close to the reference only locally, in which case the model can learn an extended manifold rather than a contracted sphere. After the manifold learning step, a soft voting scheme based on anchor points on the manifolds is adopted for classification. For each category, the anchor points are generated by k-means clustering on the training set in the feature space. Suppose there are $N$ categories and each category has $K$ anchor points. The $j$-th anchor point for category $i$ is represented as $u_{ij}$, where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, K$. Given an input query image $I$, its feature embedding $f(I)$ is first extracted from the network, and then the confidence score for category $i$ is generated as
$$p_i = \frac{\sum_{j=1}^{K} e^{-\gamma \left\|f(I) - u_{ij}\right\|_2^2}}{\sum_{l=1}^{N} \sum_{j=1}^{K} e^{-\gamma \left\|f(I) - u_{lj}\right\|_2^2}}. \qquad (4.174)$$
The predicted label of $I$ is the category with the highest confidence score, $\arg\max_i p_i$. Here $\gamma$ is a parameter controlling the “softness” of the label assignment, and closer anchor points play more significant roles in soft voting. If $\gamma \to \infty$, only the nearest anchor point is considered


and the predicted label is “hard” assigned to be the same as that of the nearest anchor point. On the other hand, if $\gamma \to 0$, all the anchor points contribute equally regardless of their distances to $f(I)$. However, using k-means to generate anchor points for representing manifolds and prediction after metric learning could lead to suboptimal performance. Therefore, the method goes one step further and learns the anchor points by incorporating soft voting into the triplet-based metric learning model. The category label $C(I)$ of the reference image $I$ is also leveraged to learn anchor points for classification. Confidence scores $p_i$ for $f(I)$ can be generated using the anchor points $u_{ij}$ by soft voting as in Eq. (4.174). The classification loss used is the logistic loss on top of the confidence score:
$$\mathcal{L}_{\mathrm{classification}}\left(I, \left\{u_{ij}\right\}, C(I)\right) = -\log p_{C(I)}, \qquad (4.175)$$
where $p_{C(I)}$ is given by Eq. (4.174) with $i$ substituted by $C(I)$. If the confidence score on the true category is very high, $p_{C(I)} \to 1$, then the loss will be very small: $\mathcal{L}_{\mathrm{classification}} \to 0$. The overall loss is the weighted sum of the triplet and classification losses:
$$\mathcal{L} = \omega\, \mathcal{L}_{\mathrm{triplet}} + (1 - \omega)\, \mathcal{L}_{\mathrm{classification}}. \qquad (4.176)$$
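To make Eqs. (4.172) and (4.174)–(4.176) concrete, here is a minimal, hedged sketch (not the authors' code) of the triplet loss on $\ell_2$-normalized embeddings and the anchor-point soft-voting classifier; the margin, $\gamma$, and $\omega$ values are placeholders.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Eq. (4.172) on L2-normalized embeddings of shape (B, d)."""
    f_a, f_p, f_n = (F.normalize(x, dim=-1) for x in (f_a, f_p, f_n))
    d_pos = (f_a - f_p).pow(2).sum(-1)
    d_neg = (f_a - f_n).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

def soft_voting(f, anchors, gamma=5.0):
    """Eq. (4.174): f is (B, d); anchors is (N, K, d). Returns (B, N) scores."""
    d2 = (f[:, None, None, :] - anchors[None]).pow(2).sum(-1)        # (B, N, K)
    weights = torch.exp(-gamma * d2)
    return weights.sum(-1) / weights.sum(dim=(-1, -2))[:, None]      # normalize over all anchors

def total_loss(f_a, f_p, f_n, anchors, labels, omega=0.5):
    """Eq. (4.176): weighted sum of triplet and classification losses."""
    p = soft_voting(F.normalize(f_a, dim=-1), anchors)
    cls = -torch.log(p[torch.arange(len(labels)), labels] + 1e-12).mean()  # Eq. (4.175)
    return omega * triplet_loss(f_a, f_p, f_n) + (1 - omega) * cls

# Toy usage: 4-image batch, 16-d embeddings, 5 categories with 3 anchors each.
B, d, N, K = 4, 16, 5, 3
f_a, f_p, f_n = (torch.randn(B, d) for _ in range(3))
anchors = torch.randn(N, K, d)
labels = torch.randint(0, N, (B,))
loss = total_loss(f_a, f_p, f_n, anchors, labels)
```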

4.4.3.2 Leveraging the Wisdom of the Crowd
In view of the need for a stronger prior for feature selection in fine-grained images, Deng et al. [25] proposed to include humans in the loop to help computers select discriminative features. A novel online game called “Bubbles” is introduced to reveal the discriminative features that humans use. The player's goal is to identify the category of a heavily blurred image. During the game, the player can choose to reveal the full details of circular regions (“bubbles”), at a certain penalty. With a proper setup, the game generates discriminative bubbles of assured quality. Next, the “BubbleBank” representation is proposed, which uses the human-selected bubbles to improve machine recognition performance. Finally, the BubbleBank method is extended to a view-invariant 3D representation. The pipeline of the method is illustrated in Fig. 4.40. In the bubble game, the goal is to correctly classify the center image into one of two categories. A green “bubble” (of adjustable size) follows the mouse cursor as the player hovers over the center image. When the player clicks, the area under the circle is revealed in full detail. If the player answers correctly, she earns new points; otherwise, she loses points. Either way, the game then advances to the next round, with a new center image and possibly a new pair of categories. Note that all images are assumed to have ground-truth class labels so that the player's answers can be judged instantly. The reward of the game is designed such that a player can only earn high scores if she identifies the categories correctly and uses bubbles parsimoniously. First, the penalty for wrong answers is set to be very large. This renders random guessing an ineffective strategy. Also, the player is allowed to pass difficult images or categories with no penalty, so that


Fig. 4.40 Schematic illustration of the method proposed by Deng et al. [25]. The crowd first plays the “Bubbles” game, trying to classify a blurred image into one of the two given categories. During the game, the crowd is allowed to inspect circular regions (“bubbles”), with a penalty of game points. In this process, discriminative regions are revealed. Next, when a computer tries to recognize fine-grained categories, it collects the human-selected bubbles and detects similar patterns on an image. The detection responses are max-pooled to form a “BubbleBank” representation that can be used for learning classifiers

they are not forced to guess. Second, there is a cost associated with the total area revealed: the points for a correct identification decrease as more area is revealed. This encourages careful bubble use. Another issue of game design is determining the amount of blurring for the center image. The game starts with a small amount of blurring and increases it gradually in new rounds until the use of bubbles becomes necessary. Note that this can in fact give potentially useful side information about the scale of the discriminative features. As for the deployment of the game, paid crowdsourcing platforms such as Amazon Mechanical Turk (AMT) are well suited for it. Each AMT task consists of multiple rounds of the game. The worker must score enough points in order to submit the task; otherwise, the games continue indefinitely. The threshold for submission is set high enough that random guessing is infeasible. This ensures that only good workers are able to submit. Notably, there is no need to make approval/reject decisions, as is necessary for conventional tasks: all submissions are guaranteed to be of high quality and can be automatically approved. Having designed and deployed the game, the next problem is how to use the human-selected bubbles to improve recognition. The basic idea is to generate a detector for each bubble and represent each image as a collection of responses from the bubble detectors.


Assuming there are two categories, the intuition is that since each bubble contains discriminative features for recognition, it suffices to detect such patterns in a test image. It is thus natural to obtain a detector for each bubble. Each bubble detector can be represented by a single descriptor such as SIFT [133], or by a concatenation of simple descriptors. This descriptor acts as an image filter: it is convolved with densely sampled patches of a test image, after which the maximum response is taken (max-pooling). To further exploit the cues provided by the bubbles, instead of convolving with the entire image, a pooling region is specified for each detector. That is, each detector operates on a fixed, rectangular region whose center is determined by the relative location of the bubble in the original image. In other words, there is a strong spatial prior about where to detect bubbles for localized objects. A bank of bubble detectors can then be formed (“BubbleBank”), and the image can be represented by the vector of max-pooled responses from all detectors, similar to the ObjectBank [80] representation. A binary classifier can then be learned on top of this representation. Since a bubble that is useful for differentiating a class from a very confusing class is likely also helpful for discriminating the same class against less similar ones, it is unnecessary to obtain bubbles for every pair of categories. To extend the method to multiple classes, a baseline classifier is first trained to find the confusing pairs via cross-validation. Alternatively, if a semantic hierarchy is available and the visual similarity between classes is known to align well with the semantic hierarchy, pairs of categories can be selected directly from within small subtrees. Because the same object part may appear at radically different positions of the image, it makes sense to represent bubble coordinates in a fully 3D, object-centered space to establish more accurate correspondences. The general approach for doing so consists of three steps: (1) estimating the 3D geometry, (2) using the geometry to compute a 3D appearance representation that is invariant with respect to the viewpoint, and (3) extending the bubble detectors themselves to work in 3D space. For estimating the 3D geometry, the proposed method first identifies one (or multiple) CAD model(s) that best fit the image. The matching between a 3D CAD model and a 2D image is implemented by a set of classifiers trained to distinguish between CAD models and viewpoints. In order to match 3D CAD models to 2D images, a massive bank of classifiers is trained for nearly the entire cross-product of CAD models and viewpoints. All classifiers are based on HOG [23] features in connection with a one-versus-all linear SVM. Rather than committing to a single viewpoint, a list of the top $N$ estimates is maintained, and features are max-pooled across all of them for a more accurate estimation of the 3D geometry. The goal of the 3D appearance representation is to ensure that a discriminative local feature is represented only once, as opposed to requiring multiple representations from different viewpoints. This is achieved by transforming local image patches into a unified frame of reference prior to feature computation. The basis of the 3D appearance representation is a dense sampling of image patches.
Patches are sampled directly from the 3D surface of the object of interest relative to its estimated 3D geometry. In particular, thousands of uniformly spaced patch locations on the surface of CAD models are


precomputed by dart throwing [18]. Each patch location comes with an associated surface normal and upward direction, determining its 3D orientation, and a flat, planar rectangle, determining its support region. For feature extraction, all patches visible from the estimated viewpoint are projected into the image, resulting in a set of perspectively distorted quadrilaterals. Prior to feature computation, the projected quadrilaterals are rectified to a common reference rectangle, effectively compensating for perspective projection. To convert the crowd-selected bubbles into 3D space, the 2D clicks of the users are mapped to 3D coordinates using the previously-described densely sampled patches. Taking advantage of the location in both 2D and 3D for each patch, the patch whose 2D location is closest to the location of the user’s click can be determined, and then the patch’s 3D coordinates can be used for the crowdsourced bubble, giving the user’s click a location in 3D space. As a result, rather than pooling in a 2D region centered on the position of the bubble in the image it was extracted from, it is done in 3D space, using the estimated 3D coordinates of each patch. For each bubble, all rectified patches whose 3D coordinates fall within some neighborhood of the bubble’s 3D location are pooled.
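The core of the BubbleBank representation described above is a max over detector responses inside each bubble's pooling region. The following is an illustrative sketch, hypothetical rather than taken from [25], using cosine similarity between bubble descriptors and densely sampled patch descriptors restricted to a spatial window.

```python
import numpy as np

def bubblebank_features(patch_desc, patch_xy, bubble_desc, bubble_xy, window=0.25):
    """Max-pooled bubble-detector responses.

    patch_desc:  (P, d) descriptors of densely sampled image patches.
    patch_xy:    (P, 2) normalized patch centers in [0, 1]^2.
    bubble_desc: (B, d) descriptors of crowd-selected bubbles.
    bubble_xy:   (B, 2) normalized bubble centers (define the pooling regions).
    Returns a (B,) feature vector: one max response per bubble detector.
    """
    # Cosine similarity acts as the detector response.
    p = patch_desc / (np.linalg.norm(patch_desc, axis=1, keepdims=True) + 1e-12)
    b = bubble_desc / (np.linalg.norm(bubble_desc, axis=1, keepdims=True) + 1e-12)
    responses = b @ p.T                                              # (B, P)
    # Restrict each detector to a rectangular pooling region around its bubble.
    in_region = (np.abs(patch_xy[None, :, :] - bubble_xy[:, None, :]) <= window).all(-1)
    responses = np.where(in_region, responses, -np.inf)              # (B, P)
    feats = responses.max(axis=1)
    return np.where(np.isfinite(feats), feats, 0.0)                  # empty regions -> 0

# Toy usage: 500 image patches, 32 bubble detectors, 128-d descriptors.
rng = np.random.default_rng(0)
feats = bubblebank_features(rng.normal(size=(500, 128)), rng.random((500, 2)),
                            rng.normal(size=(32, 128)), rng.random((32, 2)))
```

A linear classifier trained on such vectors then plays the role of the binary (or per-pair) BubbleBank classifier; for the 3D extension, the same pooling would be performed over patches whose estimated 3D coordinates fall near the bubble's 3D location.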

4.5 Summary

The CUB200-2011 [140], Stanford Dogs [67], Stanford Cars [73], and FGVC Aircraft [100] benchmarks are among the most influential datasets in fine-grained recognition. Tables 4.1 and 4.2 summarize results achieved by the fine-grained methods belonging to the three recognition learning paradigms outlined above, i.e., “recognition by localization-classification subnetworks”, “recognition by end-to-end feature encoding”, and “recognition with external information”. A chronological overview can be seen in Fig. 4.1. The main observations can be summarized as follows:

• There is an explicit correspondence between the reviewed methods and the aforementioned challenges of fine-grained recognition. Specifically, the challenge of capturing subtle visual differences can be overcome by localization-classification methods (cf. Sect. 4.2) or via specific construction-based tasks [13, 28, 160], as well as by human-in-the-loop methods. The challenge of characterizing fine-grained tailored features is alleviated by performing high-order feature interactions or by leveraging multi-modality data. Finally, the challenging nature of FGIA can be somewhat addressed by designing specific loss functions [29, 30, 127] to achieve better accuracy.

• Among the different learning paradigms, the “recognition by localization-classification subnetworks” and “recognition by end-to-end feature encoding” paradigms are the two most frequently investigated ones.

• Part-level reasoning about fine-grained object categories boosts recognition accuracy, especially for non-rigid objects, e.g., birds. Modeling the internal semantic interactions/correlations among discriminative parts has attracted increased attention in recent years, cf. [27, 35, 36, 74, 92, 128, 149, 150].

Table 4.2 Comparison of fine-grained “recognition with external information” (cf. Sect. 4.4) on multiple fine-grained benchmark datasets, including Birds (CUB200-2011 [140]), Dogs (Stanford Dogs [67]), Cars (Stanford Cars [73]), and Aircrafts (FGVC Aircraft [100]). “External info.” denotes which kind of external information is used by the respective approach. “Train anno.” and “Test anno.” indicate the supervision used during training and testing, and “–” means the results are unavailable. For each method, the table lists the publication venue, the train/test annotations, the external information, the backbone, the input image resolution, and the accuracy on the four benchmarks. The compared methods with web/auxiliary data are HAR-CNN [159], Xu et al. [164], Krause et al. [72], Niu et al. [106], MetaFGNet [180], Xu et al. [165], Yang et al. [167], Sun et al. [129], and Zhang et al. [174]; the compared methods with multi-modal data are CVL [46], Zhang et al. [175], T-CNN [162], KERL [12], and PMA [125]

Table 4.3 Comparative fine-grained recognition results on CUB200-2011 using different input image resolutions. The results in this table are obtained with a vanilla ResNet-50 trained at the respective resolution

Resolution   224 × 224   280 × 280   336 × 336   392 × 392
Accuracy     81.6%       83.3%       85.0%       85.6%

• Non-rigid fine-grained object recognition (e.g., birds or dogs) is more challenging than recognition of rigid fine-grained objects (e.g., cars or aircraft), which is partly due to the larger variation in object appearance.

• Fine-grained image recognition performance improves as image resolution increases [20]. Comparative results on CUB200-2011 for different image resolutions are reported in Table 4.3.

• There is a trade-off between recognition and localization ability for the “recognition by localization-classification subnetworks” paradigm, which might impact a single integrated network's recognition accuracy. Such a trade-off is also reflected in practice when trying to achieve better recognition results, in that training usually involves alternating optimization of the two networks or separately training the two followed by joint tuning. Alternating or multistage strategies complicate the tuning of the integrated network.

• While effective, most end-to-end encoding networks are less human-interpretable and less consistent in their accuracy across non-rigid and rigid visual domains compared to localization-classification subnetworks. Recently, several works have attempted to understand such higher-order pooling methods by presenting visual interpretations [91] or from an optimization perspective [146].

• “Recognition by localization-classification subnetworks” based methods are challenging to apply when the fine-grained parts are not consistent across the meta-categories (e.g., iNaturalist [136]). Here, unified end-to-end feature encoding methods are more appropriate.

References
1. International Conference on Computer Vision 2019 Workshop on Computer Vision for Wildlife Conservation (2019). https://openaccess.thecvf.com/ICCV2019_workshops/ICCV2019_CVWC 2. Second- and higher-order representations in computer vision. http://users.cecs.anu.edu.au/~koniusz/secordcv-iccv19/ 3. Andriluka M, Roth S, Schiele B (2009) Pictorial structures revisited: people detection and articulated pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1143–1151


4. Bishop CM (2006) Pattern recognition and machine learning, vol 1. Springer 5. Bossard L, Guillaumin M, Gool LV (2014) Food-101—mining discriminative components with random forests. In: Proceedings of the European conference on computer vision, pp 446–461 6. Branson S, Van Horn G, Belongie S, Perona P (2014) Bird species categorization using pose normalized deep convolutional nets. In: Proceedings of the british machine vision conference, pp 1–14 7. Bromley J, Guyon I, LeCun Y, Säckinger E, Shah R (1993) Signature verification using a “siamese” time delay neural network. In: Advances in neural information processing systems, pp 737–744 8. Cai S, Zuo W, Zhang L (2017) Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In: Proceedings of the IEEE international conference on computer vision, pp 511–520 9. Cao Y, Xu J, Lin S, Wei F, Hu H (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE international conference on computer vision workshops, pp 1–10 10. Chang D, Ding Y, Xie J, Bhunia AK, Li X, Ma Z, Wu M, Guo J, Song YZ (2020) The devil is in the channels: mutual-channel loss for fine-grained image classification. IEEE Trans Image Process 29:4683–4695 11. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 12. Chen T, Lin L, Chen R, Wu Y, Luo X (2018) Knowledge-embedded representation learning for fine-grained image recognition. In: Proceedings of the international joint conferences on artificial intelligence, pp 627–634 13. Chen Y, Bai Y, Zhang W, Mei T (2019) Destruction and construction learning for fine-grained image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5157–5166 14. Chen ZM, Wei XS, Wang P, Guo Y (2019) Multi-label image recognition with graph convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5177–5186 15. Chu G, Potetz B, Wang W, Howard A, Song Y, Brucher F, Leung T, Adam H (2019) Geo-aware networks for fine-grained recognition. In: Proceedings of the IEEE international conference on computer vision Workshops 16. Cimpoi M, Maji S, Vedaldi A (2015) Deep filter banks for texture recognition and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3828– 3836 17. Clevert DA, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289 18. Cline D, Jeschke S, White K, Razdan A, Wonka P (2009) Dart throwing on surfaces. In: Proceedings of the Eurographics conference on rendering, pp 1217–1226 19. Corbetta M, Shulman GL (2002) Control of goal-directed and stimulus-driven attention in the brain. Nat Revi Neurosci 3:201–215 20. Cui Y, Song Y, Sun C, Howard A, Belongie S (2018) Large scale fine-grained categorization and domain-specific transfer learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4109–4118 21. Cui Y, Zhou F, Lin Y, Belongie S (2016) Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1153–1162


22. Cui Y, Zhou F, Wang J, Liu X, Lin Y, Belongie S (2017) Kernel pooling for convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2921–2930 23. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 886–893 24. Deng J, Ding N, Jia Y, Frome A, Murphy K, Bengio S, Li Y, Neven H, Adam H (2014) Large-scale object classification using label relation graphs. In: Proceedings of the European conference on computer vision, pp 48–64 25. Deng J, Krause J, Stark M, Fei-Fei L (2016) Leveraging the wisdom of the crowd for fine-grained recognition. IEEE Trans Pattern Anal Mach Intell 38(4):666–676 26. Dietterich TG, Bakiri G (1994) Solving multiclass learning problems via error-correcting output codes. J Arti Intell Res 2:263–286 27. Ding Y, Zhou Y, Zhu Y, Ye Q, Jiao J (2019) Selective sparse sampling for fine-grained image recognition. In: Proceedings of the IEEE international conference on computer vision, pp 6599– 6608 28. Du R, Chang D, Bhunia AK, Xie J, Song YZ, Ma Z, Guo J (2020) Fine-grained visual classification via progressive multi-granularity training of Jigsaw patches. In: Proceedings of the European conference on computer vision, pp 153–168 (2020) 29. Dubey A, Gupta O, Guo P, Raskar R, Farrell R, Naik N (2018) Pairwise confusion for finegrained visual classification. In: Proceedings of the European conference on computer vision, pp 71–88 30. Dubey A, Gupta O, Raskar R, Naik N (2018) Maximum entropy fine-grained classification. In: Advances in neural information processing systems, pp 637–647 31. Engin M, Wang L, Zhou L, Liu X (2018) DeepKSPD: Learning kernel-matrix-based SPD representation for fine-grained image recognition. In: Proceedings of the European conference on computer vision, pp 629–645 32. Fu J, Zheng H, Mei T (2017) Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4438–4446 33. Gao BB, Wei XS, Wu J, Lin W (2015) Deep spatial pyramid: The devil is once again in the details. arXiv:1504.05277 34. Gao Y, Beijbom O, Zhang N, Darrell T (2016) Compact bilinear pooling. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 317–326 35. Gao Y, Han X, Wang X, Huang W, Scott MR (2020) Channel interaction networks for finegrained image categorization. In: Proceedings of the conference on AAAI, pp 10818–10825 36. Ge W, Lin X, Yu Y (2019) Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 3034–3043 37. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587 38. Girshick R, Donahue J, Darrell T, Malik J (2016) Region-based convolutional networks for accurate object detection and segmentation (1):142–158 39. Golub GH, Hansen PC, O’Leary DP (1999) Tikhonov regularization and total least squares. SIAM J Matrix Anal Appl 21(1):185–194 40. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680


41. Gosselin PH, Murray N, Jégou H, Perronnin F (2015) Revisiting the fisher vector for fine-grained classification. Pattern Recogn Lett 49:92–98 42. Guillaumin M, Küttel D, Ferrari V (2014) ImageNet auto-annotation with segmentation propagation. Int. J Comput Vis 110:328–348 43. Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I, Sugiyama M (2018) Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: Advances in neural information processing systems, pp 8527–8537 44. Hariharan B, Malik J, Ramanan D (2012) Discriminative decorrelation for clustering and classification. In: Proceedings of the European conference on computer vision, pp 459–472 45. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 46. He X, Peng Y (2017) Fine-grained image classification via combining vision and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5994–6002 47. He X, Peng Y (2017) Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification. In: Proceedings of the conference on AAAI, pp 4075–4081 48. He X, Peng Y, Zhao J (2018) StackDRL: stacked deep reinforcement learning for fine-grained visual categorization. In: Proceedings of the international joint conference on artificial intelligence, pp 741–747 49. He X, Peng Y, Zhao J (2019) Fast fine-grained image classification via weakly supervised discriminative localization. IEEE Trans Circuits Syst Video Technol 29(5):1394–1407 50. He X, Peng Y, Zhao J (2019) Which and how many regions to gaze: focus discriminative regions for fine-grained visual categorization. Int J Comput Vis 127:1235–1255 51. Hermans A, Beyer L, Leibe B (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737 52. Higham NJ (2008) Functions of matrices: theory and computation. SIAM 53. Hinton G, Vinyals O, Dean J et al (2015) Distilling the knowledge in a neural network. arXiv:1503.02531 54. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780 55. Hoffer E, Ailon N (2015) Deep metric learning using triplet network. In: Int. Workshop on similarity-based pattern recognition, pp 84–92 56. Hospedales T, Antoniou A, Micaelli P, Storkey A (2020) Meta-learning in neural networks: a survey. arXiv:2004.05439 57. Hou S, Feng Y, Wang Z (2017) VegFru: a domain-specific dataset for fine-grained visual categorization. In: Proceedings of the IEEE international conference on computer vision, pp 541–549 58. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141 59. Huang S, Xu Z, Tao D, Zhang Y (2016) Part-stacked CNN for fine-grained visual categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1173– 1182 60. Huang Z, Li Y (2020) Interpretable and accurate fine-grained recognition via region grouping. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8662– 8672 61. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the international conference on machine learning, pp 448–456 62. Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20(11):1254–1259 63. 
Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K (2015) Spatial transformer networks. In: Advances in neural information processing systems, pp 2017–2025


64. Ji R, Wen L, Zhang L, Du D, Wu Y, Zhao C, Liu X, Huang F (2020) Attention convolutional binary neural tree for fine-grained visual categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10468–10477 (2020) 65. Jonsson D (1982) Some limit theorems for the eigenvalues of a sample covariance matrix. J Multivar Anal 12(1):1–38 66. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: a survey. J Artif Intell Res 4:237–285 67. Khosla A, Jayadevaprakash N, Yao B, Fei-Fei L (2011) Novel dataset for fine-grained image categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Workshop on Fine-Grained Visual Categorization, pp 806–813 68. Kong S, Fowlkes, C (2017) Low-rank bilinear pooling for fine-grained classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 365–374 (2017) 69. Koniusz P, Yan F, Gosselin PH, Mikolajczyk K (2017) Higher-order occurrence pooling for bags-of-words: visual concept detection 39(2):313–326 70. Koniusz P, Zhang H (2022) Power normalizations in fine-grained image, few-shot image and graph classification (2):591–609 71. Krause J, Jin H, Yang J, Fei-Fei L (2015) Fine-grained recognition without part annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5546–5555 72. Krause J, Sapp B, Howard A, Zhou H, Toshev A, Duerig T, Philbin J, Fei-Fei L (2016) The unreasonable effectiveness of noisy data for fine-grained recognition. In: Proceedings of the European conference on computer vision, pp 301–320 73. Krause J, Stark M., Deng J, Fei-Fei L 3D object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision Workshop on 3D Representation and Recognition (2013) 74. Lam M, Mahasseni B, Todorovic S (2017) Fine-grained recognition as HSnet search for informative image parts. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2520–2529 75. Lample G, Conneau A, Denoyer L, Ranzato M (2017) Unsupervised machine translation using monolingual corpora only. arXiv:1711.00043 76. Larochelle H, Hinton G (2010) Learning to combine foveal glimpses with a third-order boltzmann machine. In: Advances in neural information processing systems, pp 1243–1251 77. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8 78. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444 79. Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C (2015) DBpedia—a large-scale, multilingual knowledge base extracted from Wikipedia. Semant Web J 167–195 80. Li LJ, Su H, Fei-Fei L, Xing E (2010) Object bank: A high-level image representation for scene classification & semantic feature sparsification. Advances in neural information processing systems, vol 23 81. Li P, Xie J, Wang Q, Gao Z (2018) Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 947–955 82. Li P, Xie J, Wang Q, Zuo W (2017) Is second-order information helpful for large-scale visual recognition? In: Proceedings of the IEEE international conference on computer vision, pp 2070–2078


83. Li Y, Wang N, Liu J, Hou X (2017) Factorized bilinear models for image recognition. In: Proceedings of the IEEE international conference on computer vision, pp 2079–2087 84. Lin D, Shen X, Lu C, Jia J (2015) Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1666–1674 85. Lin M, Chen Q, Yan S (2013) Network in network. arXiv:1312.4400 86. Lin T, Wang Y, Liu X, Qiu X (2021) A survey of transformers. arXiv:2106.04554 87. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2117–2125 88. Lin TY, Maji S (2017) Improved bilinear pooling with CNNs. In: Proceedings of the british machine vision conference, pp 1–12 89. Lin TY, Maji S, Koniusz P (2018) Second-order democratic aggregation. In: Proceedings of the European conference on computer vision, pp 620–636 90. Lin TY, RoyChowdhury A, Maji S (2015) Bilinear CNN models for fine-grained visual recognition. In: Proceedings of the IEEE international conference on computer vision, pp 1449–1457 91. Lin TY, RoyChowdhury A, Maji S (2018) Bilinear convolutional neural networks for finegrained visual recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1309–1322 92. Liu C, Xie H, Zha ZJ, Ma L, Yu L, Zhang Y (2020) Filtration and distillation: enhancing region attention for fine-grained visual categorization. In: Proceedings of the conference on AAAI, pp 11555–11562 93. Liu L, Shen C, van den Hengel A (2015) The treasure beneath convolutional layers: crossconvolutional-layer pooling for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4749–4757 94. Liu X, Wang J, Wen S, Ding E, Lin Y (2017) Localizing by describing: attribute-guided attention localization for fine-grained recognition. In: Proceedings of the conference on AAAI, pp 4190– 4196 95. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431– 3440 96. Lu J, Zhou Z, Leung T, Li L, Li F (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In: Proceedings of the international conference on machine learning, pp 2304–2313 97. Luo W, Yang X, Mo X, Lu Y, Davis LS, Li J, Yang J, Lim SN (2019) Cross-X learning for fine-grained visual categorization. In: Proceedings of the IEEE international conference on computer vision, pp 8242–8251 98. Lv J, Xu M, Feng L, Niu G, Geng X, Sugiyama M (2020) Progressive identification of true labels for partial-label learning. In: Proceedings of the international conference on machine learning, pp 6500–6510 99. Mac Aodha O, Cole E, Perona P (2019) Presence-only geographical priors for fine-grained image classification. In: Proceedings of the IEEE international conference on computer vision, pp 9596–9606 100. Maji S, Rahtu E, Kannala J, Blaschko M, Vedaldi A (2013) Fine-grained visual classification of aircraft. arXiv:1306.5151 101. Maturana D, Scherer S (2015) VoxNet: a 3D convolutional neural network for real-time object recognition. In: Proceedings of the international conference on intelligent robots and systems, pp 922–928 102. Min S, Yao H, Xie H, Zha ZJ, Zhang Y (2020) Multi-objective matrix normalization for finegrained visual recognition. 
IEEE Trans Image Process 29:4996–5009


103. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533 104. Mu J, Bhat S, Viswanath P (2017) Representing sentences as low-rank subspaces. In: Proceedings of the conference on association for computational linguistics, pp 629–634 105. Nilsback ME, Zisserman A (2008) Automated flower classification over a large number of classes. In: The Indian conference on computer vision, graphics and image processing, pp 722–729 106. Niu L, Veeraraghavan A, Sabharwal A (2018) Webly supervised learning meets zero-shot learning: A hybrid approach for fine-grained classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7171–7180 107. Pearl J (2013) Direct and indirect effects. arXiv:1301.2300 108. Pearl J, Mackenzie D (2018) The book of why: the new science of cause and effect. Basic Books (2018) 109. Peng Y, He X, Zhao J (2018) Object-part attention model for fine-grained image classification. IEEE Trans Image Process 27(3):1487–1500 110. Perronnin F, Sánchez J, Mensink T (2010) Improving the Fisher kernel for large-scale image classification. In: Proceedings of the European conference on computer vision, pp 143–156 111. Pham N, Pagh R (2013) Fast and scalable polynomial kernels via explicit feature maps. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 239–247 112. Rao Y, Chen G, Lu J, Zhou J (2021) Counterfactual attention learning for fine-grained visual categorization and re-identification. In: Proceedings of the IEEE international conference on computer vision, pp 1025–1034 113. Recasens A, Kellnhofer P, Stent S, Matusik W, Torralba A (2018) Learning to zoom: a saliencybased sampling layer for neural networks. In: Proceedings of the European conference on computer vision, pp 51–66 114. Reed S, Akata Z, Lee H, Schiele B (2016) Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 49–58 115. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99 116. Rother C, Kolmogorov V, Blake A (2004) “GrabCut” interactive foreground extraction using iterated graph cuts. ACM Trans Graph 23(3):309–314 117. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115:211–252 118. Sánchez J, Perronnin F, Mensink T, Verbeek J (2013) Image classification with the fisher vector: theory and practice. Int J Comput Vis 105(3):222–245 119. Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 815–823 120. Shi F, Guo J, Zhang H, Yang S, Wang X, Guo Y (2021) GLAVNet: Global-local audio-visual cues for fine-grained material recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 14433–14442 121. Simon M, Rodner E (2015) Neural activation constellations: Unsupervised part model discovery with convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 1143–1151


122. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the international conference on learning representations, pp 1–14 123. Smola AJ, Kondor R (2003) Kernels and regularization on graphs. In: Proceedings of the learning theory and kernel machines, pp 144–158 124. Sohn K (2016) Improved deep metric learning with multi-class n-pair loss objective. In: Advances in neural information processing systems, vol 29 125. Song K, Wei XS, Shu X, Song RJ, Lu J (2020) Bi-modal progressive mask attention for finegrained recognition. IEEE Trans Image Process 29:7006–7018 126. Srivastava N, Salakhutdinov RR (2013) Discriminative transfer learning with tree-based priors. Adva Neural Inf Process Syst 26:2094–2102 127. Sun G, Cholakkal H, Khan S, Khan FS, Shao L (2020) Fine-grained recognition: accounting for subtle differences between similar classes. In: Proceedings of the conference on AAAI, pp 12047–12054 128. Sun M, Yuan Y, Zhou F, Ding E (2018) Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European conference on computer vision, pp 834–850 129. Sun X, Chen L, Yang J (2019) Learning from web data using adversarial discriminative neural networks for fine-grained classification. In: Proceedings of the conference on AAAI, pp 273– 280 130. Sun Z, Yao Y, Wei X, Zhang Y, Shen F, Wu J, Zhang J, Shen H (2021) Webly supervised fine-grained recognition: Benchmark datasets and an approach. In: Proceedings of the IEEE international conference on computer vision, pp 10602–10611 131. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge, MA vol 22447 132. Tenenbaum JB, Freeman WT (2000) Separating style and content with bilinear models. Neural Comput 12(6):1247–1283 133. Van De Sande K, Gevers T, Snoek C (2009) Evaluating color descriptors for object and scene recognition. IEEE Trans Pattern Anal Mach Intell 32(9):1582–1596 134. Van Horn G, Branson S, Farrell R, Haber S, Barry J, Ipeirotis P, Perona P, Belongie S (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 595–604 135. Van Horn G, Cole E, Beery S, Wilber K, Belongie S, Mac Aodha O (2021) Benchmarking representation learning for natural world image collections. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12884–12893 136. Van Horn G, Mac Aodha O, Song Y, Cui Y, Sun C, Shepard A, Adam H, Perona P, Belongie S (2017) The iNaturalist species classification and detection dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8769–8778 137. Van Horn G, Qian R, Wilber K, Adam H, Mac Aodha O, Belongie S (2022) Exploring finegrained audiovisual categorization with the ssw60 dataset. In: Proceedings of the European conference on computer vision, pp 271–289 138. VenderWeele T (2015) Explanation in causal inference: methods for mediation and interaction. Oxford University Press 139. Vicente S, Rother C, Kolmogorov V (2011) Object cosegmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2217–2224 140. Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD birds-200-2011 dataset. Technical report CNS-TR-2011-001


141. Wang J, Song Y, Leung T, Rosenberg C, Wang J, Philbin J, Chen B, Wu Y (2014) Learning fine-grained image similarity with deep ranking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1386–1393 142. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2018) Temporal segment networks for action recognition in videos. IEEE Trans Pattern Anal Mach Intell 41(11):2740– 2755 143. Wang,L, Zhang J, Zhou L, Tang C, Li W (2015) Beyond covariance: feature representation with nonlinear kernel matrices. In: Proceedings of the IEEE international conference on computer vision, pp 4570–4578 144. Wang Q, Li P, Zhang L (2017) G2 DeNet: global gaussian distribution embedding network and its application to visual recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2730–2739 145. Wang Q, Xie J, Zuo W, Zhang L, Li P (2021) Deep CNNs meet global covariance pooling: better representation and generalization. IEEE Trans Pattern Anal Mach Intell 43(8):2582–2597 146. Wang Q, Zhang L, Wu B, Ren D, Li P, Zuo W, Hu Q (2020) What deep CNNs benefit from global covariance pooling: An optimization perspective. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10771–10780 147. Wang Y, Choi J, Morariu VI, Davis LS (2016) Mining discriminative triplets of patches for fine-grained classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1163–1172 148. Wang Y, Morariu VI, Davis LS (2018) Learning a discriminative filter bank within a CNN for fine-grained recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4148–4157 149. Wang Z, Wang S, Li H, Dou Z, Li J (2020) Graph-propagation based correlation learning for weakly supervised fine-grained image classification. In: Proceedings of the conference on AAAI, pp 12289–12296 150. Wang Z, Wang S, Yang S, Li H, Li J, Li Z (2020) Weakly supervised fine-grained image classification via guassian mixture model oriented discriminative learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9749–9758 151. Wei X, Zhang Y, Gong Y, Zhang J, Zheng N (2018) Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification. In: Proceedings of the European conference on computer vision, pp 365–380 152. Wei XS, Cui Q, Yang L, Wang P, Liu L, Yang J (2022) RPC: a large-scale and fine-grained retail product checkout dataset. Sci China Inf Sci 65(9):197101 153. Wei XS, Wang P, Liu L, Shen C, Wu J (2019) Piecewise classifier mappings: learning fine-grained learners for novel categories with few examples. IEEE Trans Image Process 28(12):6116–6125 154. Wei XS, Xie CW, Wu J, Shen C (2018) Mask-CNN: localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recogn 76:704–714 155. Wei XS, Zhang CL, Wu J, Shen C, Zhou ZH (2019) Unsupervised object discovery and colocalization by deep descriptor transformation. Pattern Recogn 88:113–126 156. Wei Y, Tran S, Xu S, Kang B, Springer M (2020) Deep learning for retail product recognition: challenges and techniques. Comput Intell Neurosci 128:1–23 157. Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemometr Intell Lab Syst 2(1–3):37–52 158. Xiao T, Xu Y, Yang K, Zhang J, Peng Y, Zhang Z (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 842–850


159. Xie S, Yang T, Wang X, Lin Y (2015) Hyper-class augmented and regularized deep learning for fine-grained image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2645–2654 160. Xiong W, He Y, Zhang Y, Luo W, Ma L, Luo J (2020) Fine-grained image-to-image transformation towards visual recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5840–5849 161. Xu F, Wang M, Zhang W, Cheng Y, Chu W (2021) Discrimination-aware mechanism for finegrained representation learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 813–822 162. Xu H, Qi G, Li J, Wang M, Xu K, Gao H (2018) Fine-grained image classification by visualsemantic embedding. In: Proceedings of the international joint conference on artificial intelligence, pp 1043–1049 163. Xu Y, Shen Y, Wei X, Yang J (2022) Webly-supervised fine-grained recognition with partial label learning. In: Proceedings of the international joint conference on artificial intelligence, pp 1502–1508 164. Xu Z, Huang S, Zhang Y, Tao D (2015) Augmenting strong supervision using web data for fine-grained categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2524–2532 165. Xu Z, Huang S, Zhang Y, Tao D (2018) Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Trans Pattern Anal Mach Intell 40(4):769–790 166. Xu Z, Yang Y, Hauptmann AG (2015) A discriminative CNN video representation for event detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1798–1807 167. Yang J, Sun X, Lai YK, Zheng L, Cheng MM (2018) Recognition from web data: a progressive filtering approach. IEEE Trans Image Process 27(11):5303–5315 168. Yang Z, Luo T, Wang D, Hu Z, Gao J, Wang L (2018) Learning to navigate for fine-grained classification. In: Proceedings of the European conference on computer vision, pp 438–454 169. Yu C, Zhao X, Zheng Q, Zhang P, You X (2018) Hierarchical bilinear pooling for fine-grained visual recognition. In: Proceedings of the European conference on computer vision, pp 595–610 170. Yu Z, Yu J, Fan J, Tao D (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 1821–1830 171. Zanzotto FM (2019) Viewpoint: human-in-the-loop artificial intelligence. J Artif Intell Res 64:243–252 172. Zeiler M, Fergus R (2014) Visualizing and understanding convolutional networks. In: Proceedings of the European conference on computer vision, pp 818–833 173. Zeiler MD, Taylor GW, Fergus R (2011) Adaptive deconvolutional networks for mid and high level feature learning. In: Proceedings of the IEEE international conference on computer vision, pp 2018–2025 174. Zhang C, Yao Y, Liu H, Xie GS, Shu X, Zhou T, Zhang Z, Shen F, Tang Z (2020) Web-supervised network with softly update-drop training for fine-grained visual classification. In: Proceedings of the conference on AAAI, pp 12781–12788 175. Zhang H, Cao X, Wang R (2018) Audio visual attribute discovery for fine-grained object recognition. In: Proceedings of the conference on AAAI, pp 7542–7549 176. Zhang H, Xu T, Elhoseiny M, Huang X, Zhang S, Elgammal A, Metaxas D (2016) SPDA-CNN: unifying semantic part detection and abstraction for fine-grained recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1143–1152


177. Zhang L, Huang S, Liu W, Tao D (2019) Learning a mixture of granularity-specific experts for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision, pp 8331–8340 178. Zhang N, Donahue J, Girshick R, Darrell T (2014) Part-based R-CNNs for fine-grained category detection. In: Proceedings of the European conference on computer vision, pp 834–849 179. Zhang X, Xiong H, Zhou W, Lin W, Tian Q (2016) Picking deep filter responses for finegrained image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1134–1142 180. Zhang Y, Tang H, Jia K (2018) Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In: Proceedings of the European conference on computer vision, pp 241–256 181. Zhang Y, Wei XS, Wu J, Cai J, Lu J, Nguyen VA, Do MN (2016) Weakly supervised fine-grained categorization with part-based image representation. IEEE Trans Image Process 25(4):1713– 1725 182. Zheng H, Fu J, Mei T, Luo J (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE international conference on computer vision, pp 5209–5217 183. Zheng H, Fu J, Zha ZJ, Luo J (2019) Learning deep bilinear transformation for fine-grained image representation. In: Advances in neural information processing systems, pp 4277–4286 184. Zheng H, Fu J, Zha ZJ, Luo J (2019) Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5012–5021 185. Zheng H, Fu J, Zha ZJ, Luo J, Mei T (2020) Learning rich part hierarchies with progressive attention networks for fine-grained image recognition. IEEE Trans Image Process 29:476–488 186. Zhong Z, Zheng L, Kang G, Li S, Yang Y (2017) Random erasing data augmentation. arXiv:1708.04896 187. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2921–2929 188. Zhou F, Lin Y (2016) Fine-grained image classification by exploring bipartite-graph labels. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1124–1133 189. Zhou Y, Zhu Y, Ye Q, Qiu Q, Jiao J (2018) Weakly supervised instance segmentation using class peak response. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3791–3800 190. Zhuang B, Liu L, Li Y, Shen C, Reid I (2017) Attend in groups: a weakly-supervised deep learning framework for learning from web data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1878–1887 191. Zhuang P, Wang Y, Qiao Y (2020) Learning attentive pairwise interaction for fine-grained classification. In: Proceedings of the conference on AAAI, pp 2457–2463

5 Fine-Grained Image Retrieval

In this chapter, we focus on modern fine-grained image retrieval approaches based on deep learning. Two main areas of fine-grained retrieval are covered, i.e., content-based fine-grained retrieval and sketch-based fine-grained retrieval. Finally, as a summary, we discuss the common insights shared by fine-grained recognition and fine-grained retrieval.

5.1 Introduction

Fine-grained retrieval is another fundamental aspect of FGIA that has gained increasing traction in recent years. What distinguishes fine-grained retrieval from fine-grained recognition is that, in addition to estimating the sub-category correctly, it is also necessary to rank all the instances so that images belonging to the same sub-category are ranked highest based on the fine-grained details in the query. Specifically, in fine-grained retrieval we are given a database of images of the same meta-category (e.g., birds or cars) and a query, and the goal is to return images related to the query based on relevant fine-grained features. Compared to generic image retrieval, which focuses on retrieving near-duplicate images based on similarities in their content (e.g., texture, color, and shapes), fine-grained retrieval focuses on retrieving images of the same category type (e.g., the same subordinate species of animal or the same model of vehicle). What makes it more challenging is that objects of fine-grained categories exhibit only subtle differences, and can vary in pose, scale, and orientation, or can contain large cross-modal differences (e.g., in the case of sketch-based retrieval). Fine-grained retrieval techniques have been widely used in commercial applications, e.g., e-commerce (searching fine-grained products [17]), touch-screen devices (searching fine-grained objects by sketches [34]), crime prevention (searching face photos [18]), among others. Depending on the type of query image, the most studied areas of fine-grained image retrieval can be separated into two groups: fine-grained content-based image


Fig. 5.1 An illustration of fine-grained content-based image retrieval (FG-CBIR). Given a query image (aka probe) depicting a "Dodge Charger Sedan 2012", fine-grained retrieval is required to return images of the same car model from a car database (aka gallery). In this figure, the fourth returned image, marked with a red outline, is incorrect, as it shows a different car model, a "Dodge Caliber Wagon 2012"

Fig. 5.2 An illustration of fine-grained sketch-based image retrieval (FG-SBIR), where a free-hand human sketch serves as the query for instance-level retrieval of images. FG-SBIR is challenging due to (1) the fine-grained and cross-domain nature of the task and (2) the highly abstract nature of free-hand sketches, which makes fine-grained matching even more difficult

retrieval (FG-CBIR, cf. Fig. 5.1) and fine-grained sketch-based image retrieval (FG-SBIR, cf. Fig. 5.2). Fine-grained image retrieval can also be expanded into fine-grained cross-media retrieval [15], which can utilize one media type to retrieve any media type, for example using an image to retrieve relevant text, video, or audio. For performance evaluation, following the standard convention, FG-CBIR performance is typically measured using Recall@K [37], which is the average recall score over all M query images in the test set. For each query, the top K relevant images are returned. The recall score is 1 if there is at least one positive image among the top K returned images, and 0 otherwise. Formally, Recall@K is defined as
$\mathrm{Recall@}K = \frac{1}{M}\sum_{i=1}^{M}\mathrm{score}_i$ .    (5.1)


For measuring FG-SBIR performance, Accuracy@K is commonly used, which is the percentage of sketches whose true-match photos are ranked in the top K:
$\mathrm{Accuracy@}K = \frac{|I^{K}_{\mathrm{correct}}|}{K}$ ,    (5.2)
where $|I^{K}_{\mathrm{correct}}|$ is the number of true-match photos in the top K.
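To make the two evaluation protocols concrete, the following is a minimal NumPy sketch (not taken from any released toolbox) that computes Recall@K as in Eq. (5.1) and Accuracy@K following the prose description above; all variable names and the toy data are illustrative.

```python
import numpy as np

def recall_at_k(ranked_labels, query_labels, k):
    """Recall@K (Eq. (5.1)): score 1 for a query if at least one of its top-K
    returned images shares the query's sub-category label, else 0; average over queries."""
    scores = [1.0 if np.any(np.asarray(ret[:k]) == q) else 0.0
              for ret, q in zip(ranked_labels, query_labels)]
    return float(np.mean(scores))

def accuracy_at_k(ranked_ids, true_match_ids, k):
    """Accuracy@K for FG-SBIR: fraction of sketch queries whose true-match photo
    id appears among the top-K returned photo ids."""
    hits = [1.0 if true_id in ret[:k] else 0.0
            for ret, true_id in zip(ranked_ids, true_match_ids)]
    return float(np.mean(hits))

# Toy example: 2 queries, gallery results already ranked by descending similarity.
ranked_labels = [[3, 3, 7, 1], [5, 2, 2, 5]]   # labels of returned images per query
query_labels = [7, 4]                          # ground-truth labels of the queries
print(recall_at_k(ranked_labels, query_labels, k=2))   # 0.0 (no hit in either top-2)
print(recall_at_k(ranked_labels, query_labels, k=3))   # 0.5 (first query hits at rank 3)
```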

5.2 Content-Based Fine-Grained Image Retrieval

SCDA [45] is one of the earliest examples of fine-grained image retrieval based on deep learning. It employs a pre-trained CNN to select meaningful deep descriptors by localizing the main object in an image without using explicit localization supervision. Unsurprisingly, it shows that selecting only useful deep descriptors, by removing background features, can significantly benefit retrieval performance in such an unsupervised retrieval setting (i.e., requiring no image labels). More recently, supervised metric learning based approaches (e.g., [52, 53]) have been proposed to overcome the retrieval accuracy limitations of unsupervised retrieval. These methods still include additional sub-modules specifically tailored for fine-grained objects, e.g., the weakly-supervised localization module proposed in [52], which is in turn inspired by [45]. CRL-WSL [52] employed a centralized ranking loss with a weakly-supervised localization approach to train its feature extractor. DGCRL [53] eliminated the gap between the inner product and the Euclidean distance in the training and test stages by adding a Normalize-Scale layer, enhancing intra-class compactness and inter-class separability with its Decorrelated Global-aware Centralized Ranking Loss. More recently, the Piecewise Cross Entropy loss [49] was proposed by modifying the traditional cross entropy function to reduce the confidence of the fine-grained model, which is similar in spirit to following the natural prediction confidence scores of fine-grained categories [8, 9]. The performance of recent fine-grained content-based image retrieval approaches is reported in Table 5.1. Although supervised metric learning based retrieval methods outperform their unsupervised counterparts, the absolute recall scores (i.e., Recall@K) of the retrieval task still leave room for improvement. One promising direction is to integrate advanced techniques that have proven successful for fine-grained recognition, e.g., attention mechanisms or higher-order feature interactions, into FG-CBIR to achieve better retrieval accuracy. However, new large-scale FG-CBIR datasets are required to drive future progress; ideally, these would also be associated with additional characteristics or challenges, e.g., open-world sub-category retrieval (cf. Sect. 2.2). In the following, we elaborate on three representative works of content-based fine-grained image retrieval.

Table 5.1 Comparison of recent fine-grained content-based image retrieval methods on CUB200-2011 [42] and Stanford Cars [20]. Recall@K is the average recall over all query images in the test set

Methods          | Published in               | Backbones | Supervised | Img. Resolution | Birds Recall@1/2/4/8 (%) | Cars Recall@1/2/4/8 (%)
SCDA [45]        | TIP 2017                   | VGG-16    |            | 224 × 224       | 62.2 / 74.2 / 83.2 / 90.1 | 58.5 / 69.8 / 79.1 / 86.2
CRL-WSL [52]     | IJCAI 2018                 | VGG-16    | Yes        | 224 × 224       | 65.9 / 76.5 / 85.3 / 90.3 | 63.9 / 73.7 / 82.1 / 89.2
DGCRL [53]       | AAAI 2019                  | ResNet-50 | Yes        | Not given       | 67.9 / 79.1 / 86.2 / 91.8 | 75.9 / 83.9 / 89.7 / 94.0
Zeng et al. [49] | Image and Vis. Comp. 2020  | ResNet-50 | Yes        | 224 × 224       | 70.1 / 79.8 / 86.9 / 92.0 | 86.7 / 91.7 / 95.2 / 97.0

5.2.1 Selective Convolutional Descriptor Aggregation


Selective Convolutional Descriptor Aggregation (SCDA) [45] focuses on a challenging task in the purely unsupervised setting: fine-grained image retrieval. Even with image labels, fine-grained images are difficult to classify, let alone in the unsupervised retrieval setting. SCDA first localizes the main object in fine-grained images, a step that discards the noisy background and keeps useful deep descriptors. The selected descriptors are then aggregated and reduced in dimensionality to a short feature vector. SCDA is unsupervised, using neither image labels nor bounding box annotations. Concretely, given an input image I, the activations of a convolution layer are formulated as an order-3 tensor X with H × W × d elements, which include a set of 2D feature maps $S = \{S_n\}$ ($n = 1, \ldots, d$). $S_n$, of size H × W, is the feature map of the corresponding n-th channel. In other words, X can also be considered as having H × W cells, each of which contains one d-dimensional deep descriptor. SCDA denotes the deep descriptors as $D = \{x_{(i,j)}\}$, where (i, j) is a particular cell ($i \in \{1, \ldots, H\}$, $j \in \{1, \ldots, W\}$, $x_{(i,j)} \in \mathbb{R}^d$). For instance, by employing the pre-trained VGG-16 model [36] to extract deep descriptors, SCDA gets a 7 × 7 × 512 activation tensor in pool5 if the input image is 224 × 224. After obtaining the pool5 activations, the input image I is represented by the order-3 tensor X, which is a sparse and distributed representation [2, 16]. SCDA then proposes a simple yet effective method (shown in Fig. 5.3). It adds up the obtained pool5 activation tensor along the depth direction. Thus, the H × W × d 3-D tensor becomes an H × W 2-D tensor, named the "aggregation map", i.e., $A = \sum_{n=1}^{d} S_n$ (where $S_n$ is the n-th feature map in pool5). For the aggregation map A, there are H × W summed activation responses, corresponding to the H × W positions. The higher the activation response of a particular position (i, j), the more likely its corresponding region is part of the object. As fine-grained image retrieval is an unsupervised problem, SCDA calculates the mean value ā of all the positions in A as the threshold to decide which positions localize objects: the position (i, j) whose activation

Fig. 5.3 Pipeline of the SCDA method [45]: (a) input image; (b) convolutional activation tensor; (c) mask map; (d) the largest connected component of the mask map; (e) selected descriptors; (f) descriptors aggregation (yielding an SCDA feature)


response is higher than ā indicates that the main object, e.g., a bird, dog or aircraft, might appear at that position. A mask map M of the same size as A can be obtained as
$M_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} > \bar{a} \\ 0 & \text{otherwise} \end{cases}$ ,    (5.3)
where (i, j) is a particular position among these H × W positions. However, after this step there may still be several small noisy parts activated on a complicated background. SCDA therefore employs Algorithm 1 to collect the largest connected component of M, which is denoted as $\widetilde{M}$, to get rid of the interference caused by noisy parts: the main object is kept by $\widetilde{M}$, while the noisy parts are discarded.

Algorithm 1 Finding connected components in binary images
Require: A binary image I;
1: Select one pixel p as the starting point;
2: while True do
3:   Use a flood-fill algorithm to label all the pixels in the connected component containing p;
4:   if All the pixels are labeled then
5:     Break;
6:   end if
7:   Search for the next unlabeled pixel as p;
8: end while
9: return Connectivity of the connected components, and their corresponding size (pixel numbers).
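The following is a minimal NumPy sketch of this step, in the spirit of Eq. (5.3) and Algorithm 1; it is an illustrative stand-in (function names and the toy aggregation map are not from SCDA's released code), using a breadth-first flood fill over the binary mask and keeping only the largest 4-connected component.

```python
import numpy as np
from collections import deque

def largest_connected_component(mask):
    """Keep only the largest 4-connected component of a binary H x W mask."""
    h, w = mask.shape
    labels = -np.ones((h, w), dtype=int)
    sizes = []
    for si in range(h):
        for sj in range(w):
            if mask[si, sj] == 1 and labels[si, sj] < 0:
                comp = len(sizes)                  # new component id
                labels[si, sj] = comp
                queue, count = deque([(si, sj)]), 0
                while queue:                       # flood fill from the seed pixel
                    i, j = queue.popleft()
                    count += 1
                    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                        if 0 <= ni < h and 0 <= nj < w and \
                           mask[ni, nj] == 1 and labels[ni, nj] < 0:
                            labels[ni, nj] = comp
                            queue.append((ni, nj))
                sizes.append(count)
    if not sizes:
        return np.zeros_like(mask)
    return (labels == int(np.argmax(sizes))).astype(mask.dtype)

A = np.random.rand(7, 7)                   # toy "aggregation map"
M = (A > A.mean()).astype(np.uint8)        # thresholding of Eq. (5.3)
M_tilde = largest_connected_component(M)   # noisy activations removed
```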

SCDA uses $\widetilde{M}$ to select useful and meaningful deep convolutional descriptors: the descriptor $x_{(i,j)}$ should be kept when $\widetilde{M}_{i,j} = 1$, while $\widetilde{M}_{i,j} = 0$ means the position (i, j) might correspond to background or noisy parts:
$F = \left\{ x_{(i,j)} \mid \widetilde{M}_{i,j} = 1 \right\}$ ,    (5.4)
where F stands for the selected descriptor set, which will be aggregated into the final representation for retrieving fine-grained images. The whole convolutional descriptor selection process is illustrated in Fig. 5.3b–e.
After the selection process, the selected descriptor set F is obtained. SCDA then adopts the "avg&maxPool" aggregation to form the "SCDA feature" representing the whole fine-grained image. It also incorporates another SCDA feature produced from the relu5_2 layer, which is three layers in front of pool5 in the VGG-16 model [36]. Following pool5, SCDA obtains the mask map $\widetilde{M}_{\mathrm{relu5\_2}}$ from relu5_2 and combines $\widetilde{M}_{\mathrm{pool}_5}$ and $\widetilde{M}_{\mathrm{relu5\_2}}$ to get the final mask map of relu5_2. $\widetilde{M}_{\mathrm{pool}_5}$ is first upsampled to the size of $\widetilde{M}_{\mathrm{relu5\_2}}$. SCDA keeps the descriptors whose positions in both $\widetilde{M}_{\mathrm{pool}_5}$ and


$\widetilde{M}_{\mathrm{relu5\_2}}$ are 1; these are the final selected relu5_2 descriptors. The aggregation process remains the same. Finally, SCDA concatenates the features of relu5_2 and pool5 into a single representation, denoted as "SCDA+":
$\mathrm{SCDA}^{+} \leftarrow \left[\mathrm{SCDA}_{\mathrm{pool}_5},\ \alpha \times \mathrm{SCDA}_{\mathrm{relu5\_2}}\right]$ ,    (5.5)
where α is the coefficient for $\mathrm{SCDA}_{\mathrm{relu5\_2}}$; it is set to 0.5 for fine-grained image retrieval. After that, $\ell_2$ normalization is applied to the concatenated feature.
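A compact sketch of the descriptor selection of Eq. (5.4) and the "avg&maxPool" aggregation with the concatenation of Eq. (5.5) is given below. It assumes the activation tensors and final mask maps are already available; the toy tensor shapes stand in for VGG-16 activations of a 224 × 224 input, and the code is a didactic sketch rather than the released SCDA implementation.

```python
import numpy as np

def scda_aggregate(X, mask):
    """Select descriptors x_(i,j) where mask_(i,j) == 1 (Eq. (5.4)) and aggregate
    them by concatenating average- and max-pooling ("avg&maxPool")."""
    H, W, d = X.shape
    selected = X[mask.astype(bool)]          # (#selected, d) descriptors
    if selected.size == 0:                   # degenerate case: keep all positions
        selected = X.reshape(-1, d)
    return np.concatenate([selected.mean(axis=0), selected.max(axis=0)])

def scda_plus(X_pool5, mask_pool5, X_relu52, mask_relu52, alpha=0.5):
    """SCDA+ of Eq. (5.5): concatenate the pool5 and down-weighted relu5_2
    SCDA features, then l2-normalize the result."""
    feat = np.concatenate([
        scda_aggregate(X_pool5, mask_pool5),
        alpha * scda_aggregate(X_relu52, mask_relu52),
    ])
    return feat / (np.linalg.norm(feat) + 1e-12)

X_p5, M_p5 = np.random.rand(7, 7, 512), np.random.rand(7, 7) > 0.5
X_r52, M_r52 = np.random.rand(14, 14, 512), np.random.rand(14, 14) > 0.5
print(scda_plus(X_p5, M_p5, X_r52, M_r52).shape)   # (2048,) = 2 x (512 avg + 512 max)
```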

5.2.2 Centralized Ranking Loss

Centralized Ranking Loss (CRL) [52] was proposed to solve the fine-grained object retrieval problem by developing a specific loss function. It achieves very efficient (e.g., a 1,000× training speedup compared to the triplet loss) and discriminative feature learning by a "centralized" global pooling. CRL also contains a weakly supervised attractive feature extraction component, which segments object contours with top-down saliency. Consequently, the contours are integrated into the CNN response map to precisely extract features "within" the target object. In addition, the combination of CRL and weakly supervised learning can reinforce each other. As shown in Fig. 5.4, the overall framework contains both offline and online phases. In the offline training stage, a pre-trained CNN is fine-tuned with the CRL method, while in the online retrieval stage the feature of the query is extracted from its attractive region. Concretely, let $x_i$ be the feature vector for image $I_i$ and let $A = \{a_k\}$, $k \in \{1, 2, \ldots, K\}$, be the set of center features of the K classes, where $a_k = \frac{1}{|P_k|}\sum_{x_i \in P_k} x_i$ and $|P_k|$ denotes the number of samples in $P_k$. Let $D_{i,j}$ be the distance between two features $x_i$ and $x_j$. The CRL

Fig. 5.4 Overall framework of the network trained by CRL [52]


defines the ranking through the class centers, aiming to minimize the intra-class distance as well as maximize the inter-class distance in an efficient manner. It is defined as
$L = \sum_{a_k \in A}\sum_{a_l \in A,\, l \neq k}\sum_{x_i \in P_k} \max\big(0,\ m + \|x_i - a_k\|_2 - \|x_i - a_l\|_2\big)$ ,    (5.6)

where m is a positive scalar that controls the margin. Given a centralized triplet, the sub-gradient is defined as
$\frac{\partial L}{\partial x_i} = \frac{x_i - a_k}{\|x_i - a_k\|_2} - \frac{x_i - a_l}{\|x_i - a_l\|_2}$ .    (5.7)
CRL forces the feature $x_i$ to approach its target class center and to move away from the centers of the other classes. The class mean vectors are computed within each batch, and the parameters are updated through the gradients of the positive and negative examples rather than through the class centers themselves. In addition, training with the CRL involves $O(NL^2)$ computations, where L denotes the number of classes. In practice, the class number L should be less than N/2 to generate triplets within a batch.
In the following, we introduce the weakly supervised feature extraction stage. It first coarsely localizes the object via SCDA [45] based salient object extraction, followed by a refinement module with Gaussian mixture models. Then, raw features are aggregated to form the final output of region-aware deep features. Concretely, for a given image I and a CNN model, the saliency map $M \in \mathbb{R}^{H \times W}$ is computed by following SCDA, which has been described in Sect. 5.2.1. The saliency map M is then resized to m × n using bilinear interpolation. According to the estimated coarse mask M, this stage labels a pixel as foreground if the mask value is 1, and as background otherwise. Then, two Gaussian Mixture Models (GMMs) are learned to model the foreground and background appearances, respectively, with each GMM containing K = 5 components. Given an image I, let $\theta_f$ be the foreground model, $\theta_b$ be the background model, and $y_p$ denote the label of pixel p with corresponding RGB value $v_p$. The objective function of the refinement can be formulated as
$\max_{Y,\theta} \sum_{p} E(y_p, \theta) + \sum_{p,q} E(y_p, y_q)$ .    (5.8)
Additionally,
$E(y_p, \theta) = (1 - y_p)\log\big(p(v_p; \theta_b)\big) + y_p \log\big(p(v_p; \theta_f)\big)$ ,    (5.9)
and Y is the set of saliency assignments across the image. $E(y_p, y_q)$ is a pairwise term between pixels p and q. The optimization is done by following [32]. Given the above object segmentation, this stage then re-extracts more discriminative features as


$f(f_{(i,j)}, \alpha) = \begin{cases} f_{i,j} & \text{if } |\delta_{(i,j)} \cap M| > \alpha |M| \\ 0 & \text{otherwise} \end{cases}$ ,    (5.10)
where M denotes the refined object mask and $\delta_{(i,j)}$ denotes the receptive field at the spatial location (i, j).
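To make the ranking objective of Eq. (5.6) concrete, here is a minimal NumPy sketch of the centralized ranking loss with batch-wise class centers. It is an illustrative stand-in (not the authors' released implementation): the function name, margin value, and toy batch are assumptions, and the loss is returned as the raw sum over centralized triplets, as in Eq. (5.6).

```python
import numpy as np

def centralized_ranking_loss(feats, labels, margin=1.0):
    """Centralized ranking loss (Eq. (5.6)): pull each feature towards its own
    class center a_k and push it away from the centers of all other classes.
    Class centers are computed inside the batch, as described in the text."""
    classes = np.unique(labels)
    centers = {c: feats[labels == c].mean(axis=0) for c in classes}
    loss = 0.0
    for c in classes:                         # a_k: center of the positive class
        for x in feats[labels == c]:          # x_i in P_k
            d_pos = np.linalg.norm(x - centers[c])
            for c_neg in classes:             # a_l: centers of the other classes
                if c_neg == c:
                    continue
                d_neg = np.linalg.norm(x - centers[c_neg])
                loss += max(0.0, margin + d_pos - d_neg)
    return loss

feats = np.random.randn(8, 128)               # toy batch of 8 embeddings
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(centralized_ranking_loss(feats, labels))
```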

5.2.3 Category-Specific Nuance Exploration Network

As shown in Fig. 5.5, the Category-specific Nuance Exploration Network (CNENet) elaborately discovers category-specific nuances that contribute to category prediction, and semantically aligns these nuances grouped by sub-category without any additional prior knowledge, to directly emphasize the discrepancy among sub-categories. Specifically, it designs a Nuance Modelling Module that adaptively predicts a group of category-specific response (CARE) maps by implicitly digging into category-specific nuances, specifying the locations and scales of those nuances. Upon this, two nuance regularizations are proposed: (1) a semantic discrete loss that forces each CARE map to attend to different spatial regions so as to capture diverse nuances; and (2) a semantic alignment loss that constructs a consistent semantic correspondence for the CARE maps of the same order within the same sub-category, by guaranteeing that each instance and its transformed counterpart are spatially aligned. Moreover, it proposes a Nuance Expansion Module, which exploits the context appearance information of the discovered nuances and refines the prediction of the current nuance using its similar neighbors, leading to further improvements in nuance consistency and completeness.

Fig. 5.5 Overall framework of CNENet [43]. The backbone (Conv Blocks 1–4) feeds the Nuance Modelling Module, in which a light-weight generator produces disjoint nuance maps and category-specific response maps constrained by the semantic discrete loss and the semantic alignment loss; the Nuance Expansion Module then refines the original category-specific response maps into refined category-specific response maps


For an input image I, its feature maps $X \in \mathbb{R}^{H \times W \times C}$ extracted by the convolution blocks serve as the input of the proposed Nuance Modelling Module (NMM), where H, W and C are the height, width, and channel dimension of the feature maps. In the following, we introduce the three sub-modules contained in NMM.
The first sub-module is category-specific response generation. NMM first splits the feature maps X into l category-specific response (CARE) maps $M = [M_1, M_2, \ldots, M_l] \in \mathbb{R}^{H \times W \times l}$. Concretely, these maps are generated by a light-weight generator G(·) followed by a normalization operation as
$\hat{M} = \mathrm{ReLU}(G(X))$ ,    (5.11)
where ReLU denotes the rectified linear unit activation function and G(·) is a convolutional operation with kernel size C × 1 × 1 × l. Then, $\hat{M}$ is passed through a min-max layer to normalize the nuanced response coefficients M, which forces M into [0, 1]:
$M = \frac{\hat{M} - \min(\hat{M})}{\max(\hat{M}) - \min(\hat{M}) + \epsilon}$ ,    (5.12)
where $\epsilon$ is a protection term to avoid division by zero.
The second sub-module is the semantic discrete loss ($\mathcal{L}_{SD}$). It is introduced to make the l CARE maps in M as discrepant from each other as possible. This is equivalent to minimizing the similarity among the CARE maps:

$\mathcal{L}_{SD} = \frac{2}{l(l-1)} \sum_{1 \le k < k' \le l} S(M_k, M_{k'})$ .    (5.13)
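The following NumPy sketch illustrates the CARE-map generation of Eqs. (5.11)–(5.12) and the semantic discrete loss of Eq. (5.13). It is a simplified stand-in: the 1×1 generator is a plain matrix multiplication, and the similarity S(·,·) is assumed here to be the cosine similarity between flattened maps, which is not spelled out in the text above.

```python
import numpy as np

def care_maps(X, W_g, eps=1e-6):
    """Eqs. (5.11)-(5.12): generate l CARE maps from backbone features X (H x W x C)
    with a 1x1 "light-weight generator" W_g (C x l), apply ReLU, then min-max
    normalize the responses into [0, 1]."""
    M_hat = np.maximum(X @ W_g, 0.0)                 # ReLU(G(X)), shape H x W x l
    return (M_hat - M_hat.min()) / (M_hat.max() - M_hat.min() + eps)

def semantic_discrete_loss(M):
    """Eq. (5.13): average pairwise similarity among the l CARE maps, with S(.,.)
    taken to be cosine similarity of the flattened maps (an assumption)."""
    H, W, l = M.shape
    flat = M.reshape(H * W, l).T                     # l x (H*W) flattened maps
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    total = sum(float(flat[k] @ flat[k2])
                for k in range(l) for k2 in range(k + 1, l))
    return 2.0 / (l * (l - 1)) * total

X = np.random.rand(14, 14, 256)                      # toy backbone feature maps
W_g = np.random.randn(256, 4)                        # generator weights for l = 4
M = care_maps(X, W_g)
print(M.shape, round(semantic_discrete_loss(M), 3))  # (14, 14, 4) and a scalar loss
```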

If γ > 0 is a small enough value, then $f(x_0) \ge f(x_1)$. With this in mind, we can start from an initial estimate $x_0$ of the local minimum of the function f and consider the sequence $x_0, x_1, x_2, \ldots$ such that
$x_{n+1} = x_n - \gamma_n \nabla f(x_n)$ , $n \ge 0$ .    (B.2)
Therefore, we can get
$f(x_0) \ge f(x_1) \ge f(x_2) \ge \cdots$ .    (B.3)


Fig. B.1 Illustration of the gradient descent method: (a) the initial point $x_0$ of gradient descent; (b) the iterative process of gradient descent ($x_0 \to x_1 \to x_2 \to x_3 \to \cdots \to x^*$)

Ideally, the sequence $(f(x_n))$ will converge to the extreme point we expect. Note that the step size γ can be changed at each iteration throughout the convergence process. Each step of moving along the direction of gradient descent is shown in Fig. B.1b. The red arrow points in the opposite direction of the gradient at that point (the direction of the gradient at a point is perpendicular to the contour line passing through that point). Moving along the direction of gradient descent, the iterates eventually reach the "center", i.e., the point $x^*$ where the function f attains its minimum value. Note that when the gradient descent method is used to find the optimal solution of a machine learning objective function, each iteration requires calculating the gradient over all training samples. If the training set is very large, as is typical in deep learning where the training data can comprise tens of thousands or even millions of samples, the efficiency of this method will be very low. At the same time, due to the limitation of hardware resources (e.g., GPU memory), this approach is basically unrealistic in practical applications. Therefore, the stochastic gradient descent method is often used in deep learning in place of the classic gradient descent method to update parameters and train models.


The stochastic gradient descent method (SGD) iteratively updates the model parameters using one sample at a time, so that a good solution may already be reached after only a few hundred or a few thousand updates. Compared with the gradient descent method described above, which requires all samples at once, SGD is naturally more efficient. However, because SGD only considers one sample per optimization step, each update direction is not necessarily a direction that improves the model globally. If the sample noise is large, a model trained with plain SGD can easily fall into a poor local optimum and converge to an unsatisfactory state. Therefore, in deep learning it is still necessary to traverse all the training samples, and each complete traversal of the training set is termed an "epoch". In deep learning, SGD is slightly modified: each time a "batch" of samples is selected, and the gradient information on this batch of samples is used to complete one model update. The stochastic gradient descent method based on batches of data is therefore termed "batch SGD" (aka mini-batch SGD). In fact, batched SGD is a compromise between standard gradient descent and stochastic gradient descent. Since, e.g., 64 or 128 training samples are used as a "mini-batch", more robust gradient information can be obtained from a batch of samples than from a single sample, and thus batch SGD is more stable than traditional SGD. At present, the training of deep neural networks, e.g., convolutional neural networks and recurrent neural networks, almost always uses batch stochastic gradient descent (mini-batch SGD).
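As a concrete illustration, the following is a minimal NumPy sketch of mini-batch SGD on a toy least-squares objective; the data, model, and hyper-parameters are arbitrary stand-ins chosen only to show the epoch/batch structure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # toy training set
w_true = np.array([1.0, -2.0, 3.0, 0.5, -1.0])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(5)                                 # model parameters
lr, batch_size, epochs = 0.1, 64, 20

for epoch in range(epochs):                     # one epoch = one full traversal
    order = rng.permutation(len(X))             # sample batches without replacement
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)   # gradient of the l2 loss on the batch
        w -= lr * grad                          # gradient descent step (cf. Eq. (B.2))

print(np.round(w, 2))                           # close to w_true
```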

C Chain Rule

The chain rule is a differentiation rule in calculus, used to obtain the derivative of a composite function, and is a common tool in calculus derivations. Historically, the first use of the chain rule is attributed to the German philosopher, logician, mathematician, and scientist Gottfried Wilhelm Leibniz, in computing the derivative of $\sqrt{a + bz + cz^2}$, i.e., the composite of the square root function and $a + bz + cz^2$. Since only first-order derivatives are usually involved in training deep learning models, this appendix merely discusses the case of the first-order derivative of a one-variable or multivariate function. The derivative is defined as
$f'(x) = \frac{\mathrm{d}f}{\mathrm{d}x} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ .    (C.1)

Suppose there is a function F(x) = f(g(x)), where f(·) and g(·) are functions and x is a real number, such that f(·) is differentiable at g(x) and g(·) is differentiable at x. Then $F'(x) = f'(g(x)) \cdot g'(x)$, i.e., $\frac{\partial F}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$. The mathematical proof is as follows.
Proof (The chain rule) According to the definition of differentiability, we have
$g(x + \delta) - g(x) = \delta g'(x) + \epsilon(\delta)\delta$ ,    (C.2)
where $\epsilon(\delta)$ is the remainder term and $\epsilon(\delta) \to 0$ as $\delta \to 0$. Similarly,
$f(g(x) + \alpha) - f(g(x)) = \alpha f'(g(x)) + \eta(\alpha)\alpha$ ,    (C.3)
where $\eta(\alpha) \to 0$ as $\alpha \to 0$.


Now, for F(x),
$F(x + \delta) - F(x) = f(g(x + \delta)) - f(g(x))$    (C.4)
$= f\big(g(x) + \delta g'(x) + \epsilon(\delta)\delta\big) - f(g(x))$    (C.5)
$= \alpha_\delta f'(g(x)) + \eta(\alpha_\delta)\alpha_\delta$ ,    (C.6)
where $\alpha_\delta = \delta g'(x) + \epsilon(\delta)\delta$. Note that as $\delta \to 0$, $\frac{\alpha_\delta}{\delta} \to g'(x)$ and $\alpha_\delta \to 0$, so $\eta(\alpha_\delta) \to 0$. We have
$\frac{f(g(x + \delta)) - f(g(x))}{\delta} \to f'(g(x)) \cdot g'(x)$ .    (C.7)

For example, if $F(x) = (a + bx)^2$, then according to the chain rule, with $f(t) = t^2$ and $g(x) = a + bx$, the derivative of F(·) with respect to x is $\frac{\partial F}{\partial x} = \frac{\partial f}{\partial t} \cdot \frac{\partial t}{\partial x} = 2t \cdot b = 2g(x) \cdot b = 2b^2 x + 2ab$.
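A quick numerical check of this worked example compares the analytic derivative $2b^2 x + 2ab$ with a finite-difference approximation of Eq. (C.1); the particular values of a, b and x below are arbitrary.

```python
def F(x, a=3.0, b=2.0):
    return (a + b * x) ** 2          # F(x) = f(g(x)) with f(t) = t^2, g(x) = a + b x

def dF_analytic(x, a=3.0, b=2.0):
    return 2 * b**2 * x + 2 * a * b  # chain rule: 2 g(x) * b = 2 b^2 x + 2 a b

x, h = 1.5, 1e-6
finite_diff = (F(x + h) - F(x)) / h  # Eq. (C.1) with a small h
print(dF_analytic(x), round(finite_diff, 4))   # both approximately 24.0
```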

D Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a special type of artificial neural network. Different from other neural network models (e.g., recurrent neural networks, Boltzmann machines, etc.), their main characteristic is the convolutional operation. CNNs perform well in many fields, especially in image-related tasks of computer vision, such as image classification, image semantic segmentation, image retrieval, object detection, and so on. In addition, as CNN research has deepened, tasks such as text classification in natural language processing and software defect prediction in software mining have also been tackled with convolutional neural networks, achieving better performance than traditional methods and even other deep network models. This appendix first reviews the development history of convolutional neural networks, then introduces the basic structure of CNNs at a high level, as well as the two basic processes in CNNs, i.e., feed-forward operations (prediction and inference) and feed-back operations (training and learning). Finally, we present the basic operations in CNNs, e.g., convolution layers, pooling layers, activation functions, fully connected layers, etc.

D.1 Development History

The first milestone event in the history of the development of convolutional neural networks occurred in neuroscience around the 1960s. In 1959, the neuroscientists David H. Hubel and Torsten Wiesel proposed the concept of the "receptive field" of single neurons in the primary visual cortex of cats. Then, in 1962, they found that the receptive field, binocular


Fig. D.1 Torsten Wiesel (on the left) and David H. Hubel (on the right). The two were awarded the Nobel Prize in Physiology or Medicine in 1981 for their outstanding contributions to information processing in the visual system
Fig. D.2 The neocognitron model proposed in 1980 by Kunihiko Fukushima [1]

vision, and other functional structures existed in the cat's visual center, marking the first time that a neural network-like structure was discovered in the brain's visual system (Fig. D.1).1
Around 1980, the Japanese scientist Kunihiko Fukushima, building on the work of Hubel and Wiesel, simulated the biological visual system and proposed a hierarchical multi-layer artificial neural network, called the "neocognitron" [1], to handle handwritten character recognition and other pattern recognition tasks. The neocognitron was later considered to be the predecessor of today's convolutional neural networks. In Fukushima's neocognitron, the two most important constituent units are "S-cells" and "C-cells", and the two types of cells are stacked alternately to form the network. Among them, S-type cells are used to extract local features, and C-type cells are used for abstraction and fault tolerance, as shown in Fig. D.2. It is not difficult to see that they correspond to the convolution and pooling operations in today's convolutional neural networks.
Subsequently, LeCun et al. [6] proposed a convolutional neural network algorithm based on gradient learning in 1998 and successfully used it for handwritten digit recognition. Under the technical conditions at that time, an error rate of less than 1% could be achieved. Therefore, LeNet, a convolutional neural network, was then used in almost all postal systems in the United States to recognize handwritten postal codes and sort mail and packages, cf. Fig. D.3. It can be said that LeNet is the first convolutional neural
1 Related video materials can be found via: Hubel and Wiesel & the Neural Basis of Visual Perception (http://knowingneurons.com/2014/10/29/hubel-and-wiesel-the-neural-basis-of-visualperception/).

Fig. D.3 Architecture of LeNet-5 [6]: a convolutional neural network for character recognition (INPUT 32×32 → C1: feature maps 6@28×28 → S2: f. maps 6@14×14 → C3: f. maps 16@10×10 → S4: f. maps 16@5×5 → C5: layer 120 → F6: layer 84 → OUTPUT 10, alternating convolutions and subsampling, ending with full and Gaussian connections). Each "rectangle" represents a feature map, and finally there are two fully connected layers

network that produced actual commercial value, and it also laid a solid foundation for the future development of convolutional neural networks. In view of this, when Google proposed GoogLeNet [11] in 2015, it deliberately capitalized the "L" to pay tribute to its "predecessor" LeNet.
Then came 2012. In that year's edition of the ImageNet image classification competition, known as the "World Cup" of the computer vision field, Geoffrey E. Hinton and two of his Ph.D. students defeated competitors including the University of Tokyo and the VGG group of the University of Oxford with the convolutional neural network Alex-Net, winning the championship with an accuracy nearly 12% higher than that of the second place [5]. The researchers of the field were stunned, and the result caused an uproar; it opened the prelude to the gradual dominance of convolutional neural networks in the field of computer vision.2 Since then, the champion of the ImageNet competition every year has been a deep convolutional neural network. In 2015, after improving the activation function, the error rate of convolutional neural networks on the ImageNet dataset (4.94%) surpassed the human error rate (5.1%) [2] for the first time. In recent years, with the growing number of researchers working on neural networks (especially convolutional neural networks) and the rapid development of the technology, convolutional neural networks have become wider, deeper, and more complex: from the initial 5 or 16 layers to, e.g., the 152-layer Residual Net proposed by MSRA [3], and even thousand-layer networks have become commonplace for researchers and engineering practitioners. Interestingly, Fig. D.4a shows the Alex-Net network structure, and it can be seen that it is almost the same as LeNet from more than a decade earlier in terms of basic structure. However, in the intervening decades, data and hardware devices (especially GPUs) have developed rapidly, and they are actually the main engine that further promotes innovation in the field of neural networks. It is precisely this that has turned deep neural networks from a "party trick" into a practical and feasible tool and application
2 Some people call 2012, when Alex-Net was born, the first year of deep learning in the field of computer vision. At the same time, some people regard 2006, when Hinton proposed Deep Belief Networks (DBN) [4], as the first year of deep learning in the field of machine learning.


Fig. D.4 (a) An illustration of the architecture of Alex-Net [5]; (b) Geoffrey E. Hinton. It is worth mentioning that Hinton won the 2016 James Clerk Maxwell Award, jointly issued by the Institute of Electrical and Electronics Engineers (IEEE) and the Royal Society of Edinburgh, in recognition of his outstanding research contributions to deep learning

method. Since deep convolutional neural networks became popular in 2012, they have become an important research topic in the field of artificial intelligence. It can even be said that deep learning is the dominant research technology in fields such as computer vision and natural language processing. At the same time, it is the key technology that major companies and start-ups in the industry are striving to develop in order to take the lead.

D.2 Basic Structure

In general, a convolutional neural network is a hierarchical model whose input is raw data, such as RGB images or raw audio. The convolutional neural network extracts high-level semantic information from the raw data input layer through a series of operations such as convolution, pooling, and nonlinear activation function mappings, abstracting the data layer by layer. This process is the "feed-forward" computation. Among them, different types of operations are generally called "layers" in convolutional neural networks: convolutional operations correspond to "convolution layers", pooling


operations correspond to "pooling layers", and so on. Finally, the last layer of a convolutional neural network formalizes its target task (e.g., classification, regression, etc.) as an objective function.3 By calculating the error or loss between the predicted value and the ground-truth value, the back-propagation algorithm [8] propagates the error or loss backward from the last layer, layer by layer, updating the parameters of each layer; the network is then fed forward again with the updated parameters, and this process is repeated until the network converges, thereby achieving the purpose of model training.
More informally, a convolutional neural network is like a process of building blocks: convolution and other operation layers are used as "basic units" that are stacked layer by layer on top of the raw data, and the calculation of the loss function marks the end of the process. The data form of each layer is a three-dimensional tensor. Specifically, in computer vision applications, the data layer of a convolutional neural network is usually an image in the RGB color space: H rows, W columns, and 3 channels (R, G, B), denoted as $x^1$. After the first layer of operations, $x^1$ yields $x^2$, with the corresponding parameters of the first layer denoted as $\omega^1$; $x^2$ is then used as the input of the second operation layer with parameters $\omega^2$ to obtain $x^3$, and so on, until the (L − 1)-th layer, whose output is $x^L$. In this process, each operation layer can in theory be a standalone convolutional operation, pooling operation, nonlinear mapping or other operation/transformation, and of course it can also be a combination of different operations/transformations:
$x^1 \to \omega^1 \to x^2 \to \cdots \to x^{L-1} \to \omega^{L-1} \to x^L \to \omega^L \to z$ .
Finally, the entire network ends with the calculation of the loss function. If y is the ground truth corresponding to the input $x^1$, the loss function is expressed as
$z = \mathcal{L}(x^L, y)$ ,    (D.1)

where the parameters in the function $\mathcal{L}(\cdot)$ are $\omega^L$. In fact, for specific operation layers, the parameters $\omega^i$ can be empty, e.g., for pooling operations, parameter-free nonlinear mappings, and parameter-free loss functions. In practical applications, the form of the loss function varies with the task. Taking the regression problem as an example, the commonly used $\ell_2$ loss can serve as the objective function of the convolutional network; in this case, $z = \mathcal{L}_{\mathrm{regression}}(x^L, y) = \frac{1}{2}\|x^L - y\|^2$. For classification problems, the objective function of the network often uses the cross-entropy loss, $z = \mathcal{L}_{\mathrm{classification}}(x^L, y) = -\sum_i y_i \log(p_i)$, where $p_i = \frac{\exp(x^L_i)}{\sum_{j=1}^{C}\exp(x^L_j)}$ ($i = 1, 2, \ldots, C$) and C is the number of categories of the classification task. Obviously, regardless of whether it is a regression or a classification problem, before computing z it is necessary to obtain, through appropriate operations, an $x^L$ of the same dimension as y, in order to correctly compute the loss/error value for the sample predictions.
3 The objective function is sometimes called the cost function or the loss function.
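As a small concrete illustration of the two objective functions above, the following NumPy sketch computes the ℓ2 regression loss and the softmax cross-entropy loss for a toy network output; the values are arbitrary stand-ins.

```python
import numpy as np

def l2_regression_loss(x_L, y):
    """z = 1/2 * ||x_L - y||^2, the regression objective given in the text."""
    diff = np.asarray(x_L, dtype=float) - np.asarray(y, dtype=float)
    return 0.5 * float(diff @ diff)

def cross_entropy_loss(x_L, y_onehot):
    """Softmax over the C-dimensional output x_L, then z = -sum_i y_i * log(p_i)."""
    logits = np.asarray(x_L, dtype=float)
    logits -= logits.max()                       # for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -float(np.sum(y_onehot * np.log(p + 1e-12)))

x_L = np.array([2.0, 0.5, -1.0])                 # toy network output (C = 3)
print(l2_regression_loss(x_L, y=[1.5, 0.0, -1.0]))
print(cross_entropy_loss(x_L, y_onehot=np.array([1, 0, 0])))
```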

D.3 Feed-Forward Operations

The feed-forward operation of a convolutional neural network is relatively intuitive, regardless of whether it is used to compute the error when training the model or to obtain sample predictions after the model has been trained. Take the image classification task as an example and assume that the network has been trained, so that the parameters $\omega^1, \ldots, \omega^{L-1}$ have converged to a good solution. This network can then be used for image category prediction. The prediction process is simply a feed-forward pass of the network: a test image is fed to the network as the input $x^1$; $x^2$ is obtained through the first-layer operation with parameters $\omega^1$, and so on, until the output $x^L \in \mathbb{R}^C$. As mentioned in the previous section, $x^L$ is a vector of the same dimension as the ground-truth label vector. In a network trained with the cross-entropy loss function, each dimension of $x^L$ represents the posterior probability that $x^1$ belongs to the corresponding one of the C categories. In this way, the predicted label of the input image $x^1$ can be obtained by
$\arg\max_i\, x^L_i$ .    (D.2)
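The toy sketch below illustrates this feed-forward prediction of Eq. (D.2); the "trained" layers are stood in for by fixed random linear-plus-ReLU maps, which is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for converged layer parameters omega^1, ..., omega^{L-1}.
W1, W2 = rng.normal(size=(100, 32)), rng.normal(size=(32, 5))

def feed_forward(x1):
    """x^1 -> omega^1 -> x^2 -> omega^2 -> x^L, then predict with arg max (Eq. (D.2))."""
    x2 = np.maximum(x1 @ W1, 0.0)     # linear map followed by ReLU
    x_L = x2 @ W2                     # C = 5 scores, one per category
    return int(np.argmax(x_L))        # predicted label index

x1 = rng.normal(size=100)             # a toy "test image", already flattened
print(feed_forward(x1))
```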

D.4 Feed-Back Operations

Like many other machine learning models (e.g., support vector machines), convolutional neural networks and all other deep learning models rely on minimizing the loss function to learn the model parameters, i.e., minimizing z in Eq. (D.1). However, it should be pointed out that, from the perspective of convex optimization theory, the neural network objective is not only non-convex but also extremely complex, which makes the optimization difficult. In this situation, deep learning models use stochastic gradient descent (SGD) and error back-propagation to update the model parameters. For details about the stochastic gradient descent method, please refer to Appendix B. Specifically, when training convolutional neural networks, especially for large-scale applications (such as the ILSVRC classification or detection tasks), mini-batch SGD is commonly used. Mini-batch SGD randomly selects n samples as a batch in the training stage. First, the predictions are obtained through the feed-forward operation and the error is calculated; then the parameters are updated via gradient descent, with the gradient fed back layer by layer from the last layer to the front until the parameters of the first layer are updated. Such a parameter update process corresponds to one "mini-batch". Across different batches, all training set samples are traversed by sampling without replacement, and traversing the whole training set once is called one "epoch". The batch size should not be set too small: if it is too small (e.g., a batch size of 1 or 2), then, since the sample selection is random, updating the model parameters according to the error on these samples might not move towards the global optimum (it is only a locally optimal update), which will cause the training process to oscillate. The upper limit of the


batch size mainly depends on the limitation of hardware resources, such as the size of GPU memory. Generally speaking, the batch size is set to 32, 64, 128 or 256. Of course, when updating parameters with stochastic gradient descent, different parameter update strategies exist.
Let us look at the detailed process of error back-propagation. Following the notation of Sect. D.2, suppose that the error obtained on a batch of n samples after the feed-forward pass is z and that the last layer L is the $\ell_2$ loss. Then
$\frac{\partial z}{\partial \omega^L} = 0$ ,    (D.3)
$\frac{\partial z}{\partial x^L} = x^L - y$ .    (D.4)

It is not difficult to see that each layer operation corresponds to two derivatives: one is the derivative $\frac{\partial z}{\partial \omega^i}$ of the error with respect to the parameters of layer i, and the other is the derivative $\frac{\partial z}{\partial x^i}$ of the error with respect to the input of that layer. Among them,
• The derivative with respect to the parameters $\omega^i$: $\frac{\partial z}{\partial \omega^i}$ is used to update the parameters of this layer,
$\omega^i \leftarrow \omega^i - \eta \frac{\partial z}{\partial \omega^i}$ ,    (D.5)

where η is the step size (learning rate) of each stochastic gradient descent update, which generally decreases as the number of training rounds (epochs) increases.
• The derivative with respect to the input $x^i$: $\frac{\partial z}{\partial x^i}$ is used to propagate the error backward to the previous layer. It can be regarded as the error signal passed from the final loss back to the i-th layer.
Let us take the parameter update of layer i as an example. When the error signal (derivative) is back-propagated to the i-th layer, the values of $\frac{\partial z}{\partial \omega^i}$ and $\frac{\partial z}{\partial x^i}$ need to be computed in order to update the parameters of layer i. According to the chain rule (see Appendix C), we have
$\frac{\partial z}{\partial (\mathrm{vec}(\omega^i)^\top)} = \frac{\partial z}{\partial (\mathrm{vec}(x^{i+1})^\top)} \cdot \frac{\partial\, \mathrm{vec}(x^{i+1})}{\partial (\mathrm{vec}(\omega^i)^\top)}$ ,    (D.6)
$\frac{\partial z}{\partial (\mathrm{vec}(x^i)^\top)} = \frac{\partial z}{\partial (\mathrm{vec}(x^{i+1})^\top)} \cdot \frac{\partial\, \mathrm{vec}(x^{i+1})}{\partial (\mathrm{vec}(x^i)^\top)}$ .    (D.7)

The vectorization operator "vec(·)" is used here because tensor operations are converted into vector operations in actual implementations. For vector operations and derivations, please refer to Appendix A. As mentioned earlier, since $\frac{\partial z}{\partial x^{i+1}}$ has already been computed at layer i + 1, when updating the parameters of layer i it only needs to be vectorized and transposed to obtain $\frac{\partial z}{\partial (\mathrm{vec}(x^{i+1})^\top)}$, i.e., the first term on the right-hand side of the equal sign in


Algorithm D.1 Back-propagation algorithm
Require: Training set (N training samples and their labels) $(x^1_n, y_n)$, $n = 1, \ldots, N$; number of epochs T;
Ensure: $\omega^i$, $i = 1, \ldots, L$;
1: for t = 1 ... T do
2:   while the training set has not been fully traversed do
3:     Run the feed-forward operation to obtain each layer's $x^i$, and calculate the final error z;
4:     for i = L ... 1 do
5:       (a) Use Eq. (D.6) to compute the derivative of the error with respect to the parameters of layer i: $\frac{\partial z}{\partial (\mathrm{vec}(\omega^i)^\top)}$;
6:       (b) Use Eq. (D.7) to compute the derivative of the error with respect to the input of layer i: $\frac{\partial z}{\partial (\mathrm{vec}(x^i)^\top)}$;
7:       (c) Update the parameters with Eq. (D.5): $\omega^i \leftarrow \omega^i - \eta \frac{\partial z}{\partial \omega^i}$;
8:     end for
9:   end while
10: end for
11: return $\omega^i$.

Eqs. (D.6) and (D.7). On the other hand, at layer i, x i+1 is directly affected by x i through ∂ vec(x i+1 ) ωi , so When deriving in reverse, you can also directly get its partial derivative ∂(vec(x i ) ) and ∂ vec(x i+1 ) . ∂(vec(ωi ) )

∂z ∂z In this way, the term ∂ω i and ∂ x i at the left end of the equal sign in Eqs. (D.6) and (D.7) can be obtained. Then, update the parameters of this layer according to Eq. (D.5), and pass ∂∂zx i as the error of this layer to the previous layer, i.e., i − 1th layer, and so on. Until it is updated to the first layer, thus completing a batch (mini-batch) parameter update. The model training based on the above back-propagation algorithm is shown in Algorithm D.1. Of course, the above method is to manually write the derivative and use the chain rule to calculate the gradient of the final error to the different parameters of each layer, and then it still needs to be implemented through code. It can be seen that this process is not only cumbersome, but also error-prone, especially for some complex operations, its derivatives are difficult to obtain or even cannot be written explicitly. In response to this situation, some deep learning libraries, such as Theano4 and TensorFlow,5 use the method of symbolic differentiation for automatic derivation to train the model. Symbolic differentiation can calculate the mathematical representation of derivatives at compile time, and further optimize using symbolic calculation methods. In practical applications, users only need to focus on model building and forward code writing, without worrying about the complicated gradient derivation process. However, what needs to be pointed out here is that readers should understand the aforementioned reverse gradient propagation process and be able to obtain the correct derivative form.

4 https://github.com/Theano/Theano. 5 https://github.com/tensorflow/tensorflow.

Appendix D: Convolutional Neural Networks

D.5

195

Basic Operations in CNNs

As shown in Fig. D.5, for CNNs, the input data is the original sample form without any artificial processing, followed by many operation layers stacked on the input layer. These operation layers as a whole can be regarded as a complex function f CNN , the final loss function is composed of data loss and model parameter regularization loss. The training of the deep model is driven by the final loss to update the parameters of the model and back-propagate the error to each layer of the network. The training process of the model can be simply abstracted as a direct “fitting” from the original data to the final target, and these components in the middle are just playing the role of mapping the original data to features (i.e., feature learning) and then mapping them into sample labels (i.e., target tasks). Let’s take a look at the basic components that make up f CNN . l l l In this section, the three-dimensional tensor x l ∈ R H ×W ×D represents the input of the   l layer of the convolutional neural network. We use the triple i l , j l , d l to indicate that the tensor corresponds to row i l , column j l and the element at the position of d l channel, where 0 ≤ i l < H l , 0 ≤ j l < W l , 0 ≤ d l < Dl , as shown in Fig. D.6. However, in general engineering practice, due to the adoption of the mini-batch training strategy, the input of

Fig. D.5 Basic flowchart of convolutional neural networks

Fig. D.6 Schematic illustration of the input x^l of the lth layer of a convolutional neural network



However, in general engineering practice, because of the mini-batch training strategy, the input of a layer of the network is usually a four-dimensional tensor, i.e., x^l ∈ R^{H^l × W^l × D^l × N}, where N is the number of samples in each mini-batch. Taking N = 1 as an example, x^l is processed by the operations of layer l to produce x^{l+1}. For convenience of writing in what follows, the output corresponding to layer l is abbreviated as y, i.e., y = x^{l+1} ∈ R^{H^{l+1} × W^{l+1} × D^{l+1}}.
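For readers who prefer to see this notation in code, the tiny NumPy sketch below (the concrete sizes H, W, D and N are made up for illustration and are not from the text) shows how x^l and its mini-batch version are laid out and indexed.

import numpy as np

H, W, D, N = 6, 8, 3, 4            # hypothetical sizes H^l, W^l, D^l and batch size N
x_l = np.random.rand(H, W, D)      # input of layer l for a single sample (N = 1)
i, j, d = 2, 5, 1                  # a triple (i^l, j^l, d^l) indexing one element
value = x_l[i, j, d]               # the element at row i, column j, channel d

x_batch = np.random.rand(H, W, D, N)   # mini-batch input: a four-dimensional tensor
sample0 = x_batch[..., 0]              # the H x W x D slice belonging to the first sample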

D.5.1 Convolution Layers

The convolution layer is the basic operation in a convolutional neural network; even the fully connected layers that act as the classifier at the end of the network are replaced by convolutional operations in engineering implementations. The convolutional operation is in fact an operation from analytical mathematics, and in convolutional neural networks it usually involves only discrete convolution. The following takes the case of d^l = 1 as an example to introduce the convolutional operation in the two-dimensional setting.

Assume that the input image (input data) is the 5 × 5 matrix shown on the right side of Fig. D.7, and that its corresponding convolution kernel (also known as the convolution parameters or convolution filter) is a 3 × 3 matrix. Assume further that the kernel moves by one pixel each time a convolution is performed, i.e., the convolution stride is 1. The first convolutional operation starts at pixel (0, 0) of the image: the parameters in the convolution kernel are multiplied element by element with the image pixels at the corresponding positions and summed to give the result of one convolutional operation, i.e., 1 × 1 + 2 × 0 + 3 × 1 + 6 × 0 + 7 × 1 + 8 × 0 + 9 × 1 + 8 × 0 + 7 × 1 = 1 + 3 + 7 + 9 + 7 = 27, as shown in Fig. D.8a. Similarly, with stride 1, as shown in Fig. D.8b–d, the convolution kernel slides over the input image from left to right and top to bottom according to the stride, finally producing a 3 × 3 convolution feature map, and this result is used as the input of the next layer.

Fig. D.7 Convolution kernel and input data in a 2D scene. The left side of the figure shows the 3 × 3 convolution kernel, and the right side shows the 5 × 5 input data



Fig. D.8 Example of convolutional operations
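To verify the arithmetic above, the short NumPy sketch below implements stride-1 two-dimensional convolution as the element-wise multiply-and-sum described in the text. Only the top-left 3 × 3 block of the 5 × 5 input can be read off the worked example, so the sketch reproduces just the first output value, 27; the function itself works for any input size.

import numpy as np

def conv2d(image, kernel, stride=1):
    # Naive 2-D convolution: element-wise multiply-and-sum at each window position.
    H_in, W_in = image.shape
    H_k, W_k = kernel.shape
    H_out = (H_in - H_k) // stride + 1
    W_out = (W_in - W_k) // stride + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = image[i * stride:i * stride + H_k, j * stride:j * stride + W_k]
            out[i, j] = np.sum(patch * kernel)
    return out

# Kernel and top-left 3 x 3 block of the input, inferred from the worked computation.
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
patch = np.array([[1, 2, 3],
                  [6, 7, 8],
                  [9, 8, 7]])
print(conv2d(patch, kernel))  # [[27.]] -- matches the first result in the text

For the full 5 × 5 input of Fig. D.7, the same function returns the 3 × 3 feature map described above.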

Similarly, in the three-dimensional case, suppose the input tensor of convolution layer l is x^l ∈ R^{H^l × W^l × D^l} and the convolution kernel of this layer is f^l ∈ R^{H × W × D^l}. For three-dimensional inputs, the convolutional operation simply extends the two-dimensional convolution to all channels at the corresponding position (i.e., over D^l) and sums all H · W · D^l elements processed by one convolution to give the convolution result at that position, as shown in Fig. D.9. Further, if there are D convolution kernels like f^l, a convolution output of dimension 1 × 1 × 1 × D is obtained at the same position, and D is the number of channels D^{l+1} of the feature x^{l+1} of layer l + 1. Formally, the convolutional operation can be expressed as

y_{i^{l+1}, j^{l+1}, d} = \sum_{i=0}^{H} \sum_{j=0}^{W} \sum_{d^l=0}^{D^l} f_{i, j, d^l, d} × x^l_{i^{l+1}+i, j^{l+1}+j, d^l} .    (D.8)

Fig. D.9 Convolution kernel and input data in a 3D scene. The left side of the figure shows a convolution kernel of size 3 × 4 × 3, and the right side shows the 1 × 1 × 1 output obtained after the convolutional operation at this position
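A direct, unoptimized transcription of Eq. (D.8) into NumPy is sketched below (the tensor sizes are made up for illustration); it loops over every output position and every one of the D kernels, summing the H · W · D^l products for each position.

import numpy as np

def conv_layer(x, f):
    # x has shape (H_l, W_l, D_l); f has shape (H, W, D_l, D) for D kernels.
    # Output has shape (H_l - H + 1, W_l - W + 1, D), as in Eqs. (D.9) and (D.10).
    H_l, W_l, D_l = x.shape
    H, W, _, D = f.shape
    H_out, W_out = H_l - H + 1, W_l - W + 1
    y = np.zeros((H_out, W_out, D))
    for i_out in range(H_out):
        for j_out in range(W_out):
            for d in range(D):
                # Sum of H * W * D_l products, exactly as in Eq. (D.8).
                y[i_out, j_out, d] = np.sum(
                    f[:, :, :, d] * x[i_out:i_out + H, j_out:j_out + W, :])
    return y

x = np.random.rand(5, 5, 3)      # hypothetical input, H^l = W^l = 5, D^l = 3
f = np.random.rand(3, 3, 3, 4)   # four 3 x 3 x 3 kernels, so D^{l+1} = 4
print(conv_layer(x, f).shape)    # (3, 3, 4)

In practice this triple loop is replaced by highly optimized im2col- or FFT-based routines, but the result is the same.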



In this equation, (i^{l+1}, j^{l+1}) are the position coordinates of the convolution result, which satisfy the following constraints:

0 ≤ i^{l+1} < H^l − H + 1 = H^{l+1},    (D.9)

0 ≤ j^{l+1} < W^l − W + 1 = W^{l+1}.    (D.10)
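Put differently, a stride-1 convolution with no padding shrinks each spatial dimension by the kernel extent minus one. A one-line helper makes the bookkeeping explicit (plain Python; the stride-2 case shown in the second call is a common generalization, not something stated in Eqs. (D.9) and (D.10)).

def conv_output_size(in_size, kernel_size, stride=1):
    # Stride 1 reproduces Eqs. (D.9)/(D.10): out = in - kernel + 1.
    return (in_size - kernel_size) // stride + 1

print(conv_output_size(5, 3))              # 3, the 3 x 3 feature map of Fig. D.8
print(conv_output_size(224, 7, stride=2))  # 109, a hypothetical larger input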

It should be noted that f_{i,j,d^l,d} in Eq. (D.8) can be regarded as the learned weights. These weights are the same for inputs at all positions, which is the "weight sharing" property of the convolution layer. In addition, a bias term b_d is usually added to y_{i^{l+1}, j^{l+1}, d}. During error back-propagation, the learning rates of stochastic gradient descent can be set separately for the weights and the bias terms of this layer. Of course, depending on the requirements of the practical problem, one can also set the bias terms of a certain layer to all zeros, or set the learning rate to 0, so as to fix the bias or the weights of that layer. Additionally, there are two important hyperparameters in the convolutional operation: the filter size and the stride. Appropriate hyperparameter settings bring the desired performance improvements to the final model.

It can be seen that convolution is a local operation: local information of an image is obtained by applying a convolution kernel of a certain size to a local image region. As shown in Fig. D.10, we apply an overall edge filter, a horizontal edge filter and a vertical edge filter to the original image, respectively. These three filters (convolution kernels) are the 3 × 3 convolution kernels K_e, K_h and K_v given in Eq. (D.11):

K_e = \begin{bmatrix} 0 & -4 & 0 \\ -4 & 16 & -4 \\ 0 & -4 & 0 \end{bmatrix}, \quad K_h = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}, \quad K_v = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}.    (D.11)

Just imagine: if there is an object edge at pixel (x, y) of the original image, the pixel values at (x − 1, y), (x + 1, y), (x, y − 1) and (x, y + 1) around it should differ significantly from the value at (x, y). In this case, applying the overall edge filter K_e suppresses image regions where the surrounding pixel values differ little and retains regions where the differences are significant, so that the edge information of the object can be detected. Similarly, horizontal and vertical edge filters such as K_h and K_v^6 retain horizontal and vertical edge information, respectively. In fact, the kernel parameters in a convolutional network are learned through network training. Besides filters resembling the horizontal and vertical edge filters, edge filters of arbitrary orientation can be learned, and a sufficiently complex deep convolutional neural network also contains filters (kernels) that detect color, shape, texture and many other basic patterns.

6 K_h and K_v are actually called the Sobel operators or Sobel filters in digital image processing.
7 The "combining" operation in a convolutional neural network can be realized by operations such as the pooling layers and the nonlinear mapping layers introduced later.


Fig. D.10 Examples of the convolutional operations in Eq. (D.11): (a) original image; (b) overall edge filter K_e; (c) horizontal edge filter K_h; (d) vertical edge filter K_v

By "combining"^7 these filters (kernels) and following the subsequent operations of the network, these basic and general patterns are gradually abstracted into "concept" representations with high-level semantics, which in turn correspond to specific sample categories.
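To see these filters in action, the sketch below (assuming NumPy and SciPy are installed; the 32 × 32 test image is synthetic and not the image of Fig. D.10) applies the three kernels of Eq. (D.11). Since the "convolution" of Eq. (D.8) is an element-wise multiply-and-sum without flipping the kernel, scipy.signal.correlate2d is used rather than convolve2d.

import numpy as np
from scipy.signal import correlate2d

K_e = np.array([[0, -4,  0], [-4, 16, -4], [ 0, -4,  0]])   # overall edge filter
K_h = np.array([[1,  2,  1], [ 0,  0,  0], [-1, -2, -1]])   # horizontal edge filter (Sobel)
K_v = np.array([[1,  0, -1], [ 2,  0, -2], [ 1,  0, -1]])   # vertical edge filter (Sobel)

# Synthetic test image: dark background with a bright square in the middle.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0

for name, K in [("overall", K_e), ("horizontal", K_h), ("vertical", K_v)]:
    response = correlate2d(img, K, mode="valid")
    # Strong responses appear only along the edges of the bright square.
    print(name, "max |response| =", np.abs(response).max())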

D.5.2 Pooling Layers

This section discusses the case in which the operation of layer l is pooling. The commonly used pooling operations are average-pooling and max-pooling. It should be pointed out that, unlike the convolution layers, a pooling layer contains no parameters to be learned; when using it, one only needs to specify hyperparameters such as the pooling type (average, max, etc.), the kernel size of the pooling operation and the stride of the pooling operation. Following the notation of the previous section, the pooling kernel of layer l can be expressed as p^l ∈ R^{H × W × D^l}. At each position, average (max) pooling takes the average (maximum) of all values in the region covered by the pooling kernel as the pooled result:

Average-pooling:  y_{i^{l+1}, j^{l+1}, d} = (1 / (H W)) \sum_{0 ≤ i < H, 0 ≤ j < W} x^l_{i^{l+1} × H + i, j^{l+1} × W + j, d} ,

Max-pooling:  y_{i^{l+1}, j^{l+1}, d} = \max_{0 ≤ i < H, 0 ≤ j < W} x^l_{i^{l+1} × H + i, j^{l+1} × W + j, d} .
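A minimal NumPy sketch of both operations is given below (assuming a non-overlapping pooling window, i.e., the stride equals the kernel size, and a made-up 4 × 4 single-channel input); each H × W window of every channel is reduced to a single value.

import numpy as np

def pool2d(x, H, W, mode="max"):
    # Non-overlapping pooling of a (H_l, W_l, D_l) tensor with an H x W window.
    H_l, W_l, D_l = x.shape
    H_out, W_out = H_l // H, W_l // W
    y = np.zeros((H_out, W_out, D_l))
    for i in range(H_out):
        for j in range(W_out):
            window = x[i * H:(i + 1) * H, j * W:(j + 1) * W, :]
            if mode == "max":
                y[i, j, :] = window.max(axis=(0, 1))
            else:  # average pooling
                y[i, j, :] = window.mean(axis=(0, 1))
    return y

x = np.arange(16, dtype=float).reshape(4, 4, 1)   # one channel, values 0..15
print(pool2d(x, 2, 2, "max")[..., 0])   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, 2, 2, "avg")[..., 0])   # [[ 2.5  4.5] [10.5 12.5]]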